Improving Performance and Reliability: Edge Golang Overhaul

We've just implemented a new version of our edge router (aka "Styx") that handles many millions of dynamic requests per day. The new version has been routing traffic successfully for a few weeks and appropriately responding to errors in networking and on application servers.

Why is this important? Our edge routes requests to application containers in the runtime matrix, dynamically, wherever they may be at the moment. It also routes around inaccessible application containers. This overhaul provides improved performance and reliability on the Pantheon platform.

What We Were Using Before

Our previous version of the edge was based on node.js. This version served us well over the years, but had a few drawbacks.

  1. As we added code to manage sites with more complex requirements, we started falling into node.js “callback hell.”

  2. We developed tactics to mitigate problems caused by networking errors and unresponsive application servers, but these tactics left much to be desired.

  3. Increasing the complexity to further improve uptime and performance led us back to problem #1.

Components of Success

Any major overhaul of a critical system is risky. We succeeded because we put measures in place to gain confidence before releasing to customers, to roll out in a controlled fashion with a way to roll back if issues arose, and to increase visibility into the new "StyxGo" in operation so that we could confirm it was working properly or detect and remediate issues quickly.

What we used:

  • Enhanced testing
  • Development environment
  • Controlled rollout
  • Tailored logging
  • Multi-faceted graphs
  • Profiling with pprof

Testing: The new version of Styx is the best-tested component in the Pantheon stack. We wrote dozens of tests that run in CircleCI on each commit; we also run them independently and add to them with each change to the software. According to coveralls.io, our tests provide 86% code coverage.

Development Environment: We maintain a multiform development environment which increased our confidence in our code before we deployed. This development environment includes a ‘onebox’ environment which contains all the elements of the Pantheon platform in one sandbox, allowing us to make code changes, build a new binary, and execute while monitoring logs for expected or unexpected behavior.

We also have a set of pre-production components in our production environment. These production-yet-not-production components allow us to inject errors and see the results in a real environment without adversely affecting customers. We used this environment, for instance, to perform server reboots while analyzing Styx’s failure detection mechanism.

Controlled rollout: We architected a mechanism for a controlled rollout to our production environment in a way that largely mitigated the risk of overhauling the primary edge router. As part of this architecture, we configured edge servers to contain both the original node.js "Styx" version and the new Go version. By tweaking the edge cache layer, which sits "in front" of the routing layer, we were able to do two things: route a list of specific domains to the new version on the box, and use a cache director to split traffic on a percentage basis between the two versions.
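
The split itself lived in the cache layer, not in Go. Purely as an illustration of the two rules described above, here is a minimal sketch of the decision logic. The backend labels, the function name, and the choice to hash on the host name (so that a given site consistently lands on one version) are assumptions made for the example, not details of the actual configuration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical labels for the two router versions on an edge server.
const (
	legacyBackend = "styx-node"
	newBackend    = "styx-go"
)

// chooseBackend applies the two rollout rules: hosts on the pinned
// list always go to the new router, and every other host is assigned
// to a bucket from 0 to 99 that is compared with the rollout percentage.
func chooseBackend(host string, pinned map[string]bool, percentNew uint32) string {
	if pinned[host] {
		return newBackend
	}
	h := fnv.New32a()
	h.Write([]byte(host)) // hashing keeps the choice stable per host
	if h.Sum32()%100 < percentNew {
		return newBackend
	}
	return legacyBackend
}

func main() {
	pinned := map[string]bool{"canary.example.com": true}
	hosts := []string{"canary.example.com", "a.example.com", "b.example.com"}
	for _, host := range hosts {
		fmt.Printf("%s -> %s\n", host, chooseBackend(host, pinned, 10))
	}
}
```

Raising the percentage in small steps, and dropping it back to zero if something looks wrong, is what makes this kind of rollout easy to reverse.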

Logs: We constructed a logging module that collected individual entries on a request as it was processed, but only wrote to the logs if an anomaly was detected, in which case the entries were compiled as a single log entry for easier tracking. This also reduced the amount of logging to enable us to more easily focus on the anomalies.
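
The idea can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the Styx logging module: the `requestLog` type, its method names, and the output format are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// requestLog buffers the entries for a single request in memory.
// (Hypothetical type, for illustration only.)
type requestLog struct {
	id      string
	entries []string
	anomaly bool
}

func newRequestLog(id string) *requestLog { return &requestLog{id: id} }

// Add records an ordinary processing step.
func (l *requestLog) Add(format string, args ...interface{}) {
	l.entries = append(l.entries, fmt.Sprintf(format, args...))
}

// Flag records a step and marks the whole request as anomalous.
func (l *requestLog) Flag(format string, args ...interface{}) {
	l.anomaly = true
	l.Add(format, args...)
}

// Line returns the combined log entry, and false if no anomaly was
// flagged and the request should therefore stay out of the logs.
func (l *requestLog) Line() (string, bool) {
	if !l.anomaly {
		return "", false
	}
	return "request=" + l.id + " " + strings.Join(l.entries, " | "), true
}

// Flush writes the combined entry to w only for anomalous requests.
func (l *requestLog) Flush(w io.Writer) {
	if line, ok := l.Line(); ok {
		fmt.Fprintln(w, line)
	}
}

func main() {
	ok := newRequestLog("req-1")
	ok.Add("backend=10.0.0.5 status=200")
	ok.Flush(os.Stdout) // healthy request: nothing is written

	bad := newRequestLog("req-2")
	bad.Add("backend=10.0.0.5 dial started")
	bad.Flag("backend=10.0.0.5 dial timeout")
	bad.Add("retry backend=10.0.0.6 status=200")
	bad.Flush(os.Stdout) // all three steps appear as one line
}
```

Because the steps recorded before the anomaly are kept in the buffer, the single entry shows what led up to the failure as well as what followed it.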

Graphs: We built a large number of graphs which updated continually during execution on our graph dashboard and provided visibility into the health of the system and potential trouble spots as the new code executed. In addition to graphs measuring the more usual request attempts, cache statistics, and errors, we also measured memory, heap usage, and garbage collection statistics.
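
For the memory, heap, and garbage collection numbers, the Go runtime exposes most of what is needed through `runtime.ReadMemStats`. The sketch below samples a handful of values that could be sent to a graphing system; the metric names and the `runtimeGauges` helper are assumptions for the example, and the code that ships the values to a dashboard is left out.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
)

// runtimeGauges samples the Go runtime and returns a few values that
// are useful to graph over time. The metric names are hypothetical.
func runtimeGauges() map[string]uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return map[string]uint64{
		"heap_alloc_bytes":  m.HeapAlloc,    // bytes of live heap objects
		"heap_objects":      m.HeapObjects,  // number of live heap objects
		"sys_bytes":         m.Sys,          // memory obtained from the OS
		"gc_runs":           uint64(m.NumGC), // completed GC cycles
		"gc_pause_total_ns": m.PauseTotalNs, // cumulative GC pause time
		"goroutines":        uint64(runtime.NumGoroutine()),
	}
}

func main() {
	g := runtimeGauges()
	names := make([]string, 0, len(g))
	for name := range g {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order
	for _, name := range names {
		fmt.Printf("%s %d\n", name, g[name])
	}
}
```

In a long-running service a sampler like this would typically run on a timer, since `ReadMemStats` briefly pauses the program and is best not called on every request.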

Profiling: Finally, we developed methods of profiling using pprof for CPU and heap analysis to give us deeper insight into issues as our other mechanisms revealed them. This profiling was critical to finding the last of the issues that were holding us back from complete deployment.
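
As a minimal sketch of what CPU and heap profiling look like with the standard `runtime/pprof` package (the helper names and the stand-in workload are assumptions for the example, not our actual tooling):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// profileCPU records a CPU profile into w while work runs.
func profileCPU(w io.Writer, work func()) error {
	if err := pprof.StartCPUProfile(w); err != nil {
		return err
	}
	defer pprof.StopCPUProfile() // flushes the profile to w
	work()
	return nil
}

// writeHeapProfile writes a snapshot of heap allocations to w.
func writeHeapProfile(w io.Writer) error {
	runtime.GC() // collect first so the snapshot reflects live objects
	return pprof.WriteHeapProfile(w)
}

// captureProfiles profiles a stand-in workload and returns the size
// in bytes of the CPU and heap profiles it produced.
func captureProfiles() (cpuBytes, heapBytes int, err error) {
	var cpu, heap bytes.Buffer
	err = profileCPU(&cpu, func() {
		sum := 0
		for i := 0; i < 50000000; i++ {
			sum += i % 7
		}
		_ = sum
	})
	if err != nil {
		return 0, 0, err
	}
	if err = writeHeapProfile(&heap); err != nil {
		return 0, 0, err
	}
	return cpu.Len(), heap.Len(), nil
}

func main() {
	cpuBytes, heapBytes, err := captureProfiles()
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("cpu profile: %d bytes, heap profile: %d bytes\n", cpuBytes, heapBytes)
}
```

In practice the profiles would be written to files rather than in-memory buffers and then opened with `go tool pprof`. For a long-running server, importing `net/http/pprof` is a common alternative: it registers handlers under `/debug/pprof/` so that profiles can be pulled from a live process.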

Observations

We’ve been quite happy with our decision to rewrite our edge in Go, with the standard Go libraries, and with the third-party packages we included (Gary Burd’s redigo, CoreOS’s go-systemd, Matt Reiferson’s go-httpclient, Peter Bourgon’s g2s). Go has allowed us to write well-crafted code and has given us insight into and control over more aspects of the routing process than we felt we had with node.js.

One of the unsung heroes in the process was our attention to mechanisms to quickly roll back as we monitored new rollouts and detected issues. Remediation was a big part of our success.

Creating a monitoring infrastructure with logs, graphs, and profiling was critical to rolling out new versions and assuring ourselves that the rollout was succeeding.

This release will increase performance and uptime for all customers on the platform, another example of how we're constantly innovating to build a better internet for everyone. 
