Scheduled Maintenance Windows (and how to avoid them altogether)

Hosting is dead, and so are downtime windows for web-server operating system upgrades. On some days at Pantheon, as part of normal infrastructure operations, we spin up or kill dozens of servers and migrate thousands of websites with no customer downtime. Fun fact: the average age of the 250+ servers in our fleet is currently 52 days (see graph below).

How is this possible?

We architected the Pantheon Platform around next-generation technologies and methodologies. One core open-source technology behind our platform is the Fedora distribution of Linux. Fedora provides access to some of the advanced features that fuel our vision for the power of containerized platforms. In addition to cutting-edge packages, Fedora provides another amazing (yet counter-intuitive) benefit: its tight release cycle.

Fedora releases faster than any other major server-centric distribution, with new major versions roughly every six months and only one prior release supported at a time. This rapid release cycle has earned Fedora the informal slogan "end-of-life is a way of life": unless you truly embrace the release rate, deploying Fedora can indeed cause instability.

Avoiding instability is the impulse behind traditional big-time *NIX's lengthy lifecycles. Sun Microsystems initially set the bar with a decade of support for Solaris, for which Oracle now offers "indefinite" support. RHEL's full lifecycle lasts for 13 years, while Ubuntu's LTS (long term support) edition is guaranteed for five.

At first glance, a long lifecycle seems to be a beneficial and stable choice. However, after the five (or 13, or 20, or...) years are up, there's a massive gap between the technologies and versions on the old production servers and those in current distributions. Five years in internet time is enough for a paradigm shift and a few iterations on implementation. If Apple offered a 'very stable' iPhone that you couldn't upgrade for five years, would you buy it?

Andy Grover, Principal Software Engineer at Red Hat, proposed a workflow for exploiting Fedora's release cycle:

"In a world where instances are deployed constantly, instances are born and die but the herd lives on. Once everyone has their infrastructure encoded into a configuration management system, Fedora's short release cycle becomes much less of a burden. If I have service foo deployed on a Fedora X instance, I will never be upgrading that instance. Instead I'll be provisioning a new Fedora X+1 instance to run the foo service, start it, and throw the old instance in the proverbial bitbucket once the new one works."

Operating system upgrades are one important reason to migrate to new servers, but there are many more. For example, Rackspace recently released its next-generation public cloud infrastructure (http://www.rackspace.com/cloud/openstack/), with many improvements and new features (standardized APIs, less noisy-neighbor impact, etc). Pantheon was able to take advantage of this because of our agile infrastructure and our experience with regularly migrating both customer and platform resources to new nodes.

Rackspace, like other infrastructure providers, will inevitably roll out further improvements, and even new hardware, that offer improved stability, new features or increased performance. We work hard to engineer solutions that allow us to take advantage of these improvements as quickly as possible.

Similarly inevitable are degraded nodes, especially in the cloud. It could be poor network performance, noisy neighbors, bad disk I/O or a number of other factors. The faster we can detect and replace these instances, the sooner we can restore optimal service.
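As a rough illustration of the detection step described above, here is a minimal Python sketch that flags nodes whose metrics cross a threshold. The metric names, the threshold values and the find_degraded function are all hypothetical, chosen for this example rather than taken from Pantheon's actual monitoring.

```python
def find_degraded(metrics, max_latency_ms=50.0, min_disk_mbps=40.0):
    """Return the names of nodes that breach either threshold, sorted.

    Both thresholds are assumed values for illustration only.
    """
    degraded = []
    for name, sample in metrics.items():
        too_slow_network = sample["latency_ms"] > max_latency_ms
        too_slow_disk = sample["disk_mbps"] < min_disk_mbps
        if too_slow_network or too_slow_disk:
            degraded.append(name)
    return sorted(degraded)


# Made-up sample data for three hypothetical nodes.
samples = {
    "app1": {"latency_ms": 12.0, "disk_mbps": 95.0},
    "app2": {"latency_ms": 180.0, "disk_mbps": 90.0},  # poor network
    "db1": {"latency_ms": 9.0, "disk_mbps": 11.0},     # bad disk I/O
}
to_replace = find_degraded(samples)
print(to_replace)
```

In a setup like this, every name in to_replace would be queued for replacement by a fresh instance rather than repaired in place.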

Finally, we often face the challenge of addressing a potential Linux kernel security vulnerability. The traditional workflow for upgrading kernels is to schedule a maintenance window with the affected customers, upgrade the kernel on the system, reboot the server, perform QA after reboot and communicate the status to impacted parties.

At Pantheon, we simply launch new instances with the kernel upgrades, initiate a seamless migration of customer resources to the new endpoint, and de-provision the old instance once it is fully drained of active resources. This migration process occurs with zero downtime for stateless PHP workers (DROPs), and a few seconds of downtime for MySQL server processes.
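The launch, migrate, drain and de-provision sequence can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not our production tooling: the Instance class, the replace_instance function and the kernel version strings are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical server record; the fields are illustrative only."""
    name: str
    kernel: str
    resources: list = field(default_factory=list)


def replace_instance(old, patched_kernel, fleet):
    """Launch a patched replacement, drain the old instance, then retire it."""
    # 1. Launch a new instance that already runs the upgraded kernel.
    new = Instance(name=old.name + "-next", kernel=patched_kernel)
    fleet.append(new)

    # 2. Migrate resources one at a time, so each always lives on one node.
    while old.resources:
        new.resources.append(old.resources.pop())

    # 3. De-provision the old instance only once it is fully drained.
    fleet.remove(old)
    return new


fleet = [Instance("app1", "3.6.1", ["site-a", "site-b", "site-c"])]
replacement = replace_instance(fleet[0], "3.6.2", fleet)
print([i.name for i in fleet], sorted(replacement.resources))
```

The key design choice is that the old instance is never upgraded or rebooted. It keeps serving until its last resource has moved, which is what removes the need for a maintenance window.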

This graph shows the average age of Application and Database endpoints in our fleet over two weeks. The large drop is the result of mass migrations for Fedora upgrades and kernel patches. Even sites that have been running successfully on Pantheon for two years are actually served off servers provisioned this summer or even this morning.
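To make the metric concrete, here is a small Python sketch of how an average fleet age could be computed, and why a mass migration makes it drop. The provisioning dates are made up for the example, and average_age_days is a hypothetical helper, not part of our tooling.

```python
from datetime import date


def average_age_days(provisioned_dates, today):
    """Mean age in days of a fleet, given each server's provisioning date."""
    ages = [(today - d).days for d in provisioned_dates]
    return sum(ages) / len(ages)


today = date(2012, 9, 30)

# Made-up fleet before a migration: three servers from early summer.
before = [date(2012, 6, 1), date(2012, 7, 1), date(2012, 8, 1)]

# The same fleet after two of the three servers were replaced this week.
after = [date(2012, 9, 28), date(2012, 9, 29), date(2012, 8, 1)]

print(average_age_days(before, today), average_age_days(after, today))
```

Replacing two of three servers is enough to pull the average down sharply, which is the same shape as the large drop in the graph.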

Pantheon provides a highly consistent, automated platform backed by cutting-edge technologies and agile infrastructure. There are certainly challenges associated with this technical strategy, but we find them to be well worth it to take advantage of the many benefits.

Next time you get an email about a kernel vulnerability, a degraded cloud instance, or your Linux distribution reaching end-of-life, ask yourself, "Can't somebody else deal with this?" We can. We're Pantheon. We've got your back.
