Mitigating Meltdown

TL;DR: we beat Meltdown (for now), and thanks to our platform engineers’ work, some Pantheon customers are running faster than before the patches were applied. Spectre is up next.

Meltdown and Spectre are the biggest vulnerabilities ever disclosed, affecting applications on virtually every operating system and device, from mobile to desktop to cloud. We’re not out of the woods yet, but we wanted to share the details of this story for the benefit of our customers, and because our findings about certain kernel settings may be of use to others with similar workloads.

Pictured above: the New Relic equivalent of a mic drop.

Pantheon’s heroic infrastructure is made of much more than technology. A badass self-healing container fleet is great, but it still needs pilots. We have one of the best teams in the business.

If you’re a systems engineer and want to work with this kind of team, we’re hiring. If you’re a site owner and want this kind of team to have your back, we’ll help you migrate your sites to our platform.

Incident Retrospective

First, the retrospective. Much of this information was posted to our status page in real time as we dealt with the issues, but an aggregate overview is called for. Also, our findings regarding the Transparent Huge Pages (THP) kernel setting are new and noteworthy.

When the Meltdown vulnerability was disclosed by researchers at Google, we were happy to hear that our underlying infrastructure provider (also Google) had already rolled out patches at the bare-metal level. That kind of speed in response is one of the many reasons we moved to the Google Cloud Platform, so it was nice to see it confirmed.

From there, our team immediately set to work patching the underlying servers for our PHP container matrix, where customer CMS code runs. This is the area of our platform most vulnerable to malicious exploit, so it was urgent to secure. We were able to achieve this with zero customer downtime, and also scaled our fleet of servers to provide additional capacity in anticipation of an expected performance hit.

The nature of Meltdown means that securing against it entails a performance hit. Intel itself has reported up to a 25% performance regression on recent hardware. We were planning for something similar.

However, after applying the KPTI patch, which blocks Meltdown attacks, we found that certain workloads within customer sites slowed down far more than expected, leading to volatile performance. Even with the extra fleet capacity, load spikes were happening frequently.

The THP Discovery

After collecting data from customers via our support team using New Relic, our platform engineers took a look at things at the next level down using the perf kernel tool. In doing so, we discovered what we believe to be a significant performance bottleneck in Linux’s Transparent Huge Pages (THP) after patching. Disabling THP immediately returned performance to expected levels, and in some cases left sites faster than they’d been prior to patching.

This was unexpected. THP is a kernel feature designed to improve memory efficiency by backing memory with larger pages. On its face, it has nothing to do with the transitions between user space (where applications run) and the kernel, which is the boundary KPTI is intended to secure.

However, the patch does re-work some internals to secure the user/kernel barrier, and we speculate that this has introduced a performance regression, specifically around collection/defragmentation of memory. More research is warranted here, as this regression may be preventable with additional work.

Without THP, Pantheon’s systems consume more memory overall, which will factor into our infrastructure planning going forward. We’re happy to make this tradeoff to deliver stable and speedy CPU performance for our customers.
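For other operators who want to check or experiment with this on their own Linux hosts, the THP knobs live in sysfs. Here is a minimal sketch, assuming root access and the standard sysfs paths; exact values vary slightly by distribution, and the change does not persist across reboots (persisting it typically means a transparent_hugepage=never boot parameter or a boot-time unit). This is not the exact tooling we used.

```python
# Minimal sketch: inspect and disable Transparent Huge Pages (THP) via sysfs.
# Assumes root and the standard paths; test carefully on your own hosts.

THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"
THP_DEFRAG = "/sys/kernel/mm/transparent_hugepage/defrag"

def current(path):
    # The kernel reports the active value in brackets, e.g. "always madvise [never]"
    with open(path) as f:
        return f.read().strip()

def disable_thp():
    # Writing "never" turns off THP and its defragmentation pass until reboot.
    for path in (THP_ENABLED, THP_DEFRAG):
        with open(path, "w") as f:
            f.write("never")

if __name__ == "__main__":
    print("before:", current(THP_ENABLED))
    disable_thp()
    print("after: ", current(THP_ENABLED))
```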

We still have a ways to go: more patches are expected, and potentially even the recompilation of many packages, to mitigate Meltdown’s sister vulnerability, Spectre. Fully mitigating both is a multi-month endeavor. We remain committed to delivering the absolute pinnacle of both speed and security for all our customers.

The Story From Inside

The beginning of 2018 was one of the wilder fortnights in Pantheon engineering-land. Beyond the facts of the incident retrospective, we also wanted to share the story of how we handled it as the team experienced things. I (the author) was only an outside spectator to the mitigation, which mostly played out over our internal infrastructure channel on Slack, but I was mightily impressed and felt the story should be told.

There were some indications as early as mid-December that “something was coming,” based on chatter from the mailing lists, Reddit, and people starting to dissect patches that were being passed around for testing and review. We had no idea when, but it was clear that we might be looking at having to pull a barrel roll (a platform-wide refresh) sometime in January.

But this isn’t our first rodeo. In addition to the tooling we’ve developed to handle day-to-day operations for hundreds of thousands of live containers, over the past six years we’ve also created a number of emergency playbooks and procedures for when rapid maneuvers are required. We’re confident at this point that we can turn the whole fleet over (hence “barrel roll”) with minimal customer impact.

We deploy patches on a rolling basis, setting up clean container servers and draining application workload off the unpatched systems to eliminate downtime. For this fix, we also upgraded the kernel to have access to the best patches available rather than going with a long-term maintenance line.
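The orchestration itself is internal to our platform, but the general shape of that drain-and-replace loop is simple. The sketch below is illustrative only; the provision/drain/retire helpers are hypothetical stand-ins, not Pantheon’s actual tooling.

```python
# Illustrative sketch of a rolling "barrel roll": the helper functions here are
# hypothetical placeholders for platform-specific orchestration, not real APIs.

def barrel_roll(unpatched_hosts, provision_patched_host, drain, retire):
    for old_host in unpatched_hosts:
        new_host = provision_patched_host()      # bring up a clean, patched server
        drain(source=old_host, target=new_host)  # migrate containers with no downtime
        retire(old_host)                         # only remove the old host once empty
```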

To explain the impact of the rollout, it’s important to understand that Pantheon’s day-to-day operations are managed semi-autonomously. Our fleet of servers runs a suite of services we lovingly call “Dragons,” which ship workload away from areas with memory or CPU pressure to keep the whole platform balanced. Trying to coordinate this manually at our scale would be impossible, so we don’t.

The immediate effect of rolling out the patches was that our primary load metric went up about 20%. We have a composite internal number that includes CPU utilization, memory, network, disk usage, and disk performance, and it went from a median of 150 to 180, which is right smack in the range we were expecting. The plan was to mitigate this by growing the fleet. But after the extra servers were online, the load didn’t go down.
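To make “composite internal number” a little more concrete: a score like this is essentially a weighted sum of normalized resource metrics. The weights and inputs below are invented for illustration and are not our actual formula.

```python
# Invented example of a composite load score; not Pantheon's real formula.

def composite_load(cpu, mem, net, disk_usage, disk_latency):
    # Each input is normalized so that 1.0 is roughly the level we plan around.
    weights = {"cpu": 0.35, "mem": 0.25, "net": 0.15,
               "disk_usage": 0.10, "disk_latency": 0.15}
    metrics = {"cpu": cpu, "mem": mem, "net": net,
               "disk_usage": disk_usage, "disk_latency": disk_latency}
    return 100 * sum(weights[k] * metrics[k] for k in weights)

print(composite_load(1.6, 1.5, 1.2, 1.4, 1.8))  # ~152, i.e. in the 150-180 band
```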

In a post-Meltdown world, even after adding capacity, the “Dragons” are going wild. We’re seeing over a hundred container migrations a minute, while loads are continuing to spike across the fleet. Instead of a stable-but-slightly-slower platform, which we were prepared to accept in the short term in the name of security, we have unpredictable performance, and are apparently unable to mitigate it by throwing hardware at the problem.

At the same time, our support team is handling a wave of chats and tickets from customers about poor site speed, with super-spikey New Relic graphs to prove it. Same in our #power-users Slack channel. Even our own web team says they’re getting some really “un-Pantheon” performance while trying to make updates to pantheon.io (which is a Drupal site we run on our own platform, naturally).

As we start digging into the detailed traces, patterns begin to emerge. PHP’s Redis library comes up as a common culprit. Not Redis the server (which is fine), but the PHP side of it specifically. We have a few different theories; the first is that some kind of very high-frequency loop in that code (maybe the gzip stuff?) is being hit especially hard by the KPTI/Meltdown patch.

At this point we can’t get any deeper with New Relic, since it only instruments down to the PHP function level and can’t tell us what’s going on underneath that suddenly made those functions slow. The team starts looking at internals with perf (a Linux profiling tool in the same family as strace or dtrace) to see what’s going on at the system-call level, leading to the epiphany:

“Why is this page fault on a single PHP FPM thread eating 51% of the CPU?”
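For readers who haven’t used perf, the workflow that surfaces a stack like that is roughly: attach to a hot process, sample on-CPU call graphs for a while, then read the report. The commands below are a generic sketch (the PID is a placeholder), not our exact invocation.

```python
# Generic perf sampling sketch; the PID is a placeholder, not a real process.
import subprocess

pid = "12345"  # hypothetical php-fpm worker that is burning CPU

# Record on-CPU samples with call graphs for ~30 seconds.
subprocess.run(["perf", "record", "-g", "-p", pid, "--", "sleep", "30"], check=True)

# Summarize where the time went, including kernel symbols such as the
# page-fault and memory-compaction paths that stood out for us.
subprocess.run(["perf", "report", "--stdio"], check=True)
```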

Deploying The Fix

The syscall stacks showed a lot of memory page faults. This is in and of itself not abnormal. When there’s no room in the current memory page, the collector scans to see where it can allocate, and if it can’t find space it defragments memory to make a new page.

On a system with a lot of active threads (like a Pantheon container fleet server), this kind of memory re-allocation happens a lot in kernel-land. That’s normal. Post-patching, it started taking a lot more CPU, which isn’t.
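One way to watch this from user space is to poll the THP and compaction counters the kernel exposes in /proc/vmstat. Below is a minimal sketch, assuming those counters are present on your kernel (names can vary slightly by version).

```python
# Sketch: watch THP allocation and memory-compaction activity via /proc/vmstat.
import time

COUNTERS = ("thp_fault_alloc", "thp_collapse_alloc", "compact_stall", "compact_fail")

def read_vmstat():
    stats = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in COUNTERS:
                stats[key] = int(value)
    return stats

before = read_vmstat()
time.sleep(10)
after = read_vmstat()
for key in COUNTERS:
    print(f"{key}: +{after.get(key, 0) - before.get(key, 0)} over 10s")
```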

That increase initially didn’t make any sense to us, since it seemed orthogonal to the Meltdown vulnerability and the KPTI patch. However, following the data, we tested disabling Transparent Huge Pages (THP) and performance improved immediately. We also checked whether our kernel minor-version upgrade mattered; the outcome was the same. With Meltdown patched via KPTI, the THP collector seems to have a real performance regression.

And if that’s the case, the functions that stood out in New Relic start to make a lot more sense. We saw PHP frequently “hanging” on calls like redis_get_multiple or mysql_fetch_assoc, or even wp_load and drupal_load. The common thread is that they all involve memory I/O, whether that’s data coming in over the network from a database or an opcode cache being used to execute application code.

Usually, moving things around in memory is among the fastest things you can do with a computer. It turned out that with THP enabled under our workload, that was no longer the case.

Since disabling THP, we are actually seeing many customers enjoying faster overall CPU performance. It seems that even at the “old” level of fast performance, the overhead of all that huge-page collection and defragmentation was material. The tradeoff of increased system memory usage on our end means we’ll need to plan capacity differently in 2018, but this is an easy call for us: customer experience wins, hands down.

Finally, while we have a happy outcome from Meltdown, there’s still a long road ahead with Spectre, which is currently being batted around between OS developers and Intel. No matter how it shakes out, the mitigations will quite likely be more complex than a simple patch, and could take months to complete. Stay tuned to our Status Page for ongoing updates on our progress.
