How we patched Heartbleed for 60,000+ Drupal & WordPress sites in 12 hours

Heartbleed is a perfect storm for Internet security. Almost every single Internet-facing server using an unpatched OpenSSL library is exploitable. By some estimates, two-thirds of all websites on the Internet were or still are affected.

Thanks to our unified, container-based infrastructure, we patched Heartbleed for 60,000+ Drupal and WordPress sites on Pantheon less than twelve hours after the bug was first announced. We were fully patched by 10:15 PM on April 7th. Our systemd-based container architecture let us update the OpenSSL library and cleanly restart exactly the right services with close to zero customer impact.

Here’s how we did it (reusable code snippets are included below).

 

Preparation

For an attack like Heartbleed, there are only a few things you can do ahead of time, but these became critical to our real-time response.

First and foremost, we chose a Linux distribution with a good security team. We use Fedora, which has a great security team but a support lifecycle too short to recommend in general. Pantheon worked with Fedora’s security team, first to develop a package that disables heartbeat and then to help QA the main release. Ubuntu and Debian both had rapid responses and offer longer-term support. The attack surface of a modern server is so large that you don’t want getting the fix for a vulnerability to be your first challenge.

We had good general security practices in place. We use defense in depth and we don’t pack too many operations into the same daemon. On Pantheon, HTTPS termination is isolated from most application code.

We also had a lot of automation in place (Chef and Fabric) which helped us deploy our changes quickly. Our entire response and deployment team for this project was two engineers for twelve hours: our Systems Engineer Kyle Ibrahim and me.

 

The Fix

Distributing the patched OpenSSL package

While our upstream published a way to get the fresh packages without waiting on mirrors, we were worried about hitting that source too hard, since our fleet consists of hundreds of servers. Instead, we pushed the updated packages to Amazon S3 and used Chef to roll them out to our servers.

But that’s the easy part. The hard part is ensuring that everything that has loaded the old OpenSSL gets restarted. Most people just reboot, but not Pantheon.

Safe, idempotent deployment leveraging systemd

It’s pretty easy on Linux to identify processes with a stale shared library. The process maps will indicate which files have been deleted but are still loaded. Here’s a snippet that does that, courtesy of Corsac:

grep -l 'libssl.*deleted' /proc/*/maps | tr -cd 0-9\\n | xargs -r ps u

That’ll get you the executable of what’s running, but it doesn’t provide any way to cleanly restart what you find unless you compile your own ruleset mapping executable names to services. This would be especially complicated if, like Pantheon, you may run 5000 nginx instances on a box.
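As a starting point for scripting the same check, here is a minimal Python sketch of the maps scan. It is an illustration rather than Pantheon's code: the function names `stale_library_paths` and `pids_with_stale_library` are hypothetical, and it assumes the standard `/proc/<PID>/maps` layout, where the sixth field is the mapped path and a replaced file is suffixed with `(deleted)`.

```python
import glob


def stale_library_paths(maps_text, library="libssl"):
    """Return mapped paths in maps_text that match `library` and are marked deleted."""
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (may be absent).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no path
        path = fields[5]
        if library in path and path.endswith("(deleted)"):
            stale.add(path)
    return stale


def pids_with_stale_library(library="libssl", proc_root="/proc"):
    """Map each PID to the stale copies of `library` it still has loaded."""
    found = {}
    for maps_path in glob.glob(f"{proc_root}/[0-9]*/maps"):
        pid = int(maps_path.split("/")[-2])
        try:
            with open(maps_path) as handle:
                text = handle.read()
        except OSError:
            continue  # process exited, or we lack permission to read it
        paths = stale_library_paths(text, library)
        if paths:
            found[pid] = paths
    return found


if __name__ == "__main__":
    for pid, paths in sorted(pids_with_stale_library().items()):
        print(pid, ", ".join(sorted(paths)))
```

Like the one-liner, this needs to run as root to see every process, and it only tells you which PIDs are affected, not how to restart them cleanly.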

We started with the maps approach and borrowed a trick from systemd’s journal, which is how it goes from the PID of a logging daemon to the name of the service:

  1. Load /proc/<PID>/cgroup

  2. Find the one in the systemd tree:

    [straussd@olympian systemd]$ sudo cat /proc/21962/cgroup 
    11:hugetlb:/
    10:perf_event:/
    9:blkio:/
    8:net_cls:/
    7:freezer:/
    6:devices:/
    5:memory:/
    4:cpuacct,cpu:/system/polkit.service
    3:cpuset:/
    2:name=systemd:/system/polkit.service

  3. Parse out the unit (service) name. It’s the part after the last slash.

  4. In our case, apply a whitelist to skip services like MySQL, which loads OpenSSL even though we don’t use its TLS support.

  5. Tell systemctl to restart or reload the service.

  6. Run the check again to verify that the stale library is no longer loaded.
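The steps above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the contents of library_updates.py: the function names and the `SKIP_UNITS` set (including the `mysqld.service` entry) are hypothetical, only the `name=systemd` hierarchy shown in the sample output is handled, and limiting restarts to `.service` units is a choice made here to avoid touching login sessions.

```python
import subprocess

# Hypothetical skip list: units that load OpenSSL but do not use its TLS support.
SKIP_UNITS = {"mysqld.service"}


def unit_from_cgroup(cgroup_text):
    """Steps 2-3: return the unit name from /proc/<PID>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, controllers, path = parts
        if controllers == "name=systemd":
            # The unit name is the part after the last slash.
            return path.rsplit("/", 1)[-1] or None
    return None


def units_to_restart(cgroup_text_by_pid, skip=SKIP_UNITS):
    """Step 4: collapse affected PIDs into a de-duplicated, filtered unit list."""
    units = set()
    for text in cgroup_text_by_pid.values():
        unit = unit_from_cgroup(text)
        if unit and unit.endswith(".service") and unit not in skip:
            units.add(unit)
    return sorted(units)


def restart_units(units, dry_run=True):
    """Step 5: restart each unit via systemctl (only prints when dry_run is set)."""
    for unit in units:
        command = ["systemctl", "restart", unit]
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

To use it, read `/proc/<PID>/cgroup` for each PID that still maps the deleted library, pass the results to `units_to_restart`, call `restart_units`, and then repeat the maps check to confirm nothing stale remains (step 6). Because many PIDs collapse into one unit name, a service with several worker processes is restarted only once.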

Here’s a generic (runnable anywhere with systemd) version of what Pantheon uses now: library_updates.py.

The great thing about this approach

  • Idempotent: run it as often as you like; it will only restart necessary services

  • Accurate: no guessing involved about which daemons are using an affected library

  • Fast: no reboot

  • Clean: does orderly shutdown and startup for the affected services. For socket-activated services, there may not even be a visible interruption in availability.

  • Generic: works for any shared library

 

Final words

There’s a chance that Heartbleed has already allowed an attacker to compromise private keys on Pantheon infrastructure. We’ll be regenerating our keys, and we're reminding our customers to do so, too. With a compromise like this, it's impossible to know whether OpenSSL leaked the key.

At Pantheon, we have a long-term commitment to making websites easy to run and invulnerable to exploits such as Heartbleed. Website developers: we have your back.
