How Do I Know It's Working: Disaster Recovery Edition

An illustration of Disaster Recovery architecture syncing from one Google Cloud data center to another

Pantheon likes to hide implementation details. We think that developers can work faster when they're not overwhelmed with configuration options. We run thousands of sites in Google Cloud with essentially the same MySQL, PHP, and nginx settings for all of them. Only a few variables, such as memory, get larger values as you move to higher plan levels.

One benefit of this strategy is that we can add new features, or make architectural changes, for all sites together. Two years ago Pantheon engineers lifted and shifted nearly our entire infrastructure to Google Cloud. It was a massive engineering effort to migrate thousands of live sites without triggering any downtime. We know it was successful because pretty much nobody noticed. Sites just got faster, more stable, and more reliable.

At the same time as the Google Cloud migration, a different team of engineers added a Global CDN in front of all Pantheon sites. Again, sites got faster, more reliable, and HTTPS came baked in. Developers did not need to change the way they were building sites to take advantage of this feature either. Developers interested in seeing how the more esoteric features work, like Pantheon Advanced Page Cache, need to dig deep.

Now we are rolling out a Disaster Recovery (DR) feature that does all its work under the surface. Again, you don't need to change anything about your site. Just talk to a salesperson to add on DR and suddenly traffic on your site will split across two Google Cloud data centers. The vast majority of traffic will go to a primary data center (let's call it Zone A), and a tiny fraction will occasionally go to a different data center (Zone B) to keep the containers running your site there hot. If there is ever a problem with the health of Zone A, all traffic will be diverted to Zone B while a new Zone C spins up to be the new failover.
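To make that routing behavior concrete, here is a minimal JavaScript sketch of a weighted split with failover. It is not Pantheon's actual routing code, which stays hidden. The zone names, the `pickZone` function, and the 99.5/0.5 weights (borrowed from the rough figure used later in this post) are all assumptions for illustration.

```javascript
// Hypothetical zone table: names, weights, and health flags are illustrative only.
const exampleZones = [
  { name: "zone-a", weight: 0.995, healthy: true },
  { name: "zone-b", weight: 0.005, healthy: true },
];

// Pick a zone for one request.
// `roll` is a number in [0, 1), normally Math.random().
function pickZone(zoneList, roll) {
  const healthy = zoneList.filter((zone) => zone.healthy);
  if (healthy.length === 0) {
    throw new Error("no healthy zone available");
  }
  // Re-normalize weights over the healthy zones so that an unhealthy
  // primary hands its whole share to whatever remains.
  const total = healthy.reduce((sum, zone) => sum + zone.weight, 0);
  let threshold = roll * total;
  for (const zone of healthy) {
    if (threshold < zone.weight) {
      return zone.name;
    }
    threshold -= zone.weight;
  }
  // Guard against floating-point drift on the last bucket.
  return healthy[healthy.length - 1].name;
}
```

Passing `Math.random()` as the roll simulates live traffic. Passing a fixed number makes the behavior easy to check by hand, and flipping `healthy` to `false` on the first zone sends every request to the second.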

That sounds impressive. But how do I know that it is working when the details are hidden? To answer that question, I recently did a couch coding webinar with one of our sales engineers, James Rutherford. I recommend watching the video recording if you're interested in getting the whole story.

Here's a summary of how we showed that DR worked.

How Do I Know There Are Containers?

Pantheon's higher tier offerings have always come with multiple load balanced containers. Disaster Recovery increases complexity by running multiple containers in multiple data centers. To see this in action I stepped through the three environments included on all Pantheon sites: Dev, Test, and Live. For a site with Disaster Recovery enabled, the number of containers increases from one environment to the next. Dev has one container, Test has two containers, and the Live environment has a variable number of containers mirrored across the two data centers.

To confirm that there are containers, I used the Devel module in Drupal to print out a global variable that shows the name of the "application container," as it is called internally. We're getting into "don't try this at home" territory here since we don't recommend ever relying on these specific names over the long term, or short term for that matter. Pantheon is architected in a way that treats these "application containers" like cattle, not pets.

Running debugging code in the Test environment shows two different application container names

Running debugging code like this in the Test environment allows me to see two different application container names as I hit refresh. But what if I wanted to record the number of times each application container is used? If I'm to believe traffic is getting evenly balanced between these containers, I'd like to see a distribution over time.
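Counting how often each container name shows up is a simple tally. Here is a small JavaScript sketch of the idea, assuming the observed names have already been collected into an array. The `tallyContainers` function and the container names are made up for illustration; real names will look different and should never be relied on.

```javascript
// Count how many times each container name appears in a list of observations.
function tallyContainers(names) {
  const counts = new Map();
  for (const name of names) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return counts;
}

// Hypothetical observations, one entry per page refresh.
const observedNames = ["container-1", "container-2", "container-1", "container-1"];
const containerTally = tallyContainers(observedNames);
console.log(Object.fromEntries(containerTally));
```

With evenly balanced traffic and enough requests, the counts should end up close to each other.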

How Do I Know There Are Multiple Containers?

I should mention at this point that I'd like to do this investigation with as little modification to a Drupal site as possible. I'm using the Umami demo profile that comes with Drupal 8, and one way to track data over time is to use the content types that the install profile already includes. What if I could somehow record into article nodes the name of the application container that was used on the article creation form?

Article creation form in Drupal 8

With a terribly ugly preprocess function I'm altering the input element for article tags to simply include the name of the application container used to respond to the request. Now to see the split between application containers, I just need to make a bunch of nodes. Sure, I could do that manually, but since I've already put on my mad scientist hat for this investigation how about I script it? To do that I happened to use a tool called Cypress. One of the handy aspects of Cypress is that it can record video of the processes it runs.
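A scripted version of that node creation might look something like the sketch below. This is not the exact script from the webinar. The `createArticles` helper is hypothetical, and the path and selectors (`/node/add/article`, `#edit-title-0-value`, `#edit-submit`) are assumptions about a stock Drupal 8 article form. It also assumes the session is already logged in with permission to create articles, and that the tags field has been pre-filled by the preprocess function so the script only needs a title.

```javascript
// Create `count` article nodes by driving the Drupal node form.
// `cy` is Cypress's global command object, passed in explicitly so the
// helper can also be exercised outside the Cypress runner.
function createArticles(cy, count) {
  for (let i = 1; i <= count; i += 1) {
    // Assumed path of the article creation form.
    cy.visit("/node/add/article");
    // Assumed selector for the title input on a stock Drupal 8 node form.
    cy.get("#edit-title-0-value").type(`DR test article ${i}`);
    // Assumed selector for the Save button.
    cy.get("#edit-submit").click();
  }
}

// Hypothetical usage inside a Cypress spec file:
// it("creates a batch of articles", () => {
//   createArticles(cy, 50);
// });
```

Each loop iteration loads the form fresh, so each new node picks up whichever application container answered that particular request.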

Now with a script making a ton of nodes, I can look at a View that shows me that indeed I have nodes coming in that are tagged with these two application containers.

A View showing there are nodes coming in that are tagged with the two application containers
How Do I Know That Some Containers Are Getting the Vast Majority of Traffic?

Ok, so far I have not verified the DR part. I've seen that the Test environment has multiple containers in play. Now to see the new part provided by Disaster Recovery, I need to see that some containers are getting the vast majority of traffic, something like 99.5%, while another set of containers gets almost no traffic at all.
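Checking a split like that comes down to grouping container counts by zone and dividing by the total. Here is a rough JavaScript sketch. The `zoneShares` function, the container names, and the container-to-zone mapping are all hypothetical; in practice, which containers sit in which data center is something you would infer from the traffic pattern itself.

```javascript
// Turn per-container counts into per-zone traffic shares (0 to 1).
// `containerZones` maps a container name to a zone label (an assumed mapping).
function zoneShares(countsByContainer, containerZones) {
  const totals = {};
  let grandTotal = 0;
  for (const [container, count] of Object.entries(countsByContainer)) {
    const zone = containerZones[container] || "unknown";
    totals[zone] = (totals[zone] || 0) + count;
    grandTotal += count;
  }
  const shares = {};
  for (const [zone, count] of Object.entries(totals)) {
    shares[zone] = grandTotal === 0 ? 0 : count / grandTotal;
  }
  return shares;
}

// Hypothetical counts of articles tagged with each container name.
const articleCounts = { "container-a1": 600, "container-a2": 395, "container-b1": 5 };
const assumedZones = {
  "container-a1": "zone-a",
  "container-a2": "zone-a",
  "container-b1": "zone-b",
};
console.log(zoneShares(articleCounts, assumedZones));
```

With these made-up numbers, 995 of 1,000 articles land in the first zone, which is the kind of lopsided split DR should produce.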

A diagram showing that some containers are getting the vast majority of traffic, while another set of containers gets almost no traffic at all

I need graphs. And I need a lot of nodes. So on the Live site, I spun up three instances of my Cypress.io script. I used some charting integrations with the Views module. And look at that, the vast majority of articles are getting tagged in one set of application containers, while a small fraction is going to another set.

Charting integrations with Views module showing the vast majority of articles are getting tagged in one set of application containers, while a small fraction is going to another set

Failover

As soon as the failover starts, all the new nodes are getting tagged in the Zone B containers. By looking at nodes created most recently I can see that the number of requests made to Zone B is on its way to surpassing Zone A.
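The same "is Zone B catching up?" question can be asked of the raw data by keeping running totals in creation order. The sketch below assumes each node has been reduced to a creation timestamp and a zone label. The `findCrossover` function and that data shape are hypothetical, not something Drupal or Pantheon provides.

```javascript
// Walk nodes in creation order and return the timestamp at which the
// failover zone's running total first exceeds the primary zone's,
// or null if that never happens.
function findCrossover(nodes, primaryZone, failoverZone) {
  const ordered = [...nodes].sort((a, b) => a.created - b.created);
  let primaryCount = 0;
  let failoverCount = 0;
  for (const node of ordered) {
    if (node.zone === primaryZone) {
      primaryCount += 1;
    } else if (node.zone === failoverZone) {
      failoverCount += 1;
    }
    if (failoverCount > primaryCount) {
      return node.created;
    }
  }
  return null;
}
```

Feeding it only the most recently created nodes mirrors the chart: before the failover the result is `null`, and once Zone B has absorbed enough new traffic the crossover timestamp appears.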

A chart view of nodes created most recently showing the number of requests made to Zone B is on its way to surpassing Zone A

Find Out More

Seeing the deep layers of Pantheon in action like this takes some hacking. We design our platform to keep details like this out of your way. If you would like to find out more, though, and possibly add Disaster Recovery to your site, please contact our sales team.
