Last week we announced New Relic Application Performance Monitoring (APM) Pro as a free add-on for all sites, at all levels of service. This is a huge level up for Pantheon, and fulfills a dream I’ve had for quite some time in terms of the suite of tools we want to provide development teams on the platform.
Pantheon’s strategic value is just that: it’s a platform. You build on top of it. Yes, we have some of the fastest, most secure, and most scalable elastic hosting on the market, but what makes Pantheon different is that we offer a consistent set of environments where teams can develop, test, and deploy all their projects.
New Relic has even more benefit in this context. Our ability to guarantee consistency between environments means developers can see precisely how code will (or won’t) perform and, more importantly, why, in real time. That’s something that’s never been possible until now.
Anatomy of a Performance Emergency
Here’s how it works. The hard truth is that performance regressions often manifest in production. Even if you have a solid testing process this can still happen: maybe because of a change in configuration on the live site, an external API that’s out of your control misbehaving, or just the natural growth of the content footprint hitting a tipping point. There’s plenty of code that works great with a few thousand pieces of content but falls over if you 10x or 100x that number.
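To make the content-growth tipping point concrete, here is a minimal sketch in Python. It is purely illustrative and not taken from any real site or CMS: the function names and the "duplicate titles" task are hypothetical. Both functions return the same answer, but the first does work proportional to the square of the number of items, so 10x the content means roughly 100x the work.

```python
def find_duplicates_quadratic(titles: list[str]) -> set[str]:
    """Check each title against every later title: O(n^2) comparisons."""
    duplicates = set()
    for i, title in enumerate(titles):
        # The slice copies the rest of the list and `in` scans it linearly.
        if title in titles[i + 1:]:
            duplicates.add(title)
    return duplicates


def find_duplicates_linear(titles: list[str]) -> set[str]:
    """Remember seen titles in a set: a single pass, O(n) on average."""
    seen = set()
    duplicates = set()
    for title in titles:
        if title in seen:
            duplicates.add(title)
        seen.add(title)
    return duplicates


if __name__ == "__main__":
    # Hypothetical content set: 1,000 posts, some sharing a title.
    sample = [f"post-{n % 900}" for n in range(1000)]
    assert find_duplicates_quadratic(sample) == find_duplicates_linear(sample)
```

With a few thousand items the difference is hard to notice; at 100x that size the quadratic version is the kind of code that brings a page to a crawl.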
Whatever the case, let’s assume the worst case scenario: your client is on the phone and upset because the site is crawling. What do you do, hotshot?
The status quo is you start sweating bullets, because the problem could literally be anywhere. Often you end up having to finagle access directly to whatever the production infrastructure is, and start tailing logs and adding debugging code in an attempt to isolate the problem. As frequently happens in emergency situations, attempts to debug the problem could introduce further trouble, especially if there’s cowboy coding happening on the live site.
With a little luck you’ll finally be able to pinpoint the issue and make some kind of fix. Maybe you get a permanent win like refactoring a query to be more efficient, but more often you get a temporary fix like just disabling a problematic plugin, sacrificing some functionality to speed the site back up. While the emergency is over, you now have a chunk of net-new work to do figuring out how to get all your features back while keeping performance at an acceptable level.
The Pantheon + New Relic Way
Performance problems are difficult. They’re complex and high risk/high pressure, which is why developers who are good at handling high performance sites are in such high demand. While Pantheon and New Relic can’t make unexpected calls from upset clients any more pleasant, we do make the response a whole heck of a lot better. Once you’ve assured your stakeholders that you’re on the case, the steps to resolution are clear and safe:
Step 1: Dial up New Relic’s application performance monitoring dashboard. This will immediately give you an indication of where and how bad the problem is, whether it’s in the database, code, or external services.
Step 2: Reproduce the issue in a Multidev environment. Because our infrastructure is consistent, it’s nearly impossible to have problems that “only happen in production.” The performance characteristics of every Multidev environment are the same as production, so you can easily replicate the problem in an environment where it’s safe to debug and fix.
Step 3: Release an initial fix. Once you have a change that’s stable and gets performance where it needs to be, you can deploy it using the Pantheon workflow. The benefit is you never have to risk crashing the live website. If your fix was a long-term winner that’s great: you’re done! If not, you can then go back to Multidev and start working on…
Step 4: Develop lasting performance improvements: you can use Pantheon’s Multidev environments to focus in on key interactions, transactions, and problems, iterating on solutions until you have some really big wins for performance. More than just giving you the tools to respond to a live site emergency, this process lets you build lightning-fast sites from the start, guaranteeing a delightful experience for your customers and their users.
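The idea behind Step 1, seeing whether time is going to the database, your code, or external services, can be sketched in a few lines of Python. This is a toy illustration only: it is not New Relic’s agent or API, and the `TimingBreakdown` class and its category names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TimingBreakdown:
    """Accumulate wall-clock time per category (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the breakdown can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def segment(self, category: str):
        """Time the enclosed block and add it to the category total.

        Nested segments are each counted in full, so avoid nesting.
        """
        start = self._clock()
        try:
            yield
        finally:
            self.totals[category] += self._clock() - start

    def slowest(self):
        """Return the category with the most accumulated time, or None."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)


if __name__ == "__main__":
    breakdown = TimingBreakdown()
    with breakdown.segment("database"):
        time.sleep(0.01)   # stand-in for a query
    with breakdown.segment("external"):
        time.sleep(0.05)   # stand-in for a third-party API call
    print(breakdown.slowest(), dict(breakdown.totals))
```

A real APM tool collects this kind of data automatically and in far more detail; the sketch only shows why a per-category breakdown tells you where to look first.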
Like Peanut Butter and Chocolate
Some things just go together. For me, New Relic is the perfect complement to Pantheon, or vice-versa. It’s one of those cases where the whole is more than the sum of the parts.
It’s great to have a platform that gives you unlimited development environments that are guaranteed to be the same as production, but without deep vision into the workings of the CMSs in those environments, you’re limited to what you can see from the outside: acceptance testing, measuring end-user performance, etc. That’s really valuable, and a huge step towards eliminating “it worked on my machine” from your team’s vocabulary, but it still leaves something to be desired.
Likewise, x-ray vision into your code and database performance is amazing to have, but if you’ve only got it in production, or if your development environments have different architecture and performance characteristics, you’re flying blind when it comes to deploying changes. The whole idea that “it’ll be fast in production” when it’s slow in dev has led to many a scuttled deployment. While it’s good to understand your live performance, it’s even better to predict the change you’ll see after the next deploy.
We all know that site performance has a huge impact on success. It factors into SEO, and it’s proven that faster sites get more return visits and deeper engagement from users. With this release, I believe we’ve created the ultimate toolkit for teams to develop and deliver high performance websites. We’re looking forward to leveling up with all of our developers, working together to build a faster, more delightful internet.