The cloud is a noisy place to operate. Any machine can be any place in the data center. Network connections often span multiple switches, and activity from other cloud users can saturate links and cause latency spikes and packet loss. These are the facts of life, and engineering around them is critical to the reliability of clustered cloud infrastructures, and it's how we spend a lot of our time at Pantheon.
I’d like to share some news on this front for Pantheon’s distributed file system, Valhalla. Our latest work includes grace/saint-mode support, batched heavy operations, cache peering, and isolated shards.
Grace and Saint Mode
The first feature was inspired by Varnish’s grace and saint model, and supports reading files and directories during interrupted connectivity. This is critical for bridging short gaps in networking service, and is deployed to all Pantheon DROPs as of this week.
Since Valhalla 4, our file system clients have included a comprehensive, event-updated file and stat cache managed using LevelDB. But, maintaining cache coherency is hard , and clients often check in with the server to fetch updates and verify the consistency of their data. Grace mode detects when a connectivity issue prevents such validation and event fetching. In over 99.9% of cases, the local cache requires no update, so skipping validation has no visible effect. Once connectivity improves, the client cache goes back to validating and fetching events from where it last left off.
Waiting for connectivity to fail can take a while, though, when many pages require access to 20+ file assets in a short window of time. To solve this, we added saint mode support, which causes automatic use of grace mode for a 10-second period following a connectivity failure. Where we used to see a block of failures, we now see a single request saved by grace mode with saint mode kicking in to spare the following requests from waiting for failure. The result is improved speed and reliability for files served off Pantheon’s platform.
Network connectivity wasn’t the only cause of issues. We were seeing substantial spikes in load and read latency in (1) the middle of each night when many sites generate manifests for their scheduled backups and (2) when users loaded their dashboards, causing statistics gathering for the relevant volumes.
We’ve re-written how manifests work to have backup client-managed batching that requests 500 entries at a time as needed. This spreads out the overhead of assembling the manifests and reduces Cassandra and Python processing time and memory load for each request. We finished deploying this work on Sunday.
For statistics, we removed them from the dashboard pending a rewrite of the implementation that reduces production load, and a better use-case for taking action based on the stats. We started publishing them to the dashboard many months ago when there were issues with an older Valhalla client that scaled badly under certain directory and file counts. The statistics were useful for nuding Pantheon developers to avoid a non-optimal case, but we solved the root problem and this use-case no longer exists.
In Early Access Deployment: More Resilient Network Access
Several of our key customers are running an “early access” release of the Valhalla client that uses a smarter HTTP client. Specifically, this release uses libcurl, which is the same library Drupal developers will recognize in its PHP-wrapped form .
The important advance over the Neon-based client is non-blocking socket access and SSL negotiation. When a connectivity interruption occurs in certain places, Neon has a hard time breaking out of the bad connection and allowing the client to recover. The libcurl approach supervises all aspects of connectivity, ready to break out and try other strategies when problems occur.
Combined with grace mode and less-blocking nginx configurations we’re rolling out, it’s another major step in allowing sites to weather through connectivity issues.
In Development: Cache Peering, Sharding
There are additional improvements still in development. Cache peering allows Valhalla clients to get file content from the caches of other DROPs. Because files in Valhalla are content-addressed, this sort of sharing is straightforward to integrate, and also very secure. It will also allow us to reduce the working set in our Cassandra clusters to improve metadata performance for file and directory stat operations.
Sharding is an ongoing project at Pantheon to isolate the availability impact of connectivity, software, and hardware issues. Cassandra already shards Valhalla data, but that’s within the scope of a single Cassandra cluster. In the long-term, we’re working on assigning distinct Valhalla/Cassandra clusters within each datacenter and geographic area. This will also help us progressively deploy software and schema updates, perform upgrades with lower impact and limit the scope of any “worst case scenario” disruption. It is also key to allowing us to deploy Pantheon outside of North America.
Why do we go to all this engineering effort around Drupal files?
It’s to deliver on the promise of smooth scaling: a consistent architecture and performance model all the way from a small prototype site to a large enterprise cluster. Even in 2013, the Valhalla alternatives either can’t scale up (local files), can’t scale down (GlusterFS/HekaFS), can’t run a Drupal cluster properly (rsync), or can’t run securely and reliably in containerized, cloud-based deployments (NFS and Ceph).
We’ll keep you up to date as we complete the later phases of this roadmap.
 It’s hard enough that the leading general-purposed distributed file system, GlusterFS, doesn’t even try. Because we’ve engineered Valhalla specifically for web media needs, we’ve been able to narrow the challenge to something implementable. Read more: Education