UPDATE: Since this post, we changed direction by writing a custom client (discussed in this post on Valhalla's open source components). It turned out to be so efficient at what it does that putting additional work into PHP (via file streams, etc.) or bypassing the appserver would slow things down.
At Pantheon, when we look at challenges Drupal projects face, we don't only do what's worked well-enough in existing deployments. We ask ourselves how we can transcend some of the challenges entirely. How Drupal stores files is no exception. So, when we started building the next generation of Pantheon (now launched), we looked at our options—and then built a solution entirely focused on the needs of Drupal developers.
We started with a survey of existing technology, both to see if something off-the-shelf would work for us and to inform the design of any system we might implement.
Local filesystem: This is how we delivered the original Pantheon. It's fast, simple, and reliable, but there's no way to have additional application servers share the same set of files or deliver good high-availability; this option was almost immediately off the table.
Enterprise SAN: We collected quotes for a few options, and they were almost all unscalable (for what we need to bring an awesome Drupal platform to everyone) or overpriced (almost all the capacity has to be bought up-front).
Cloud block storage: Amazon's Elastic Block Store (EBS) has notorious reliability issues in addition to common problems with "successful" snapshots actually working. EBS volumes are limited in scale and only get good performance when striped (which also breaks the ability to use snapshotting at all).
Cloud file storage (like S3): There's no reliable way to mount these and provide multi-level directories. They're also very high-latency and prone to brief access problems when used from a different datacenter.
GlusterFS: This was the most promising option, but it makes tradeoffs (ones that Drupal projects don't need) for fidelity to traditional filesystem semantics, like random block I/O, locking, and optimization for deep filesystem hierarchies. GlusterFS buildouts require adding a logical volume manager or a special filesystem to get snapshot capability. Client machines in a GlusterFS cluster have to have UIDs and GIDs synchronized, generally meaning use of LDAP (another possible point-of-failure) or other system management tools. Providing access to a Gluster filesystem on a server without it being part of the cluster requires exporting access over something like NFS, which requires making that export service itself highly available (HA), too. Resolving a split-brain across a cluster sometimes requires manual administrator intervention. Geo-distributed replication is only possible with a primary/secondary configuration; this complicates fail-over and limits the capability of datacenters housing the "secondary" instances. There's some effort to simplify setups where one cluster serves multiple, isolated customers, like the GlusterFS derivative HekaFA, but they're quite immature.
What Drupal actually needs
If we set out to build yet another totally generic clustered filesystem, we'd be fools. Fortunately, Drupal's needs for files are pretty specific.
Most write operations create entire files. Drupal doesn't usually modify a byte range (in contrast to, say, database servers).
Most read operations hit the edge cache, like Varnish or a CDN. Read performance directly off the filesystem isn't critical.
Most files, once written, never change. (They might, however, get deleted.)
Consistency is less important than availability. It's better to allow access to the latest known version of a file (especially given how little a single file changes once created in Drupal) than fail.
When a Drupal site has multiple environments (dev, test, live, etc.), the vast majority of files will be identical between them. Changes, however, to dev or test should not affect live.
Most files are small (under 5MB), especially because the uploads tend to happen through Drupal.
Files are numerous. It's not totally uncommon for a directory in Drupal to have 10,000 files or more -- often images.
How storage works in Valhalla
Valhalla's storage architecture is more similar to systems like Amazon S3 than traditional filesystems backed by block devices.
In filesystems, a volume is a unique namespace for managing directories and files. Everything Valhalla stores is broken into a series of volumes. Each environment (dev, test, live) of each site on Pantheon gets its own Valhalla volume. Valhalla volumes exist as wide rows in Cassandra that map individual paths to metadata, including a SHA-512 hash of the file content. We can pack all of this data into a single row because, in Cassandra, a row can scale to over two billion entries if the columns are tiny like our file and directory entries.
Valhalla also manages the content of files by creating a row for each content hash. Each row contains a series of columns (each up to 5MB) named after the offsets into the content of the file. A file under 5MB simply has one column named "0". Because file content is addressed by its hash, multiple references to the same content (whether from the same or different volumes) are able to use the same content. This hash addressing automatically prevents duplicate storage (other than in Cassandra, where we keep three copies of everything already).
Valhalla also uses a copy-on-write strategy for when files change; writing different content to an existing file causes the new content to get its own content row and the entry on the volume to be pointed at the new content. To clean up old content that isn't used in any volumes, Valhalla asynchronously counts references (using a special strategy to avoid race conditions in Cassandra) and deletes content that has achieved a stable state of zero references.
Cloning and snapshotting volumes
Valhalla doesn't just use content hash addressing and copy-on-write to save disk space. It also allows rapid cloning of volumes. When a developer on Pantheon uses the dashboard to "sync" files for an environment, Valhalla simply replaces the target volume (usually dev or test) with a cloned version of the source volume (usually live). Developers with a history of waiting on rsync will be happy to know this takes Valhalla under five seconds for ten thousand files. We can also use this functionality for snapshotting volumes by cloning to a destination that does not already exist.
Providing Drupal's files directory
But all this fancy storage on Pantheon is useless if Drupal can't read and write as it expects to its "files" directory. Instead of using the FUSE driver + re-export model of GlusterFS, the Valhalla server directly provides a WebDAV server written in Twisted Python. The server authenticates access and encrypts data by using Pantheon's platform-wide certificate infrastructure. Application servers running Pantheon sites mount each environment's volume using davfs2, which also caches the file content locally so that Drupal servers don't need to download a fresh copy if the file in Valhalla hasn't changed. A load-balancer fronts the whole Valhalla cluster to provide HA and distribute requests to each Valhalla server.
While Valhalla distributes three copies of every asset (whether volume entry or file content) internally, Pantheon still provides off-site file backups for ultimate assurance. We run backups on a server we call "Ellis Island" (because it handles imports to Pantheon, too) using Jenkins. When Jenkins is performing a backup, it mounts the target volume, creates a compressed tarball, and ships it off for storage in Amazon S3. Pantheon makes the archives available for developer download from the dashboard by using S3's "signed URL" facility.
While we're glad we can provide the projects on Pantheon with a reliable, scalable solution to providing Drupal with a "files" directory across multiple application servers, there's more work to be done. Here are some ideas we're looking at. (These aren't formally on our roadmap, especially near-term.)
PHP streams support: Right now, Drupal 7 sites on Pantheon access Valhalla using the local filesystem, which transparently back-ends to Valhalla using davfs2. It would be more efficient to provide direct PHP stream access to Valhalla and skip using WebDAV when possible.
Edge integration: Currently, we route requests that miss Varnish through a custom node.js proxy called Styx (named because it takes requests to their final destination), then to nginx on an application server. If the request is for a static file, nginx then accesses the filesystem mounted using davfs2. It would be more efficient to have Styx directly access Valhalla.
CDN integration: A lot of the difficulty around CDN integration is synchronizing local assets with the CDN. Because Valhalla knows about file changes at a high level, it would provide a great integration point for auto-syncronizing changes on a volume to a CDN.
Desktop access: It's already possible to import archives of files and install various file-management modules on Drupal sites running on Pantheon, but we'll be looking into proper desktop access, probably with WebDAV or SFTP/SSHFS.