At 00:02 UTC on July 2, Pantheon observed that the contents of some customer files had been replaced with the contents of other files, possibly files belonging to other Pantheon customers. We traced the problem to a code change we had made to address a bug in our filesystem persistence layer.
The platform uses a write-back cache for small files uploaded to a site. The process responsible for writing files under 1MB from the cache to actual storage contained a section of code that was not thread safe. The issue manifested only under very specific conditions. When it did, the process wrote a corrupted file that could include bytes of memory not belonging to the destination site or file.
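This report does not describe the actual defect, so the following is only an illustrative sketch of one common way a thread-unsafe flush can mix bytes between files, and of one way to avoid it. All names here (UnsafeOffloader, SafeOffloader, flush, MAX_SMALL_FILE) are hypothetical and are not taken from Pantheon's code.

```python
import threading

MAX_SMALL_FILE = 1024 * 1024  # assumed 1MB threshold, per the description above


class UnsafeOffloader:
    """Hypothetical: every thread stages data in the same shared buffer."""

    def __init__(self):
        self.storage = {}  # stand-in for durable storage
        self._buf = bytearray(MAX_SMALL_FILE)  # shared by all threads

    def flush(self, path: str, cached: bytes) -> None:
        n = len(cached)
        self._buf[:n] = cached
        # Another thread may overwrite _buf between these two statements,
        # so the bytes persisted for `path` can belong to a different file.
        self.storage[path] = bytes(self._buf[:n])


class SafeOffloader:
    """Hypothetical: each flush stages data in a private buffer."""

    def __init__(self):
        self.storage = {}
        self._storage_lock = threading.Lock()

    def flush(self, path: str, cached: bytes) -> None:
        buf = bytearray(cached)  # private to this call, never shared
        with self._storage_lock:  # serialize updates to the shared mapping
            self.storage[path] = bytes(buf)


def flush_all(offloader, files: dict) -> None:
    """Flush every cached file on its own thread and wait for completion."""
    threads = [
        threading.Thread(target=offloader.flush, args=(path, data))
        for path, data in files.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The design choice in the safe variant is to remove the shared mutable state rather than to guard it: a per-call buffer cannot be seen by other threads, and the lock protects only the final write to the shared mapping.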
We reverted the change that allowed this bug to manifest and proceeded to aggressively delete the files that could have been corrupted, in order to minimize the chance of exposure. For risk management reasons, we focused primarily on live sites. We have also since disabled access to backups that were generated during that time. Accessing these backups will require you to go through our customer support.
We are currently running an exhaustive audit of all our volumes for all environments to ensure that we have found all potentially corrupted files, and we will delete any further corrupted files we find. Any new findings will be disclosed directly to the relevant organizations via a support ticket. We estimate that this audit will take 48 to 72 hours.
We have high confidence that new corrupted files are not being generated. We are nevertheless implementing a few changes. We are making our off-loader process thread safe, and we are investigating whether we can remove our write-back cache completely as part of a filesystem rewrite. We will also look to add a layer of content verification logic to prevent the platform from serving such files in the future.
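As a rough illustration of what content verification could look like, the sketch below records a SHA-256 digest when a file is stored and refuses to serve the file if its bytes no longer match. This is an assumption about one possible approach, not a description of the planned implementation. VerifiedStore, put, and get are hypothetical names.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


class VerifiedStore:
    """Hypothetical store that checks content integrity before serving."""

    def __init__(self):
        self._blobs = {}    # path -> stored bytes
        self._digests = {}  # path -> digest recorded at upload time

    def put(self, path: str, data: bytes) -> None:
        # Record the digest from the uploaded bytes, before any offloading.
        self._digests[path] = digest(data)
        self._blobs[path] = data

    def get(self, path: str) -> bytes:
        data = self._blobs[path]
        if digest(data) != self._digests[path]:
            # Refuse to serve content that does not match what was uploaded.
            raise ValueError(f"integrity check failed for {path}")
        return data
```

A check like this would not prevent corruption, but it would turn a silent exposure of foreign bytes into a visible error at read time.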