Infrastructure Issue Affecting Customer Sites
Incident Report for Pantheon Operations
Postmortem

On 2/28/2017, AWS S3 experienced widespread issues that affected Pantheon and our customers. According to Amazon, the issues were caused by high error rates with S3 in US-EAST-1. We first noticed issues around 9:30am PST, and most functions returned to normal around 3pm PST.

Pantheon’s infrastructure is hosted on Rackspace; however, our file storage system (Valhalla) relies on components from both Rackspace and S3. Because of this dependency on S3, customers experienced issues with their dashboards, sites, and support services. The file storage system was designed to withstand interruptions in the S3 service, but this outage lasted longer than the system was designed to tolerate. We put the system into a read-from-cache mode, which allowed many sites to continue to provide some service instead of being completely down.
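
For readers unfamiliar with the idea, the following is a minimal sketch of what a read-from-cache mode can look like in general. It is not Valhalla's actual code: the class `CachedFileStore`, the `cache_only` flag, the `CacheMiss` exception, and the `backend_read` callable are all hypothetical names chosen for illustration, and a plain in-memory dictionary stands in for a real cache.

```python
class CacheMiss(Exception):
    """Raised when a file is not cached and the backend may not be contacted."""


class CachedFileStore:
    """Illustrative read-through cache with a cache-only switch (hypothetical)."""

    def __init__(self, backend_read):
        # backend_read is an assumed callable: path -> bytes (e.g. an object-store fetch).
        self._backend_read = backend_read
        self._cache = {}
        # When True, reads are served from the cache only and never reach the backend.
        self.cache_only = False

    def read(self, path):
        # Serve cached content first in either mode.
        if path in self._cache:
            return self._cache[path]
        # In cache-only mode, fail fast instead of waiting on a degraded backend.
        if self.cache_only:
            raise CacheMiss(path)
        # Normal mode: fetch from the backend and remember the result.
        data = self._backend_read(path)
        self._cache[path] = data
        return data
```

In this sketch, files that were read before the switch stay available while `cache_only` is set, and uncached files fail immediately rather than hanging on the backend. That mirrors the partial service described above, under the stated assumptions.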

We have identified several improvements to limit the impact of similar incidents. In particular, we are looking into cross-provider storage services to decrease our dependency on a single provider. We apologize for the inconvenience caused by this interruption and remain committed to keeping your trust in Pantheon.

Posted Mar 01, 2017 - 18:26 PST

Resolved
The Amazon S3 outage has been resolved and affected sites on Pantheon have returned to normal. Thanks again for your patience. Please let us know if you continue to have issues and we'll be happy to investigate further.
Posted Feb 28, 2017 - 16:08 PST
Update
Amazon is reporting that S3 is functioning normally again. We too are seeing significant improvements. While we believe the issue is resolved, we will continue monitoring.
Posted Feb 28, 2017 - 14:26 PST
Update
We have brought some Pantheon file systems back online. Although we're still observing errors from AWS, they're continuing to improve.
Reading files appears to be recovered. However, write operations are still degraded on Pantheon.
Posted Feb 28, 2017 - 13:35 PST
Update
We're seeking confirmation and testing before we bring the Pantheon system back online.
Posted Feb 28, 2017 - 12:50 PST
Update
There are no new updates on the Amazon S3 outage. The Pantheon filesystem is still in read-from-cache only mode to try to improve the performance of sites.
Posted Feb 28, 2017 - 11:40 PST
Monitoring
Various Pantheon systems are affected by the ongoing Amazon S3 outage, including the Pantheon filesystem. We have forced the Pantheon filesystem into read-from-cache only mode to try to improve the performance of sites.
Posted Feb 28, 2017 - 10:38 PST
Investigating
We are investigating an infrastructure failure that is affecting a portion of customer sites.
Posted Feb 28, 2017 - 09:50 PST