Infrastructure Issue Affecting Customer Sites
Incident Report for Pantheon Operations
Postmortem

On Wednesday, July 29th at 1:20 PM PT, a networking disruption caused errors in several availability zones in our US region. The disruption had several impacts, including errors returned for requests not cached in the Global CDN. The Pantheon Dashboard was also unavailable for a short time.

Approximately 8% of sites on the Platform were affected for 15 minutes, with almost all sites operating normally by 1:35 PM PT. Fewer than 0.5% of sites did not recover immediately and required additional corrective action. This was largely completed by 2:31 PM PT, although some sites may have seen lingering effects as late as 2 AM PT on July 31.

This incident provided useful insights into a failure mode where some sites' PHP workers may become stuck after network disruptions. We are looking into detecting PHP workers that enter this state and automatically restarting them to speed recovery.
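
For illustration only, the detect-and-restart idea could be sketched roughly as follows. This is not Pantheon's implementation: the `WorkerStatus` record, the "busy"/"idle" state labels, the 120-second threshold, and the per-pass restart cap are all hypothetical assumptions chosen to show the general shape of such a check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerStatus:
    """Hypothetical snapshot of one PHP worker (not a real Pantheon type)."""
    pid: int
    state: str         # assumed labels: "idle" or "busy"
    busy_since: float  # timestamp (seconds) when the current request began


def find_stuck_workers(workers: List[WorkerStatus], now: float,
                       max_busy_seconds: float = 120.0) -> List[int]:
    """Return PIDs of workers busy longer than the (assumed) threshold,
    longest-stuck first."""
    stuck = [w for w in workers
             if w.state == "busy" and now - w.busy_since > max_busy_seconds]
    stuck.sort(key=lambda w: w.busy_since)
    return [w.pid for w in stuck]


def plan_restarts(workers: List[WorkerStatus], now: float,
                  max_busy_seconds: float = 120.0,
                  max_restarts: int = 2) -> List[int]:
    """Cap the number of restarts per pass so that a detection mistake
    cannot take every worker for a site offline at once."""
    return find_stuck_workers(workers, now, max_busy_seconds)[:max_restarts]
```

The restart cap is a design choice in this sketch: limiting how many workers are recycled per pass trades slower recovery for a smaller blast radius if the threshold is set too aggressively.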

Posted Aug 04, 2020 - 16:40 PDT

Resolved
This incident has been resolved.
Posted Jul 30, 2020 - 06:39 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 05:41 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 04:33 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 03:15 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 02:10 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 01:08 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 00:40 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 29, 2020 - 23:18 PDT
Update
We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 22:03 PDT
Update
We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 20:31 PDT
Update
We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 19:12 PDT
Update
We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 18:03 PDT
Update
We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 17:02 PDT
Update
We are continuing to monitor for any further issues.
Posted Jul 29, 2020 - 15:44 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 29, 2020 - 14:16 PDT
Update
We are continuing to work on a fix for this issue.
Posted Jul 29, 2020 - 14:05 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 29, 2020 - 13:37 PDT
Investigating
We are addressing an infrastructure issue that is affecting customer sites and the Pantheon Dashboard.
Posted Jul 29, 2020 - 13:28 PDT
This incident affected: Customer Sites and Dashboard.