Infrastructure Issue Affecting Customer Sites

Incident Report for Pantheon Operations

Postmortem

On Wednesday, July 29th at 1:20 PM PT, a networking disruption caused errors in several availability zones in our US region. The disruption had several impacts, including errors returned for requests not cached in the Global CDN. The Pantheon Dashboard was also unavailable for a short time.

Approximately 8% of sites on the Platform were affected for 15 minutes, with almost all sites operating normally by 1:35 PM. Fewer than 0.5% of sites did not recover immediately and required additional corrective action. This corrective action was largely completed by 2:31 PM, though some sites may have seen lingering effects as late as 2 AM PT on July 31.

This incident provided useful insight into a failure mode in which some sites' PHP workers may become stuck after a network disruption. We are looking into detecting PHP workers that enter this state and automatically restarting them to speed recovery.

Posted Aug 04, 2020 - 16:40 PDT

Resolved

This incident has been resolved.
Posted Jul 30, 2020 - 06:39 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 05:41 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 04:33 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 03:15 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 02:10 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 01:08 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 30, 2020 - 00:40 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 29, 2020 - 23:18 PDT

Update

We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 22:03 PDT

Update

We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 20:31 PDT

Update

We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 19:12 PDT

Update

We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 18:03 PDT

Update

We are still continuing to monitor for any further issues.
Posted Jul 29, 2020 - 17:02 PDT

Update

We are continuing to monitor for any further issues.
Posted Jul 29, 2020 - 15:44 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 29, 2020 - 14:16 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jul 29, 2020 - 14:05 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 29, 2020 - 13:37 PDT

Investigating

We are addressing an infrastructure issue that is affecting customer sites and the Pantheon Dashboard.
Posted Jul 29, 2020 - 13:28 PDT
This incident affected: Customer Sites and Dashboard.