On Wednesday, July 29th at 1:20 PM PT, a networking disruption caused errors in several availability zones in our US region. The disruption had several impacts including errors returned for requests not cached in the Global CDN. Additionally the Pantheon Dashboard was also unavailable for a short time.
Approximately 8% of sites on the Platform were affected for 15 minutes with almost all sites operating normally by 1:35 PM. Less than 0.5% of sites did not recover immediately and required additional corrective action. This was largely completed by 2:31 PM, yet some sites may have seen lingering effects as late as 2 AM PST on July 31.
This incident provided useful insights into a failure mode where some sites' PHP workers may become stuck after network disruptions. We are looking into detecting PHP workers that enter this state and automatically restart them to speed recovery.