We are experiencing a general site outage
Incident Report for Pantheon Operations
Postmortem

On April 23 at 22:37 UTC, an operations error unintentionally cleared the cached routing information our routers need to reach customers' application containers. The routers then needed time to repopulate their caches. Error rates for uncached customer page requests improved linearly over the course of the event as the caches refilled. Cached customer page requests were unaffected.

We have since 1) corrected operations documentation, 2) informed all operators of the changes, 3) identified a number of ways we can protect against a similar incident in the future, and 4) committed to evaluating those options and making further improvements in our upcoming sprints.

Posted Apr 25, 2019 - 14:58 PDT

Resolved
This incident has been resolved.
Posted Apr 23, 2019 - 17:28 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 23, 2019 - 16:36 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 23, 2019 - 16:10 PDT
Investigating
There is an increased error rate when serving uncached requests across most Pantheon sites.
Posted Apr 23, 2019 - 15:47 PDT
This incident affected: Customer Sites.