We are experiencing a general site outage

Incident Report for Pantheon Operations

Postmortem

On April 23 at 22:37 UTC, an operations error unintentionally cleared the cached routing information that our routers need to reach customers' application containers. The routers needed time to repopulate their caches. The error rate for uncached customer page requests improved linearly over the course of the event. Cached customer page requests were unaffected.
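
As a rough illustration of why a refilling cache can produce a linear recovery, the sketch below models a routing cache that regains entries at a constant rate. It is a toy model, not a description of our routers. The function name, the route count, and the refill rate are hypothetical. It also assumes that a request fails only when its route is missing from the cache and that requests are spread evenly across routes.

```python
def simulate_refill(total_routes, refill_per_minute, minutes):
    """Return the modeled uncached-request error rate at each minute.

    Hypothetical assumptions: the cache starts empty, it refills at a
    constant rate, requests are spread evenly across routes, and a
    request fails only when its route is not yet cached.
    """
    rates = []
    for minute in range(minutes + 1):
        # Routes restored so far, capped at the total number of routes.
        cached = min(total_routes, refill_per_minute * minute)
        # Fraction of requests that would miss the cache and fail.
        rates.append(1 - cached / total_routes)
    return rates


# Illustrative numbers only; they are not measurements from the incident.
rates = simulate_refill(total_routes=1000, refill_per_minute=25, minutes=40)
for minute in (0, 10, 20, 30, 40):
    print(f"minute {minute:2d}: error rate {rates[minute]:.0%}")
```

Under these assumptions the error rate falls by the same amount each minute until every route is cached again, which matches the shape of a linear recovery.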

We have since 1) corrected operations documentation, 2) informed all operators of the changes, 3) identified a number of ways we can protect against a similar incident in the future, and 4) committed to evaluating those options and making further improvements in our upcoming sprints.

Posted Apr 25, 2019 - 14:58 PDT

Resolved

This incident has been resolved.
Posted Apr 23, 2019 - 17:28 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 23, 2019 - 16:36 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 23, 2019 - 16:10 PDT

Investigating

There is an increased error rate when serving uncached requests across most Pantheon sites.
Posted Apr 23, 2019 - 15:47 PDT

This incident affected: Customer Sites.