Infrastructure Issue Affecting Customer Sites
Incident Report for Pantheon Operations
Postmortem

On May 11th, between 13:40 and 13:55 UTC, a number of sites experienced degraded service. The event was triggered by a configuration update to our Global CDN. The change consisted of a security update to our internal TLS infrastructure. The rollout was closely monitored by our engineering team, who quickly detected the problem and started the rollback procedure. By 13:52 UTC the previous version of the Global CDN configuration was restored, and by 13:55 UTC all traffic had returned to normal.

The configuration update was intended to increase the security of our internal TLS infrastructure. Although the change was tested before deployment, a slight divergence in configuration between our testing and production systems allowed the faulty configuration to reach production. The change impacted only uncached traffic. Traffic that was cached by the Global CDN, Advanced Global CDN, or any other CDN in front of Pantheon was not impacted.
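
As an illustration only, and not a description of Pantheon's actual tooling, the kind of divergence mentioned above can be surfaced by comparing the two environments' configurations before a change is promoted. The sketch below assumes each configuration is available as a nested dictionary; the function name `config_drift` and the sample keys are hypothetical.

```python
def config_drift(testing: dict, production: dict, prefix: str = "") -> list[str]:
    """Return dotted key paths whose values differ between two nested configs."""
    drift = []
    for key in sorted(set(testing) | set(production)):
        path = f"{prefix}{key}"
        if key not in testing or key not in production:
            # Key exists in only one environment.
            drift.append(path)
        elif isinstance(testing[key], dict) and isinstance(production[key], dict):
            # Recurse into nested sections and report the full dotted path.
            drift.extend(config_drift(testing[key], production[key], prefix=f"{path}."))
        elif testing[key] != production[key]:
            drift.append(path)
    return drift


# Hypothetical sample data, not taken from the incident.
testing_cfg = {"tls": {"min_version": "1.2", "ciphers": ["A"]}, "cache": {"ttl": 60}}
production_cfg = {
    "tls": {"min_version": "1.2", "ciphers": ["A", "B"]},
    "cache": {"ttl": 60},
    "origin": {"pool": 3},
}
drifted_keys = config_drift(testing_cfg, production_cfg)
```

A non-empty result could be treated as a signal that a test pass in one environment may not carry over to the other.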

We have conducted a review of the incident and identified clear next steps to prevent this type of failure from recurring. We have planned improvements to both our documentation and tooling, including additional pre-production environments that will allow us to better test configuration changes. We also plan to increase the number of redundant Global CDN production services and to implement progressive configuration rollouts. These improvements are intended to reduce the likelihood and impact of a similar incident.
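
For readers unfamiliar with the term, a progressive rollout applies a change to a small share of capacity first and widens it only while health checks pass. The following is a minimal generic sketch of that idea, not Pantheon's implementation; the function `progressive_rollout`, its callback parameters, and the stage fractions are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    apply: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change in stages; roll back and stop at the first failed health check."""
    for fraction in stages:
        apply(fraction)      # push the change to this share of capacity
        if not healthy():    # check health before widening further
            rollback()
            return False
    return True


# Hypothetical usage: the health check fails at the second stage.
log: list = []
completed = progressive_rollout(
    stages=[0.01, 0.1, 1.0],
    apply=lambda fraction: log.append(fraction),
    healthy=lambda: len(log) < 2,
    rollback=lambda: log.append("rollback"),
)
```

In this sketch a failure at an early stage limits exposure to the share of capacity already updated, which is the property that makes staged rollouts attractive for configuration changes.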

We recognize the criticality of this service and sincerely apologize for the negative effects our customers experienced. We will continue working to earn your trust.

Posted May 14, 2021 - 10:51 PDT

Resolved
This incident has been resolved.
Posted May 11, 2021 - 07:00 PDT
Monitoring
We have implemented a fix and are monitoring it now.
Posted May 11, 2021 - 06:29 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted May 11, 2021 - 05:59 PDT
Investigating
We are addressing an infrastructure failure that is affecting customer sites.
Posted May 11, 2021 - 05:53 PDT
This incident affected: Customer Sites.