On May 11th between 13:40 and 13:55 UTC a number of sites experienced degraded service. This event was triggered by a configuration update to our Global CDN. The change consisted in a security update to our internal TLS infrastructure. The rollout was closely monitored by our engineering team who quickly detected the problem and started the rollback procedure. By 13:52 UTC, the previous version of the Global CDN configuration was restored and by 13:55 UTC all traffic was restored to normal.
The configuration update was intended to increase the security of our internal TLS infrastructure. Although the change was tested before deployment, due to a slight divergence in configuration between our testing and production systems, unfortunately, the faulty configuration ended up in production. This change impacted only uncached traffic. Traffic that was cached by the Global CDN, Advanced Global CDN or any other CDN in front of Pantheon was not impacted.
We have conducted a review of the incident and have identified clear next steps to prevent recurrence of this type of failure mode. We have planned improvements in both our documentation and tooling, including creating additional pre-production environments that will allow us to better test configuration changes. We also plan to increase the number of redundant Global CDN production services as well as implement progressive configuration rollouts. These improvements are intended to reduce the likelihood and impact of a similar incident.
We recognize the criticality of this service and we sincerely apologize for the negative affects our customers experienced and will work to continue to earn your trust.