Infrastructure Issue Affecting Customer Sites

Incident Report for Pantheon Operations

Postmortem

The issue stemmed from a disruption in communication between our edge system (Fastly) and Pantheon. Pantheon uses Fastly’s edge network to provide better performance by caching pages. As pages change, Pantheon automatically refreshes them in Fastly. This incident resulted in downtime for any pages that needed to be served by Pantheon.

Pantheon's Experience Protection feature serves previously cached pages when we are unable to refresh a page on the edge network. During this incident, it kept cached pages available unless a customer had specifically turned off this capability for some of their pages. Pages that are built dynamically on Pantheon cannot be cached by the edge network, so they could not be served while communication between the edge network and Pantheon was disrupted.
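
As a rough illustration of the behavior described above, the following Python sketch models an edge cache that falls back to a previously cached copy when the origin cannot be reached. It is a simplified model under stated assumptions, not Pantheon's or Fastly's actual implementation: `EdgeCache`, `CachedPage`, `OriginUnavailable`, the `serve_stale_on_error` flag, and the `cacheable` parameter are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class OriginUnavailable(Exception):
    """Raised by the fetch callable when the origin cannot be reached."""


@dataclass
class CachedPage:
    body: str
    stored_at: float  # time the copy was stored, in seconds
    ttl: float        # how long the copy counts as fresh, in seconds


class EdgeCache:
    """Hypothetical edge cache with a serve-stale-on-error fallback."""

    def __init__(
        self,
        fetch_origin: Callable[[str], str],
        ttl: float = 60.0,
        serve_stale_on_error: bool = True,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch_origin
        self._ttl = ttl
        self._serve_stale = serve_stale_on_error
        self._clock = clock
        self._store: Dict[str, CachedPage] = {}

    def get(self, path: str, cacheable: bool = True) -> Tuple[int, str]:
        """Return (status, body) for a request to `path`."""
        now = self._clock()
        entry = self._store.get(path) if cacheable else None

        # Fresh copy at the edge: no origin round trip is needed.
        if entry is not None and now - entry.stored_at < entry.ttl:
            return 200, entry.body

        try:
            body = self._fetch(path)
        except OriginUnavailable:
            # Origin unreachable: serve the stale copy if one exists and
            # the fallback has not been turned off.
            if entry is not None and self._serve_stale:
                return 200, entry.body
            # Uncacheable pages, or pages with the fallback disabled,
            # have nothing to fall back on.
            return 503, "Service Unavailable"

        if cacheable:
            self._store[path] = CachedPage(body, now, self._ttl)
        return 200, body
```

In this model, a page that was cached before the disruption keeps returning 200 after its copy goes stale, while a dynamic page requested with `cacheable=False` returns 503 for as long as the origin is unreachable. Those are the two outcomes described in the paragraph above.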

We are taking steps to protect against this type of issue. We are enhancing our monitoring systems to detect and report such interruptions more proactively, and we are conducting a comprehensive assessment of our monitoring and alerting framework to ensure it covers all relevant scenarios.

Posted Sep 21, 2023 - 11:25 PDT

Resolved

The issue has been successfully resolved, and there should be no further disruptions. If you encounter any unexpected issues, please don't hesitate to contact our support team for immediate assistance. We greatly appreciate your patience and support throughout this incident.

Posted Sep 14, 2023 - 17:30 PDT

Update

We are making a small adjustment to increase stability and will continue to monitor for another 30 minutes. All sites are still holding stable.

Posted Sep 14, 2023 - 16:37 PDT

Update

All sites are reporting steady uptime and dashboard performance is back to normal. We will continue to monitor for another 30 minutes.

Posted Sep 14, 2023 - 16:05 PDT

Update

Ongoing monitoring is showing a steady improvement in dashboard and site responses.

Posted Sep 14, 2023 - 15:25 PDT

Update

We are continuing to monitor and are seeing steady progress in dashboard and site responses. Multidev creation is now working as well, though it may take a minute for a new multidev to become accessible.

Posted Sep 14, 2023 - 14:57 PDT

Update

Our team is actively addressing these issues and taking preventive measures. Thank you for your patience and support.

Posted Sep 14, 2023 - 14:19 PDT

Update

The original issue has been successfully resolved. While sites continue to recover, you may experience intermittent disruptions in both dashboard access and site functionality as we distribute the load evenly. We are also seeing some reports of new multidev instances not creating successfully.

Our team is actively addressing these side effects and implementing measures to prevent future incidents.

Posted Sep 14, 2023 - 13:39 PDT

Update

The original issue has been resolved, and while sites are recovering, some requests to the dashboard and sites may intermittently fail as load continues to be distributed. We are working to reduce the impact of this side effect.

Posted Sep 14, 2023 - 13:01 PDT

Update

Site availability is continuing to improve. We are actively monitoring and continue to see progress toward a full recovery.

Posted Sep 14, 2023 - 12:12 PDT

Monitoring

A fix has been implemented and we are monitoring the results. Sites are beginning to respond, but the higher initial load is delaying full site availability.

Posted Sep 14, 2023 - 11:44 PDT

Update

A fix is being deployed and is showing positive results. We expect this to be completed platform-wide very shortly.

Posted Sep 14, 2023 - 11:25 PDT

Update

We have identified the issue and are still actively working on a remediation. We apologize for the continued impact and are treating this with the highest priority.

Posted Sep 14, 2023 - 11:13 PDT

Identified

A certificate issue at the edge routing layer has been identified and a fix is being worked on immediately.

Posted Sep 14, 2023 - 10:41 PDT

Update

We are continuing to investigate this issue at the highest priority and understand that it is significantly impacting site availability. We are working to resolve it as soon as possible. Initial investigation points to a networking failure affecting global infrastructure.

We will update this page as soon as we have more information.

For urgent issues, please contact helpdesk@pantheon.io

Posted Sep 14, 2023 - 10:35 PDT

Investigating

We are investigating an infrastructure failure that is affecting customer sites. Current symptoms identified are 503 errors leading to site unavailability.

For urgent issues, please contact helpdesk@pantheon.io

Posted Sep 14, 2023 - 10:00 PDT

This incident affected: Customer Sites, Dashboard, and Spinup Operations.