Database endpoint down
Incident Report for Pantheon Operations
Postmortem

On Saturday, June 20th, at around 10:40am PDT, two virtual machines running database containers became extremely unresponsive and began failing ping checks. This affected 0.2% of sites on the platform. Upon alert, all sites with Disaster Recovery enabled were failed over to their replica databases. These sites were unaffected for the remainder of the outage. After recovering the virtual machine, it appeared stable for an hour before networking began failing again. At that point we replaced the virtual machines and recovered all sites. The intermittent disruption lasted approximately three hours in total. We are currently working with our service provider to understand the nature of the network failure affecting these servers.

Posted Jun 22, 2020 - 17:48 PDT

Resolved
This incident has been resolved.
Posted Jun 20, 2020 - 22:10 PDT
Update
Our monitoring indicates all down sites have recovered. We continue to monitor closely while waiting for confirmation that the network issue has been completely resolved.
Posted Jun 20, 2020 - 21:12 PDT
Update
Mitigation work is still underway by our engineering team. We believe that the majority of VM connectivity issues in us-central1-c have been resolved.
Posted Jun 20, 2020 - 19:08 PDT
Update
Our monitoring detected another brief interruption to a small number of sites. All sites have recovered at this time. We are continuing to monitor closely while waiting for confirmation that the network issue has been completely resolved.
Posted Jun 20, 2020 - 17:28 PDT
Monitoring
Our monitoring indicates all down sites have recovered. We continue to monitor closely while waiting for confirmation that the network issue has been completely resolved.
Posted Jun 20, 2020 - 15:19 PDT
Update
We are continuing to work on a fix for this issue.
Posted Jun 20, 2020 - 13:38 PDT
Identified
The issue has been identified as a network issue affecting multiple VMs.
Posted Jun 20, 2020 - 13:02 PDT
Update
We are still investigating this issue.
Posted Jun 20, 2020 - 12:27 PDT
Update
We are continuing to investigate this issue.
Posted Jun 20, 2020 - 11:53 PDT
Investigating
We are seeing another failed database endpoint with a similar pattern and are currently investigating this issue.
Posted Jun 20, 2020 - 11:29 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 20, 2020 - 10:50 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 20, 2020 - 10:30 PDT
Update
We are continuing to investigate this issue.
Posted Jun 20, 2020 - 10:06 PDT
Investigating
We are investigating a failed database endpoint.
Posted Jun 20, 2020 - 09:57 PDT
This incident affected: Customer Sites.