Database endpoint down

Incident Report for Pantheon Operations

Postmortem

On Saturday, June 20th, around 10:40am PDT, two virtual machines running database containers became extremely unresponsive and experienced ping failures. This affected 0.2% of sites on the platform. Upon alert, all sites with Disaster Recovery enabled were failed over to their replica databases and were unaffected for the remainder of the outage. After the virtual machine was recovered, it appeared stable for an hour before networking began failing again. At that point we replaced the virtual machines and recovered all sites. The intermittent disruption lasted approximately three hours in total. We are currently working with our service provider to understand the nature of the network failure affecting these servers.

Posted Jun 22, 2020 - 17:48 PDT

Resolved

This incident has been resolved.

Posted Jun 20, 2020 - 22:10 PDT

Update

Our monitoring indicates all down sites have recovered. We continue to monitor closely while waiting for confirmation that the network issue has been completely resolved.

Posted Jun 20, 2020 - 21:12 PDT

Update

Mitigation work by our engineering team is still underway. We believe that the majority of VM connectivity issues in us-central1-c have been resolved.

Posted Jun 20, 2020 - 19:08 PDT

Update

Our monitoring detected another brief interruption to a small number of sites. All sites have recovered at this time. We are continuing to monitor closely while waiting for confirmation that the network issue has been completely resolved.

Posted Jun 20, 2020 - 17:28 PDT

Monitoring

Our monitoring indicates all down sites have recovered. We continue to monitor closely while waiting for confirmation that the network issue has been completely resolved.

Posted Jun 20, 2020 - 15:19 PDT

Update

We are continuing to work on a fix for this issue.

Posted Jun 20, 2020 - 13:38 PDT

Identified

The issue has been identified as a network issue affecting multiple VMs.

Posted Jun 20, 2020 - 13:02 PDT

Update

We are still investigating this issue.

Posted Jun 20, 2020 - 12:27 PDT

Update

We are continuing to investigate this issue.

Posted Jun 20, 2020 - 11:53 PDT

Investigating

We are seeing another failed database endpoint with a similar pattern and are currently investigating this issue.

Posted Jun 20, 2020 - 11:29 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 20, 2020 - 10:50 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 20, 2020 - 10:30 PDT

Update

We are continuing to investigate this issue.

Posted Jun 20, 2020 - 10:06 PDT

Investigating

We are investigating a failed database endpoint.

Posted Jun 20, 2020 - 09:57 PDT

This incident affected: Customer Sites.