Some customer sites unavailable

Incident Report for Pantheon Operations

Postmortem

On March 25, one of our distributed file system clusters was subjected to unusual load, which degraded performance for some of our customers. To address the issue, we added capacity to the affected cluster, but doing so caused some files to become temporarily hidden for some customers.

We have remediated the issue for all affected customers and are updating our capacity playbook so that future capacity increases happen seamlessly and do not require purging file caches. We continue to monitor the cluster's metrics closely and are confident we have sufficient capacity to handle unusual load going forward.

The total impact lasted 55 minutes. The redundancy built into our system limited the number of customers affected (about 0.1% of the sites hosted on the platform), but this was a major outage for the impacted sites. Following our postmortem, we have identified specific solutions, which we are currently building, to protect our file system from such unusual load.

Posted Apr 02, 2019 - 15:33 PDT

Resolved

This issue is now resolved.

Posted Mar 25, 2019 - 10:03 PDT

Monitoring

All affected sites are now operational and we are continuing to monitor the situation.

Posted Mar 25, 2019 - 09:33 PDT

Identified

The issue has been identified, and the affected sites are now responding.

Posted Mar 25, 2019 - 09:12 PDT

Update

We are continuing to investigate this issue and will update as soon as we have more information.

Posted Mar 25, 2019 - 08:54 PDT

Investigating

We are seeing increased latency, which is causing some customer sites to stop responding. Our engineering team is investigating the issue.

Posted Mar 25, 2019 - 08:30 PDT

This incident affected: Customer Sites.