On March 25, one of our distributed file systems clusters was subjected to unusual load, which resulted in degraded performance for some of our customers. To address the issue, we added capacity to the affected cluster, but adding capacity resulted in some files becoming temporarily hidden for some customers.
We have remediated the issue for all affected customers and will be updating our capacity playbook to ensure future capacity increases happen seamlessly and will obviate the need to purge file caches. We continue to monitor the cluster's metrics closely and are confident we have sufficient capacity to handle unusual load going forward.
The total impact lasted 55 minutes. A lot of the redundancy built into our system limited the number of customers affected ( about 0.1% of the sites hosted on the platform), but this was a major outage for the impacted sites. Following our post mortem, we have identified specific solutions that we are currently building to protect our file system from such unusual load.