At approximately 1 PM PST yesterday ( Sept 5th ) the platform suffered an outage that impacted a number of sites. We detected the issue within 3 minutes, and the total degradation was resolved within 30 minutes. Many sites continued to serve content via our Global CDN caching layer during the outage. The main cause was a deployment to our edge layer that purged one of our main (route) caches too aggressively. As part of improving our platform we deploy code multiple times a day with many layers of testing and protections to avoid such outages. We rarely modify our edge layer. We performed a post mortem of yesterday's incident and have added multiple layers of protection to prevent this from happening again. Such changes in the future will invalidate the caching more slowly, and will provision extra capacity in order for the cache to warm up faster.
Posted Sep 06, 2018 - 16:33 PDT
Our apologies for this interruption in service. The reason for this outage has been identified and is well understood. Our team will be meeting tomorrow to identify process improvements to prevent similar incidents in the future. The result of that effort will be posted here by the end of the day.
Posted Sep 05, 2018 - 14:10 PDT
A fix has been implemented and we are monitoring the results.
Posted Sep 05, 2018 - 13:32 PDT
The issue has been identified and a fix is being implemented.