At 11:45 GMT the operations team confirmed that the emergency downgrade procedure had worked correctly and that all websites were functional again.
At 11:25 GMT the operations team decided to execute the emergency downgrade procedure, as the issue had been traced to a post-upgrade problem.
Before 11:05 GMT our monitoring system detected the malfunctioning sites, and our operations team started working on the issue.
Due to an upgrade operation on the frontend infrastructure, some websites, including:
were not accessible.
The outage was caused by a problem in the new version of our frontend software: some of the deployed setups were incompatible with it.
This kind of issue does not result in loss of existing data.
Users were unable to access some websites. The issue could also have prevented some tools from sending data to backends served by the frontend infrastructure.
Our monitoring correctly detected the problem.
Our emergency downgrade procedure worked correctly and allowed us to bring services back online without additional effort.
We are going to improve our procedures, including selecting different times for upgrades, to minimise the potential negative impact of such situations.
We are going to improve our automated test suites to cover this case.
We are planning to improve our frontend infrastructure to support selective upgrades, which will further reduce the impact on users and allow us to detect issues on our infrastructure with much less disruption.