15/09/2020 11:45 GMT All websites are accessible [INCIDENT FINISHED]
At 11:45 GMT the operations team confirmed that the emergency downgrade procedure had worked correctly and that all websites were functional.
15/09/2020 11:25 GMT Decision to execute emergency downgrade procedure
At 11:25 GMT the operations team decided to execute the emergency downgrade procedure, as the issue had been traced to a post-upgrade problem.
15/09/2020 11:05 GMT Monitoring detects malfunctioning sites
Before 11:05 GMT our monitoring system detected the malfunctioning sites. Our operations team started to work on the issue.
15/09/2020 11:00 GMT Some websites became inaccessible [INCIDENT STARTED]
Due to an upgrade operation on the frontend infrastructure, some websites were not accessible, including:
- rapid.space
- status.rapid.space
- handbook.rapid.space
- shop.rapid.space
- slapos.rapid.space
- console.rapid.space
Additional information
Reason
The incident was caused by a problem in the new version of our frontend software. Some of the deployed setups were incompatible with that new version.
Impact
This kind of issue does not cause any loss of existing data.
Users were unable to access some websites. The issue could also have prevented some tools from sending data to backends served by the frontend infrastructure.
Lessons learnt
Our monitoring correctly detected the problem.
Our emergency downgrade procedure worked correctly and allowed us to bring services back online without additional effort.
We are going to improve our procedures, including selecting different times for upgrades, in order to minimise the potential negative impact of such situations.
We are going to improve our automated test suites to cover this case.
We are planning to improve our frontend infrastructure to support selective upgrades, which will further reduce the impact on users and allow us to detect issues on our infrastructure before they affect all sites.