Incident summary
Between the hour of 16:23 and 16:46 CEST on the 20th of June, a cascading failure during a Redis migration to version 6 affected some deployments, resulting in a service outage.
Impact
This incident affected some customers, who experienced a downtime in the application for 23 minutes.
Root Causes
During a Redis migration, a bug in the GitLab CI pipeline was identified. This bug truncates the trailing 0 of the container image tag, so it deployed a release version 1.2
instead of 1.20
, causing a CrashLoopBackOff error for one of our Deployment containers. The environment variables for the Deployments were moved to the new Redis 6 instance when the Deployments were still in CrashLoopBackOff.
Trigger
One of our Deployment’s ConfigMap was modified, changing the Redis server environment variable to the new Redis 6 instance. The restart of these Deployments updated their ConfigMaps, when one of them were still in a CrashLoopBackOff status, causing a cascading failure that affected other Deployments and triggering the incident.
Detection
The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.
Timeline
2022-06-20 (all times are CEST)
- 15:29 - Created new Redis 6 instance
- 15:44 - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to Redis 6 instance
- 15:59 - INCIDENT BEGINS Created auxiliary Deployments, that were deployed with release version
1.2
instead of 1.20
- 16:17 - Updated environment variables, pointing to Redis 6 instance
- 16:19 - Redis 6 queue began to fill up progressively
- 16:23 - OUTAGE BEGINS The application stopped working
- 16:39 - Deleted auxiliary Deployments
- 16:46 - Redis 6 queue reached 309K length
- 16:46 - OUTAGE MITIGATED Environment variables restored to previous Redis 5 instance
- 16:46 - OUTAGE ENDS All services restored and working correctly
- 17:08 - Followed procedure to continue with the Redis 6 migration
- 17:10 - After running GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release
1.20
- 17:11 - Redis 6 queue started to decrease
- 17:15 - INCIDENT ENDS Redis 6 queue emptied
- 17:18 - Original Deployments moved to Redis 6 instance
- 17:31 - Deleted auxiliary Deployments
- 17:44 - Redis 6 migration completed
Action Items as result of Postmortem
- Investigate GitLab CI bug
- Update Redis migration playbook
- Recompose internal metrics during the incident