Downtime during component migration

Incident Report for Landbot

Postmortem

Incident summary

On 20 June 2022, between 16:23 and 16:46 CEST, a cascading failure during a migration to Redis 6 affected some Deployments and caused a service outage.

Impact

This incident affected some customers, who experienced 23 minutes of downtime in the application.

Root Causes

During the Redis migration, we identified a bug in the GitLab CI pipeline that truncates the trailing 0 of the container image tag. As a result, the pipeline deployed release version 1.2 instead of 1.20, which caused a CrashLoopBackOff error in one of our Deployment containers. The environment variables for the Deployments were then pointed at the new Redis 6 instance while those Deployments were still in CrashLoopBackOff.
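The report does not say how the pipeline dropped the trailing zero. One common way this class of bug arises (an assumption here, not a confirmed diagnosis) is a version tag being handled as a number instead of a string somewhere along the way. A minimal Python sketch of that effect, with hypothetical function names:

```python
def tag_via_number(raw_tag: str) -> str:
    """Hypothetical buggy path: the tag is parsed as a float and re-serialized."""
    return str(float(raw_tag))


def tag_as_string(raw_tag: str) -> str:
    """Safer path: the tag is treated as an opaque string and never parsed."""
    return raw_tag.strip()


print(tag_via_number("1.20"))  # 1.2
print(tag_as_string("1.20"))   # 1.20
```

If the real cause is similar, quoting the tag wherever it is defined, so it is never interpreted as a number, would likely avoid the truncation.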

Trigger

The ConfigMap of one of our Deployments was modified, changing the Redis server environment variable to point to the new Redis 6 instance. Restarting these Deployments applied the updated ConfigMaps while one of them was still in CrashLoopBackOff status, causing a cascading failure that affected other Deployments and triggered the incident.
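A pre-flight check before switching the environment variables could have flagged the unhealthy Deployment. The sketch below is illustrative only: the Deployment names, the set of blocking states, and the input mapping are assumptions, and in practice the states would come from the cluster, not from a hard-coded dictionary.

```python
from typing import Dict, List

# Assumed set of container states that should block the switch.
UNHEALTHY_STATES = {"CrashLoopBackOff", "ImagePullBackOff", "Error"}


def blocking_deployments(states: Dict[str, str]) -> List[str]:
    """Return the names of Deployments whose state should block the switch."""
    return sorted(name for name, state in states.items() if state in UNHEALTHY_STATES)


def safe_to_switch(states: Dict[str, str]) -> bool:
    """True only if no Deployment is in a blocking state."""
    return not blocking_deployments(states)


# Hypothetical snapshot resembling the situation at 16:17.
snapshot = {"aux-worker-a": "Running", "aux-worker-b": "CrashLoopBackOff"}
print(safe_to_switch(snapshot))        # False
print(blocking_deployments(snapshot))  # ['aux-worker-b']
```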

Detection

The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.

Timeline

2022-06-20 (all times are CEST)

15:29 - Created new Redis 6 instance
15:44 - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to Redis 6 instance
15:59 - INCIDENT BEGINS Created auxiliary Deployments, which were deployed with release version 1.2 instead of 1.20
16:17 - Updated environment variables, pointing to Redis 6 instance
16:19 - Redis 6 queue began to fill up progressively
16:23 - OUTAGE BEGINS The application stopped working
16:39 - Deleted auxiliary Deployments
16:46 - Redis 6 queue reached 309K length
16:46 - OUTAGE MITIGATED Environment variables restored to previous Redis 5 instance
16:46 - OUTAGE ENDS All services restored and working correctly
17:08 - Followed procedure to continue with the Redis 6 migration
17:10 - After running GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release 1.20
17:11 - Redis 6 queue started to decrease
17:15 - INCIDENT ENDS Redis 6 queue emptied
17:18 - Original Deployments moved to Redis 6 instance
17:31 - Deleted auxiliary Deployments
17:44 - Redis 6 migration completed
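The durations implied by the timeline can be checked with a few lines of Python. The helper name is hypothetical. All timestamps are on the same day and in the same time zone, so plain clock arithmetic is enough.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("16:23", "16:46"))  # outage: 23
print(minutes_between("15:59", "17:15"))  # incident: 76
```

The outage (16:23 to 16:46) lasted 23 minutes, matching the Impact section, while the incident as a whole (15:59 to 17:15) lasted 76 minutes.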

Action Items as a Result of the Postmortem

Investigate GitLab CI bug
Update Redis migration playbook
Reconstruct internal metrics for the incident period

Posted Jun 27, 2022 - 08:56 CEST

Resolved

This incident has been resolved.

Posted Jun 20, 2022 - 17:10 CEST

Monitoring

A fix has been implemented and we are monitoring results.

Posted Jun 20, 2022 - 16:51 CEST

Update

We are continuing to work on a fix for this issue.

Posted Jun 20, 2022 - 16:50 CEST

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 20, 2022 - 16:49 CEST

Update

We are continuing to investigate this issue.

Posted Jun 20, 2022 - 16:48 CEST

Investigating

There is an ongoing incident affecting the Landbot app. The bots are working correctly. We are investigating and working on the issue with maximum priority.

Posted Jun 20, 2022 - 16:48 CEST

This incident affected: Platform (Builder).