Downtime during component migration
Incident Report for Landbot
Postmortem

Incident summary

Between 16:23 and 16:46 CEST on 20 June 2022, a cascading failure during a migration to Redis 6 affected some deployments, resulting in a service outage.

Impact

This incident affected some customers, who experienced 23 minutes of application downtime.

Root Causes

During the Redis migration, we identified a bug in the GitLab CI pipeline that truncates the trailing 0 of the container image tag. As a result, the pipeline deployed release version 1.2 instead of 1.20, which caused a CrashLoopBackOff error for one of our Deployment containers. The environment variables for the Deployments were then pointed to the new Redis 6 instance while the Deployments were still in CrashLoopBackOff.
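The exact cause of the truncation inside the pipeline is not stated here. One common way this class of bug arises is a tool treating an unquoted tag such as 1.20 as a number and printing it back. The following is a minimal sketch of that assumed mechanism, plus a hypothetical guard that flags tags which would not survive it. The function names are illustrative and are not part of GitLab CI.

```python
def coerce_like_a_number(tag):
    """Mimic a tool that parses an unquoted tag as a float and prints it back.

    This is an assumed mechanism, not a confirmed description of the pipeline bug.
    """
    return str(float(tag))


def tag_survives_float_round_trip(tag):
    """Return True if the tag is non-numeric or is unchanged by a float round trip.

    The check is deliberately conservative: any tag that parses as a number
    but prints back differently is reported as unsafe.
    """
    try:
        return str(float(tag)) == tag
    except ValueError:
        # Not parseable as a number, so numeric coercion cannot alter it.
        return True


if __name__ == "__main__":
    print(coerce_like_a_number("1.20"))            # the trailing 0 is lost
    print(tag_survives_float_round_trip("1.20"))   # flagged as unsafe
    print(tag_survives_float_round_trip("v1.20"))  # a prefixed tag is not numeric
```

A check like this could run before a deploy step, so that a tag at risk of truncation fails the pipeline instead of silently resolving to a different release.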

Trigger

The ConfigMap of one of our Deployments was modified, changing the Redis server environment variable to point to the new Redis 6 instance. The restart of these Deployments picked up the updated ConfigMaps while one of them was still in a CrashLoopBackOff status, causing a cascading failure that affected other Deployments and triggered the incident.
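A pre-flight check that refuses to switch configuration while any Deployment is unhealthy would address this kind of trigger. The sketch below is hypothetical: the Deployment names, the set of states treated as unhealthy, and the shape of the status mapping are all assumptions for illustration, and collecting the statuses from the cluster is left out.

```python
# States treated as blocking in this sketch (an assumption, not an exhaustive list).
UNHEALTHY_STATES = {"CrashLoopBackOff", "ImagePullBackOff", "Error"}


def blocked_deployments(statuses):
    """Return the sorted names of Deployments whose reported state is unhealthy.

    `statuses` maps a Deployment name to its reported state string.
    """
    return sorted(
        name for name, state in statuses.items() if state in UNHEALTHY_STATES
    )


def safe_to_switch(statuses):
    """Return True only when no Deployment is in an unhealthy state."""
    return not blocked_deployments(statuses)


if __name__ == "__main__":
    # Hypothetical snapshot: one auxiliary Deployment is crash-looping.
    snapshot = {"aux-worker-a": "Running", "aux-worker-b": "CrashLoopBackOff"}
    if not safe_to_switch(snapshot):
        print("Do not update ConfigMaps yet:", blocked_deployments(snapshot))
```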

Detection

The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.
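Detection of this kind can be automated by watching the queue length over time. The sketch below is one possible approach under stated assumptions: the sample values, the length threshold, and the number of consecutive increases that count as a backlog are illustrative and were not taken from the incident's monitoring setup. Fetching the samples from Redis is out of scope.

```python
def queue_is_backing_up(samples, max_length, min_growing_samples=3):
    """Decide whether a queue looks unhealthy from recent length samples.

    `samples` is a list of queue lengths, oldest first. The queue is reported
    as backing up when the latest sample reaches `max_length`, or when the
    last `min_growing_samples` transitions were all strict increases.
    """
    if not samples:
        return False
    if samples[-1] >= max_length:
        return True
    recent = samples[-(min_growing_samples + 1):]
    if len(recent) < min_growing_samples + 1:
        # Not enough history to judge a trend.
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Illustrative samples only; the threshold is an assumed value.
    print(queue_is_backing_up([10, 50, 200, 900], max_length=100_000))
    print(queue_is_backing_up([10, 5, 7, 3], max_length=100_000))
```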

Timeline

2022-06-20 (all times are CEST)

  • 15:29 - Created new Redis 6 instance
  • 15:44 - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to Redis 6 instance
  • 15:59 - INCIDENT BEGINS Created auxiliary Deployments, which were deployed with release version 1.2 instead of 1.20
  • 16:17 - Updated environment variables, pointing to Redis 6 instance
  • 16:19 - Redis 6 queue began to fill up progressively
  • 16:23 - OUTAGE BEGINS The application stopped working
  • 16:39 - Deleted auxiliary Deployments
  • 16:46 - Redis 6 queue reached 309K length
  • 16:46 - OUTAGE MITIGATED Environment variables restored to previous Redis 5 instance
  • 16:46 - OUTAGE ENDS All services restored and working correctly
  • 17:08 - Followed procedure to continue with the Redis 6 migration
  • 17:10 - After running GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release 1.20
  • 17:11 - Redis 6 queue started to decrease
  • 17:15 - INCIDENT ENDS Redis 6 queue emptied
  • 17:18 - Original Deployments moved to Redis 6 instance
  • 17:31 - Deleted auxiliary Deployments
  • 17:44 - Redis 6 migration completed

Action Items as a result of the Postmortem

  • Investigate GitLab CI bug
  • Update Redis migration playbook
  • Reconstruct internal metrics for the incident period
Posted Jun 27, 2022 - 08:56 CEST

Resolved
This incident has been resolved
Posted Jun 20, 2022 - 17:10 CEST
Monitoring
A fix has been implemented and we are monitoring results.
Posted Jun 20, 2022 - 16:51 CEST
Update
We are continuing to work on a fix for this issue.
Posted Jun 20, 2022 - 16:50 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 20, 2022 - 16:49 CEST
Update
We are continuing to investigate this issue.
Posted Jun 20, 2022 - 16:48 CEST
Investigating
There is an ongoing incident affecting the Landbot app. The bots are working correctly. We are investigating and working on the issue with maximum priority.
Posted Jun 20, 2022 - 16:48 CEST
This incident affected: Platform (Builder).