Downtime during component migration
Incident Report for Landbot

Incident summary

Between 16:23 and 16:46 CEST on June 20, 2022, a cascading failure during a migration to Redis 6 affected some Deployments and caused a service outage.


The incident affected some customers, who experienced 23 minutes of application downtime.

Root Causes

During the Redis migration, we hit a bug in our GitLab CI pipeline that truncated the trailing 0 of the container image tag. As a result, the pipeline deployed release version 1.2 instead of 1.20, which put one of our Deployment containers into CrashLoopBackOff. The environment variables for the Deployments were then pointed at the new Redis 6 instance while the CrashLoopBackOff was still ongoing.
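The source of the truncation is still under investigation (see the action items). One common way a tag such as 1.20 loses its trailing zero is that an unquoted value gets parsed as a number somewhere along the way. The Python sketch below illustrates that mechanism and a guard that blocks a deploy when the rendered tag differs from the intended one. The function names and the tag pattern are hypothetical, and the float coercion is an assumption about the failure mode, not a confirmed diagnosis of the GitLab CI bug.

```python
import re

# Hypothetical pattern for release tags of the form MAJOR.MINOR, e.g. "1.20".
TAG_PATTERN = re.compile(r"\d+\.\d+")


def coerce_like_a_number(raw_tag: str) -> str:
    """Mimic a tool that reads an unquoted tag as a float and prints it back."""
    return str(float(raw_tag))


def check_tag(intended: str, rendered: str) -> None:
    """Raise if the tag that will be deployed differs from the intended one."""
    if not TAG_PATTERN.fullmatch(rendered):
        raise ValueError(f"malformed image tag: {rendered!r}")
    if rendered != intended:
        raise ValueError(f"image tag changed from {intended!r} to {rendered!r}")


intended = "1.20"
rendered = coerce_like_a_number(intended)  # the trailing zero is lost here
try:
    check_tag(intended, rendered)
except ValueError as err:
    print(f"deploy blocked: {err}")
```

Treating the tag as an opaque string from start to finish, and comparing it against the intended release right before deploy, would catch this class of mismatch regardless of where the truncation happens.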


The ConfigMap of one of our Deployments was modified so that its Redis server environment variable pointed to the new Redis 6 instance. Restarting the Deployments applied the updated ConfigMaps while one of them was still in CrashLoopBackOff status, which caused a cascading failure that affected other Deployments and triggered the incident.


The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.
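The sequence described above suggests a pre-flight check: do not repoint any Deployment to the new Redis instance while another one is crash-looping or under-replicated. The following is a minimal sketch of such a gate in Python. The `DeploymentStatus` type, the `blockers` function, and the Deployment names are all hypothetical; in practice the snapshot would be filled in from the cluster state, and this is not the procedure used during the incident.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    """Hypothetical snapshot of one Deployment, filled in from cluster state."""
    name: str
    desired_replicas: int
    ready_replicas: int
    waiting_reasons: tuple = ()  # e.g. ("CrashLoopBackOff",)


def blockers(deployments):
    """Return the names of Deployments that make a Redis switch unsafe."""
    unsafe = []
    for d in deployments:
        crashing = "CrashLoopBackOff" in d.waiting_reasons
        under_replicated = d.ready_replicas < d.desired_replicas
        if crashing or under_replicated:
            unsafe.append(d.name)
    return unsafe


# Hypothetical example data: one crash-looping and one healthy Deployment.
snapshot = [
    DeploymentStatus("aux-worker", 2, 0, ("CrashLoopBackOff",)),
    DeploymentStatus("aux-api", 2, 2),
]
blocked_by = blockers(snapshot)
if blocked_by:
    print("do not repoint Redis yet; unhealthy:", ", ".join(blocked_by))
else:
    print("all Deployments ready; safe to repoint Redis")
```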


2022-06-20 (all times are CEST)

  • 15:29 - Created new Redis 6 instance
  • 15:44 - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to Redis 6 instance
  • 15:59 - INCIDENT BEGINS Created auxiliary Deployments, which were deployed with release version 1.2 instead of 1.20
  • 16:17 - Updated environment variables, pointing to Redis 6 instance
  • 16:19 - Redis 6 queue began to fill up progressively
  • 16:23 - OUTAGE BEGINS The application stopped working
  • 16:39 - Deleted auxiliary Deployments
  • 16:46 - Redis 6 queue reached 309K length
  • 16:46 - OUTAGE MITIGATED Environment variables restored to previous Redis 5 instance
  • 16:46 - OUTAGE ENDS All services restored and working correctly
  • 17:08 - Followed procedure to continue with the Redis 6 migration
  • 17:10 - After running GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release 1.20
  • 17:11 - Redis 6 queue started to decrease
  • 17:15 - INCIDENT ENDS Redis 6 queue emptied
  • 17:18 - Original Deployments moved to Redis 6 instance
  • 17:31 - Deleted auxiliary Deployments
  • 17:44 - Redis 6 migration completed
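
As a rough reading of the timeline above, the average growth of the Redis 6 queue can be estimated from the 16:19 and 16:46 entries. The short Python calculation below assumes the queue was close to empty at 16:19 and treats "309K" as 309,000; both are assumptions, so the result is an order-of-magnitude figure rather than a measured rate.

```python
from datetime import datetime

FMT = "%H:%M"
fill_start = datetime.strptime("16:19", FMT)  # queue began to fill
peak_time = datetime.strptime("16:46", FMT)   # queue reached 309K
peak_length = 309_000                         # "309K", assumed rounded

minutes = (peak_time - fill_start).total_seconds() / 60
rate_per_minute = peak_length / minutes

print(f"{minutes:.0f} minutes, about {rate_per_minute:,.0f} tasks per minute")
```

A figure like this could serve as a starting point for a queue-length alert threshold in the updated migration playbook.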

Action items resulting from the postmortem

  • Investigate GitLab CI bug
  • Update Redis migration playbook
  • Reconstruct internal metrics for the incident period
Posted Jun 27, 2022 - 08:56 CEST

This incident has been resolved
Posted Jun 20, 2022 - 17:10 CEST
A fix has been implemented and we are monitoring results.
Posted Jun 20, 2022 - 16:51 CEST
We are continuing to work on a fix for this issue.
Posted Jun 20, 2022 - 16:50 CEST
The issue has been identified and a fix is being implemented.
Posted Jun 20, 2022 - 16:49 CEST
We are continuing to investigate this issue.
Posted Jun 20, 2022 - 16:48 CEST
There is an ongoing incident affecting the Landbot app. The bots are working correctly. We are investigating and working on the issue with maximum priority.
Posted Jun 20, 2022 - 16:48 CEST
This incident affected: Platform (Builder).