July 1 TeamCity Cloud Outage Postmortem
On July 1, 2022, TeamCity Cloud experienced an outage that prevented our customers from running builds on build agents hosted by JetBrains. Here is a summary of the incident and the remediation actions we took.
Affected services:
- Hosted agents
- Billing system (credits calculation, new server assignments)
Incident timeline:
- 3:05 PM – An incident was detected on Production, both by us and by clients.
- 3:20 PM – The root cause of the agent outage was identified: an overloaded RabbitMQ node had stopped serving traffic.
- 3:23 PM – An attempt was made to restore and scale the RabbitMQ cluster.
- 3:30 PM – The attempt to restore the cluster failed and led to complete cluster degradation.
- 3:33 PM – Billing system services started to fail health checks and the whole system degraded.
- 3:45 PM – A RabbitMQ cluster rebuild was started.
- 3:51 PM – Remediation steps were identified and implemented in the automation system.
- 3:55 PM – Remediation started.
- 4:20 PM – The hosted agents’ availability was restored, and the billing system was partially restored.
- 5:30 PM – The billing system was fully restored.
Midday on Friday (July 1) we started to notice anomalies in agent metrics. At 3:05 PM, the issues were visible both for us and for TeamCity Cloud clients – build queue processing had stopped. Clients who reported build issues received errors related to agent upgrades. In general, agents require a quick upgrade (a matter of seconds) and normally they just get assigned to a server transparently. However, that was not the case. Agents did register on TeamCity servers, but they were marked as “Upgrade Required” and were not processing any builds. At this point, a small percentage of servers were affected, but we saw that the problem was growing rapidly.
By 3:20 PM we had identified the root cause: a node in the RabbitMQ (RMQ) cluster had run out of RAM. Instead of refusing new connections as expected, the cluster held them open, leading to hanging threads on the TeamCity servers’ side. Only one RMQ node was affected, so only a small portion of tenants was impacted.
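The client-side lesson here is that a broker which accepts TCP connections but never completes the handshake will hang any thread that waits on it indefinitely. A minimal stdlib sketch of the mitigation (the function name and timeout value are illustrative, not TeamCity’s actual code):

```python
import socket

# Hypothetical sketch: bound how long a client waits on a broker that
# accepts the TCP connection but never answers. Without a timeout,
# recv() blocks forever and the calling thread hangs.
def read_with_timeout(host: str, port: int, timeout_s: float = 5.0) -> bytes:
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)  # applies to every subsequent recv/send
        return sock.recv(4096)      # raises socket.timeout instead of hanging
```

With a bounded wait, a stalled broker node costs each caller at most `timeout_s` seconds instead of a thread.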
By design, “secondary” services should not affect critical functionality such as JetBrains-hosted build agents. However, as we found out during the post-incident analysis, a bug in the code spawned too many threads while waiting for an RMQ connection. This exhausted the servers’ thread pool, which in turn prevented build queue processing. The bug has already been fixed, and the fix will be deployed to Production during the next upgrade.
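The general shape of such a fix is to cap how many threads may be waiting on a broker connection at once, so a stalled dependency fails fast instead of draining the shared pool. A hedged sketch (names and limits are illustrative, not TeamCity’s actual implementation):

```python
import threading

# Hypothetical sketch: at most MAX_PENDING_CONNECTS callers may wait on
# an RMQ connection concurrently; any further attempt fails fast rather
# than tying up another thread from the server's shared pool.
MAX_PENDING_CONNECTS = 4
_connect_slots = threading.BoundedSemaphore(MAX_PENDING_CONNECTS)

def try_connect(connect_fn, timeout_s: float = 5.0):
    # acquire() returns False once the slots are exhausted and the
    # timeout elapses, instead of blocking indefinitely.
    if not _connect_slots.acquire(timeout=timeout_s):
        raise RuntimeError("too many pending RMQ connection attempts")
    try:
        return connect_fn()
    finally:
        _connect_slots.release()
```

The key design choice is that the limit applies to the *secondary* dependency, so build queue processing keeps its threads even while RMQ is unreachable.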
At 3:23 PM we attempted to restore the overloaded RMQ node by scaling up the underlying VM with increased resource limits. However, we also noticed that the other nodes were not handling the excess load and were failing to serve network traffic as well. We decided to stop the whole cluster for a quicker scale-up and remediation.
At 3:30 PM we identified that, although the new VMs had started, the RMQ nodes did not have the queue, user, and exchange definitions. Further investigation showed that our RMQ cluster used EBS (node-local block storage) rather than EFS (shared network storage) for Mnesia, the database in which RabbitMQ keeps these definitions. Due to the lack of internal documentation regarding RMQ, our decision to restart all nodes led to the loss of that data. Ultimately this degraded our billing system, as it is heavily reliant on cross-service communication based on AMQP.
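A periodic export of broker definitions would let a rebuilt cluster be re-seeded quickly even when the node-local Mnesia data is gone. A minimal runbook fragment using RabbitMQ’s own tooling (available in recent RabbitMQ versions; the backup path is illustrative):

```shell
# Export users, vhosts, queues, exchanges, bindings, and policies to JSON.
rabbitmqctl export_definitions /backups/rmq-definitions.json

# After rebuilding the cluster, re-seed it from the last export.
rabbitmqctl import_definitions /backups/rmq-definitions.json
```

Note that definitions cover broker metadata only, not the messages themselves.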
By 3:51 PM we had identified the remediation steps (once we verified that the new cluster was up and running):
- Re-run shared resource provisioning, including shared queues, exchanges and policies.
- Validate and amend the state of tenant-specific infrastructure, such as tenant-specific queues.
- Run a partial update procedure for each tenant, which would ensure the availability of users/queues/policies for each tenant to properly communicate with our core services (and billing system services).
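The per-tenant partial update above can be sketched as a simple reconcile loop: compare what a tenant should have against what the rebuilt cluster actually reports, and (re)declare only what is missing. All names here are illustrative, not our actual automation:

```python
# Hypothetical sketch of the per-tenant partial update procedure.
def plan_missing(desired: set, existing: set) -> list:
    """Return the resource names that must be re-created, sorted for determinism."""
    return sorted(desired - existing)

def remediate_tenant(tenant: str, desired_queues: set, existing_queues: set, declare_queue) -> list:
    # declare_queue is assumed to be idempotent (like an AMQP queue.declare),
    # so re-running the procedure on an already-healthy tenant is safe.
    missing = plan_missing(desired_queues, existing_queues)
    for name in missing:
        declare_queue(tenant, name)
    return missing
```

Because each step is idempotent, the automation can safely be re-run for every tenant without special-casing the ones that were unaffected.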
At 3:55 PM we started the automation according to the remediation steps we had defined.
By 4:20 PM, core functionality such as JetBrains-hosted agent provisioning had been fully restored, and the issue appeared “resolved” to most TeamCity Cloud tenants; however, billing was still partially unavailable.
Between 3:55 PM and 5:30 PM we rolled out the fix for every tenant, which fully restored billing system functionality and concluded the incident.