Stripe’s API degradation RCA found unforeseen interaction of database bugs and a config change led to cascading failure across critical services

On 10th July, Stripe’s API services were degraded twice, from 16:36–17:02 UTC and again from 21:14–22:47 UTC. Although the services recovered in both cases, the incidents caused significantly elevated error rates and response times. Two days later, on 12th July, Stripe shared a root cause analysis of the repeated degradation, as requested by its users.

David Singleton, Stripe’s CTO, summarized the API failures as follows: “two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.”

What was the cause of Stripe’s first API degradation?

Three months before the incident, Stripe had upgraded its databases to a new minor version and performed the testing needed to maintain a quality-assured environment. This included a phased production rollout that started with the less critical clusters and moved on to increasingly critical ones. Although the new version operated properly for three months, on the day of the incident it failed in the presence of multiple stalled nodes, leaving a shard unable to elect a new primary.

“Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.”
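The layout described in the quote can be pictured with a small data model. This is only an illustrative sketch: the class names, the `build_cluster` helper, and the hash-based key routing are assumptions made for the example, since the report does not say how Stripe maps data to shards.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    name: str
    nodes: List[str]  # redundant copies of the same slice of data


@dataclass
class Cluster:
    kind: str           # the kind of data this cluster holds
    shards: List[Shard]

    def shard_for(self, key: str) -> Shard:
        # Assumption: a stable hash of the key picks the shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]


def build_cluster(kind: str, shard_count: int, nodes_per_shard: int) -> Cluster:
    """Hypothetical helper that builds a cluster with evenly sized shards."""
    shards = [
        Shard(
            name=f"{kind}-shard-{s}",
            nodes=[f"{kind}-shard-{s}-node-{n}" for n in range(nodes_per_shard)],
        )
        for s in range(shard_count)
    ]
    return Cluster(kind=kind, shards=shards)


cluster = build_cluster("example-kind", shard_count=4, nodes_per_shard=3)
target = cluster.shard_for("record-123")
```

In a layout like this, every request for a given key lands on the same shard, so a shard that cannot serve traffic affects every caller that depends on its slice of the data.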

Because the shard was widely used, its unavailability starved the API of compute resources and severely degraded the API services. The Stripe team detected the failed election within a minute and started incident response within two minutes. The team forced the election of a new primary, which led to a restart of the database cluster. The Stripe API fully recovered 27 minutes after the degradation began.
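One generic way to keep a single unavailable shard from tying up all API workers is a per-shard concurrency cap, often called a bulkhead. The sketch below is not Stripe's implementation, which the report does not describe; the `ShardBulkhead` class and its limit are hypothetical and only illustrate the idea.

```python
import threading
from typing import Dict


class ShardBulkhead:
    """Caps the number of in-flight requests per shard (illustrative only)."""

    def __init__(self, max_in_flight_per_shard: int) -> None:
        self._limit = max_in_flight_per_shard
        self._in_flight: Dict[str, int] = {}
        self._lock = threading.Lock()

    def try_acquire(self, shard: str) -> bool:
        # Reject immediately instead of queueing behind a stalled shard.
        with self._lock:
            current = self._in_flight.get(shard, 0)
            if current >= self._limit:
                return False
            self._in_flight[shard] = current + 1
            return True

    def release(self, shard: str) -> None:
        # Call when a request to the shard finishes, successfully or not.
        with self._lock:
            current = self._in_flight.get(shard, 0)
            if current > 0:
                self._in_flight[shard] = current - 1


bulkhead = ShardBulkhead(max_in_flight_per_shard=2)
first = bulkhead.try_acquire("stalled-shard")
second = bulkhead.try_acquire("stalled-shard")
third = bulkhead.try_acquire("stalled-shard")   # over the cap, rejected
other = bulkhead.try_acquire("healthy-shard")   # unaffected by the stalled shard
```

With a cap in place, requests that would otherwise wait on the stalled shard fail fast, so workers remain available for traffic that targets healthy shards.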

What caused Stripe’s API to degrade again?

Once Stripe’s API had recovered, the team began investigating the root cause of the first degradation. They identified a code path in the new version of the database’s election protocol and decided to revert all shards of the impacted cluster to the previous known stable version. The rollback was deployed within four minutes. The cluster worked fine until 21:14 UTC, when automated alerts fired indicating that some shards in the cluster were again unavailable, including the shard implicated in the first degradation.

Although the symptoms appeared to be the same, the second degradation had a different cause: the previous stable version that had just been restored interacted poorly with a configuration change to the production shards. Once CPU starvation was observed, the Stripe team updated the production configuration and restored the affected shards. Once the shards were verified as healthy, the team began ramping traffic back up, prioritizing the services required by user-initiated API requests. Stripe’s API services finally recovered at 22:47 UTC.

Remedial actions taken

The Stripe team has taken the following measures to ensure that such degradation does not recur:

An additional monitoring system has been implemented to alert whenever nodes stop reporting replication lag.

Several changes have been introduced to prevent failures of individual shards from cascading across large fractions of API traffic.

Further, Stripe will introduce more procedures and tooling so that operators can safely make rapid configuration changes during incident response.
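The first measure above alerts on the absence of a metric rather than on its value. A minimal sketch of that idea follows; the `ReplicationLagWatch` class, the node names, and the 30-second threshold are assumptions for illustration and are not taken from Stripe's report.

```python
from typing import Dict, List


class ReplicationLagWatch:
    """Flags nodes that have stopped reporting replication lag (illustrative)."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._last_report: Dict[str, float] = {}

    def record(self, node: str, reported_at: float) -> None:
        # Only the time of the report matters here; the lag value is ignored.
        self._last_report[node] = reported_at

    def silent_nodes(self, now: float) -> List[str]:
        # A node is silent if its last report is older than the threshold.
        return sorted(
            node
            for node, seen in self._last_report.items()
            if now - seen > self.max_silence_seconds
        )


watch = ReplicationLagWatch(max_silence_seconds=30.0)
watch.record("node-a", reported_at=100.0)
watch.record("node-b", reported_at=125.0)
stale = watch.silent_nodes(now=140.0)
```

Timestamps are passed in explicitly so the check is easy to test; a real monitor would read them from a clock and would also need to know which nodes are expected to report in the first place.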

Reactions to Stripe’s analysis of the API degradation have been mixed. Some users believe that, at that moment, the Stripe team should have focused on fully mitigating the error rather than on analysing the situation.

A Hacker News comment read, “In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto.

20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions on capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.”

A few users appreciated Stripe’s analysis report.

https://twitter.com/thinkdigitalco/status/1149767229392769024

Visit the Stripe website for a detailed timeline report.
