Basecamp 3 faces a read-only outage of nearly 5 hours

Yesterday, Basecamp shared the cause of the outage that hit Basecamp 3 on November 8. The outage lasted nearly five hours, from 7:21 am to 12:11 pm CST. During that time, users could still access existing messages, to-do lists, and files, but they could not enter new information or alter existing information.

David Heinemeier Hansson, the creator of Ruby on Rails and founder and CTO at Basecamp, said in his post that this was the worst outage Basecamp has faced in probably 10 years:

“It’s bad enough that we had the worst outage at Basecamp in probably 10 years, but to know that it was avoidable is hard to swallow. And I cannot express my apologies clearly or deeply enough.”

Key causes behind the Basecamp 3 outage

Every action a user takes is tracked in Basecamp’s events table, whether it is posting a message, updating a to-do list, or applauding a comment. The root cause of Basecamp going into read-only mode was that the ID column on this very busy events table hit its ceiling of 2,147,483,647.

In addition, Ruby on Rails, the programming framework that Basecamp uses, changed its default for database tables in version 5.1, released in 2017. This change raised the headroom for records from 2,147,483,647 to 9,223,372,036,854,775,807 on all tables. However, the column in Basecamp’s database was still configured as an integer rather than a big integer.
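The two ceilings quoted above are the largest values of a signed 32-bit and a signed 64-bit integer. The following is a minimal Python sketch of that arithmetic, not Basecamp's code: the `next_id` helper is hypothetical and only imitates how an auto-incrementing ID column refuses new rows once its type is exhausted.

```python
# Largest values a signed 32-bit and a signed 64-bit integer can hold.
INT32_MAX = 2**31 - 1  # 2,147,483,647 (regular integer column)
INT64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807 (big integer column)


def next_id(current_max: int, ceiling: int) -> int:
    """Hypothetical helper: return the next auto-increment ID.

    Raises OverflowError when the next ID would not fit in the column
    type, which is the point at which new rows can no longer be written.
    """
    candidate = current_max + 1
    if candidate > ceiling:
        raise OverflowError(f"ID {candidate} exceeds column ceiling {ceiling}")
    return candidate
```

With an integer column, `next_id(INT32_MAX, INT32_MAX)` raises `OverflowError`, while the same call with `INT64_MAX` as the ceiling returns 2,147,483,648. Existing rows stay readable in both cases, which is consistent with the read-only behavior described above.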

The complete timeline of the outage

| Time | Activity |
| --- | --- |
| 7:21 am CST | Basecamp ran out of ID numbers on the events table in the database because the column was configured as an integer rather than a big integer. The integer type runs out of numbers at 2,147,483,647, while a big integer can grow to 9,223,372,036,854,775,807. |
| 7:29 am CST | The team started working on a database migration that changed the column type from regular integer to big integer. They then tested this fix on a staging database to make sure it was safe. |
| 7:52 am CST | The test on the staging database verified that the fix was correct, so the team moved on to changing the production database table. Because of the huge size of the production database, the migration was estimated to take about one hour and forty minutes. |
| 10:56 am CST to 11:52 am CST | The upgrade to the database was completed, but the team still had to verify all the data and update configurations to ensure no other problems would appear once Basecamp was back online. |
| 12:22 pm CST | After successful verification, Basecamp came back online. |
| 12:33 pm CST | Basecamp went down again because the intense load once the application was back online overwhelmed the caching server. |
| 12:41 pm CST | Basecamp came back online after the team switched over to the backup caching servers. |
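The timeline begins at the moment the events table ran out of IDs. As an illustration only, the sketch below shows one way the remaining headroom of an integer ID column could be estimated ahead of time. The function names and the daily growth figure are assumptions made for this example and do not come from Basecamp's post.

```python
# Ceiling of a signed 32-bit integer ID column.
INT32_MAX = 2**31 - 1


def headroom_fraction(current_max_id: int, ceiling: int = INT32_MAX) -> float:
    """Fraction of the ID space still unused (1.0 = empty, 0.0 = exhausted)."""
    return (ceiling - current_max_id) / ceiling


def days_until_exhaustion(current_max_id: int, ids_per_day: int,
                          ceiling: int = INT32_MAX) -> int:
    """Whole days of growth left, assuming a constant (hypothetical) daily rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return (ceiling - current_max_id) // ids_per_day


if __name__ == "__main__":
    # Hypothetical numbers: a table at 2.1 billion rows growing by 1 million IDs a day.
    current = 2_100_000_000
    print(f"headroom left: {headroom_fraction(current):.2%}")
    print(f"days left: {days_until_exhaustion(current, 1_000_000)}")
```

A check like this could run on a schedule and raise an alert when the headroom drops below a chosen threshold, leaving time to plan a column migration before writes start failing.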

To read the entire update on Basecamp’s outage, check out David Heinemeier Hansson’s post on Medium.

Published by Bhagyashree R