Basecamp 3 faces a read-only outage of nearly 5 hours

Yesterday, Basecamp shared the cause of the outage that Basecamp 3 suffered on November 8. The outage lasted nearly five hours, from 7:21 am CST to 12:11 pm. During this time, users could still access existing messages, to-do lists, and files, but they could not enter new information or change existing information.

David Heinemeier Hansson, the creator of Ruby on Rails and founder & CTO at Basecamp, said in his post that this was the worst outage Basecamp has faced in probably 10 years:

“It’s bad enough that we had the worst outage at Basecamp in probably 10 years, but to know that it was avoidable is hard to swallow. And I cannot express my apologies clearly or deeply enough.”

https://twitter.com/basecamp/status/1060554610241224705

Key causes behind the Basecamp 3 outage

Every action a user takes is tracked in Basecamp’s events table, whether it is posting a message, updating a to-do list, or applauding a comment. Basecamp went into read-only mode because the ID column on this very busy events table hit its ceiling of 2,147,483,647.

Secondly, Ruby on Rails, the programming framework that Basecamp uses, changed its default for database tables in version 5.1, released in 2017. This change raised the headroom for records from 2,147,483,647 to 9,223,372,036,854,775,807 on all tables. However, the column in Basecamp’s database was still configured as an integer rather than a big integer.
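The two ceilings quoted above are the largest values a signed 32-bit and a signed 64-bit integer can hold. The short Python sketch below reproduces that arithmetic and imitates what happens when an auto-incrementing ID reaches the limit. The `next_id` helper and the `IdSpaceExhausted` exception are hypothetical names used only for illustration; they are not Basecamp's code, and a real database enforces this limit itself.

```python
# Largest values that fit in signed 32-bit and signed 64-bit integer columns.
INT_MAX = 2**31 - 1      # 2,147,483,647
BIGINT_MAX = 2**63 - 1   # 9,223,372,036,854,775,807


class IdSpaceExhausted(Exception):
    """Raised when the next auto-increment ID would exceed the column maximum."""


def next_id(current_id: int, column_max: int) -> int:
    """Return the next ID, or raise if the column has no numbers left.

    Hypothetical helper for illustration only.
    """
    candidate = current_id + 1
    if candidate > column_max:
        raise IdSpaceExhausted(
            f"cannot allocate ID {candidate}: column maximum is {column_max}"
        )
    return candidate


if __name__ == "__main__":
    print(f"integer maximum:     {INT_MAX:,}")
    print(f"big integer maximum: {BIGINT_MAX:,}")

    # With an integer column, the row after the last valid ID cannot be written.
    try:
        next_id(INT_MAX, INT_MAX)
    except IdSpaceExhausted as exc:
        print(f"insert fails: {exc}")

    # With a big integer column, the same insert succeeds.
    print(f"next ID with big integer: {next_id(INT_MAX, BIGINT_MAX):,}")
```

Reads never allocate a new ID, which fits the way the outage left existing content accessible while blocking anything that had to record a new event.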

The complete timeline of the outage

Time	Activity
7:21 am CST	They ran out of ID numbers on the events table in the database because the column was configured as an integer rather than a big integer. An integer runs out of numbers at 2,147,483,647, while a big integer can grow to 9,223,372,036,854,775,807.
7:29 am CST	The team started working on a database migration to change the column type from the regular integer to the big integer type. They then tested this fix on a staging database to make sure it was safe.
7:52 am CST	Testing on the staging database verified that the fix was correct, so the team moved on to making the change to the production database table. Due to the huge size of the production database, the migration was estimated to take about one hour and forty minutes.
10:56 am CST-11:52 am CST	The database upgrade was completed, but all the data still had to be verified and the configurations updated to ensure no other problems would arise once Basecamp was back online.
12:22 pm CST	After the successful verification, Basecamp came back online.
12:33 pm CST	Basecamp went down again because the intense load that followed the application coming back online overwhelmed the caching server.
12:41 pm CST	Basecamp came back online after they switched over to the backup caching servers.