Yesterday, Basecamp shared the cause behind the outage Basecamp 3 faced on November 8. The outage lasted nearly five hours, from 7:21 am CST to 12:11 pm. During that time, users could still access existing messages, to-do lists, and files, but they could not enter new information or alter existing information.
David Heinemeier Hansson, the creator of Ruby on Rails and founder & CTO at Basecamp, said in his post that this was the worst outage Basecamp has faced in probably 10 years:
“It’s bad enough that we had the worst outage at Basecamp in probably 10 years, but to know that it was avoidable is hard to swallow. And I cannot express my apologies clearly or deeply enough.”
Basecamp 3 remains in read-only mode while we're fixing the problem. The current estimate for when we're back remains about one hour. Here's an update on everything we know so far: https://t.co/CUqacnabvp
— Basecamp (@basecamp) November 8, 2018
Key causes behind the Basecamp 3 outage
Every activity a user performs is tracked in Basecamp’s events table, whether it is posting a message, updating a to-do list, or applauding a comment. The root cause of Basecamp going into read-only mode was the database hitting the ID ceiling of 2,147,483,647 on this very busy events table.
Secondly, Ruby on Rails, the programming framework that Basecamp uses, updated its default for database tables in version 5.1, released in 2017. This update lifted the headroom for records from 2,147,483,647 to 9,223,372,036,854,775,807 on all tables. However, the ID column on Basecamp’s events table was configured as an integer rather than a big integer, so it was still bound by the lower limit.
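The two ceilings quoted above match the largest values of signed 32-bit and signed 64-bit integers. The short sketch below is only an illustration of that arithmetic, assuming the integer and big integer column types map to those widths; the helper name `fits` is hypothetical, and Python's `struct` module stands in for a fixed-width database column.

```python
import struct

# Assumption: "integer" is a signed 32-bit column, "big integer" a signed 64-bit one.
INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1


def fits(fmt: str, value: int) -> bool:
    """Return True if value can be stored in the fixed-width format fmt."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


print(f"{INT_MAX:,}")            # ceiling of a signed 32-bit integer
print(f"{BIGINT_MAX:,}")         # ceiling of a signed 64-bit integer
print(fits("<i", INT_MAX + 1))   # the next ID no longer fits in 32 bits
print(fits("<q", INT_MAX + 1))   # the same ID fits in 64 bits
```

The ID after 2,147,483,647 cannot be represented in the narrower type at all, which is why new records could not be written until the column type was changed.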
The complete timeline of the outage
Time | Activity |
7:21 am CST | The events table in the database ran out of ID numbers because its column was configured as an integer rather than a big integer. An integer runs out of numbers at 2,147,483,647, while a big integer can grow to 9,223,372,036,854,775,807. |
7:29 am CST | The team began work on a database migration to change the column type from regular integer to big integer, then tested the fix on a staging database to make sure it was safe. |
7:52 am CST | The test on the staging database verified that the fix was correct, so the team moved on to changing the production database table. Because of the huge size of the production database, the migration was estimated to take about one hour and forty minutes. |
10:56 am CST-11:52 am CST | The database upgrade was completed, but the team still had to verify all the data and update configurations to ensure no other problems would occur once Basecamp was back online. |
12:22 pm CST | After the successful verification, Basecamp came back online. |
12:33 pm CST | Basecamp went down again because the intense load that followed the application coming back online overwhelmed the caching servers. |
12:41 pm CST | Basecamp came back online after they switched over to the backup caching servers. |
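The 7:21 am entry in the timeline comes down to an ID column reaching its maximum value. As an illustration only, and not a description of Basecamp's tooling, a check along these lines could flag a table whose IDs are approaching the ceiling. The function names, the 25% threshold, and the sample ID values are all hypothetical.

```python
INT_MAX = 2**31 - 1  # ceiling of a signed 32-bit integer column


def id_headroom(current_max_id: int, column_max: int = INT_MAX) -> float:
    """Fraction of the ID space still unused (1.0 = empty, 0.0 = exhausted)."""
    if not 0 <= current_max_id <= column_max:
        raise ValueError("current_max_id must be between 0 and column_max")
    return 1 - current_max_id / column_max


def needs_attention(current_max_id: int,
                    column_max: int = INT_MAX,
                    threshold: float = 0.25) -> bool:
    """Return True when less than `threshold` of the ID space remains."""
    return id_headroom(current_max_id, column_max) < threshold


# Hypothetical values: a nearly full table and a small one.
print(needs_attention(2_000_000_000))  # True, under 7% of the IDs remain
print(needs_attention(1_000_000))      # False, almost all IDs remain
```

In practice, `current_max_id` would come from a query such as the highest ID in the table, and the threshold would be chosen based on how quickly the table grows.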
To read the entire update on Basecamp’s outage, check out David Heinemeier Hansson’s post on Medium.