Yesterday, GitLab suffered a major performance degradation, with a 5x increase in error rate and a site-wide slowdown. The degradation was identified and rectified within a few hours of its discovery.
GitLab engineers promptly started investigating the slowdown on GitLab.com and notified users that the problem lay in the Redis and LRU cluster, which affected all web requests served by the Rails front end. What followed was a comprehensive, running account of the issue: its causes, who was handling which part of it, and more. GitLab’s step-by-step response looked like this:
- First, they investigated slow response times on GitLab.
- Next, they added more workers to alleviate the symptoms of the incident.
- Then, they investigated jobs on shared runners that were being picked up at a low rate or appeared to be stuck.
- Next, they began tracking the CI issues and the observed performance degradation as a single incident.
- Over time, they continued to investigate the degraded performance and CI pipeline delays.
- After a few hours, all services were restored to normal operation, and the CI pipelines continued to catch up on earlier delays at nearly normal levels.
David Smith, the Production Engineering Manager at GitLab, also told users that the performance degradation was due to a few issues tied to Redis latency.
Smith added, “We have been looking into the details of all of the network activity on redis and a few improvements are being worked on. GitLab.com has mostly recovered.”
Many users on Hacker News wrote about their unpleasant experience with GitLab.com.
One user wrote, “I recently started a new position at a company that is using Gitlab. In the last month I’ve seen a lot of degraded performance and service outages (especially in Gitlab CI). If anyone at Gitlab is reading this – please, please slow down on chasing new markets + features and just make the stuff you already have work properly, and fill in the missing pieces.”
Another user commented, “Slow down, simplify things, and improve your user experience. Gitlab already has enough features to be competitive for a while, with the Github + marketplace model.”
Later, a GitLab employee with the username kennyGitLab commented that GitLab is not losing sight of these concerns and is simply following the company’s new strategy of ‘Breadth over depth’. He added, “We believe that the company plowing ahead of other contributors is more valuable in the long run. It encourages others to contribute to the polish while we validate a future direction. As open-source software we want everyone to contribute to the ongoing improvement of GitLab.”
Users were indignant at this response.
One user commented, “‘We’re Open Source!’ isn’t a valid defense when you have paying customers. That pitch sounds great for your VCs, but for someone who spends a portion of their budget on your cloud services – I’m appalled. Gitlab is a SaaS company who also provides an open source set of software. If you don’t want to invest in supporting up time – then don’t sell paid SaaS services.”
Another comment read, “I think I understand the perspective, but the messaging sounds a bit like, ‘Pay us full price while serving as our beta tester; sacrifice the needs of your company so you can fulfill the needs of ours’.”
A few users also praised GitLab for its prompt action and for giving everyone a detailed, in-depth account of the investigation.
One user wrote, “This is EXACTLY what I want to see when there’s a service disruption.
A live, in-depth view of who is doing what, any new leads on the issue, multiple teams chiming in with various diagnostic stats, honestly it’s really awesome.
I know this can’t be expected from most businesses, especially non-open sourced ones, but it’s so refreshing to see this instead of the typical ‘We’re working on a potential service disruption’ that we normally get.”