Why moving from a monolithic architecture to microservices is so hard, Gitlab’s Jason Plum breaks it down [KubeCon+CNC Talk]

Last week, at the KubeCon+CloudNativeCon North America 2018, Jason Plum, Sr. software engineer, distribution at GitLab spoke about GitLab, Omnibus, and the concept of monolith and its downsides. He spent the last year working on the cloud native helm charts and breaking out a complicated pile of code.

This article highlights few insights from Jason Plum’s talk on Monolith to Microservice: Pitchforks Not Included at the KubeCon + CloudNativeCon.

Key takeaways

“You could not have seen the future that you live in today, learn from what you've got in the past, learn what's available now and work your way to it.” - Jason Plum

GitLab’s beginnings as the monolithic project provided the means for focused acceleration and innovation.

The need to scale better and faster than the traditional models caused to reflect on our choices, as we needed to grow beyond the current architecture to keep up.

New ways of doing things require new ways of looking at them. Be open minded, and remember your correct choices in the past could not see the future you live in.

“So the real question people don't realize is what is GitLab?”- Jason Plum

Gitlab is the first single application to have the entire DevOps lifecycle in a single Interface.

Omnibus - The journey from a package to a monolith

“We had a group of people working on a single product to binding that and then we took that, we bundled that. And we shipped it and we shipped it and we shipped it and we shipped it and all the twenties every month for the entire lifespan of this company we have done that, that's not been easy. Being a monolith made that something that was simple to do at scale.”- Jason Plum

In the beginning it was simple as Ruby on Rails was on a single codebase and users had to deploy it from source. Just one gigantic code was used but that's not the case these days.

Ruby on Rails is still used for the primary application but now a shim proxy called workhorse is used that takes the heavy lifting away from Ruby. It ensures the users and their API’s are are responsive. The team at GitLab started packaging this because doing everything from source was difficult. They created the Omnibus package which eventually became the gigantic monolith.

Monoliths make sense because...

Adding features is simple

It’s easy as everything is one bundle

Clear focus for Minimum Viable Product (MVP)

Advantages of Omnibus

Full-stack bundle provides all components necessary to use every feature of GitLab.

Simple to install.

Components can be individually enabled/disabled.

East to distribute.

Highly controlled, version locked components.

Guaranteed configuration stability.

The downsides of monoliths

“The problem is this thing is massive” - Jason Plum

The Omnibus package can work on any platform, any cloud and under any distribution. But the question is how many of us would want to manage fleets of VMs? This package has grown so much that it is 1.5 gigabytes and unpacked. It has all the features and is still usable. If a user downloads 500 megabytes as an installation package then it unpacks almost a gigabyte and a half. This package contains everything that is required to run the SaaS but the problem is that this package is massive.

“The trick is Git itself is the reason that moving to cloud native was hard.” - Jason Plum

While using Git, the users run a couple of commands, they push them and deploy the app. But at the core of that command is how everything is handled and how everything is put together. Git works with snapshots of the entire file. The number of files include, every file the user has and every version the user had. It also involves all the indexes and references and some optimizations. But the problem is the more the files, the harder it gets.

“Has anybody ever checked out the Linux tree? You check out that tree, get your coffee, come back check out, any branch I don't care what it is and then dip that against current master. How many files just got read on the file system?” - Jason Plum

When you come back you realize that all the files that are marked as different and between the two of them when you do diff, that information is not stored, it's not greeting and it is not even cutting it out. It is running differently on all of those files.

Imagine how bad that gets when you have 10 million lines of code in a repository that's 15 years old ? That’s expensive in terms of performance. - Jason Plum

Traditional methods - A big problem

“Now let's actually go and make a branch make some changes and commit them right. Now you push them up to your fork and now you go into add if you on an M R. Now it's my job to do the thing that was already hard on your laptop, right? Okay cool, that's one of you, how about 10,000 people a second right do you see where this is going? Suddenly it's harder but why is this the problem?” - Jason Plum

The answer is traditional methods, as they are quite slow. If we have hundreds of things in the fleet, accessing tens of machines that are massive and it still won’t work because the traditional methods are a problem.

Is NFS a solution to this problem?

NFS (Network File System) works well when there are just 10 or 100 people. But if a user is asked to manage an NFS server for 5,000 people, one might rather choose pitchfork. NFS is capable but it can’t work at such a scale.

The Git team now has a mount that has to be on every single node, as the API code and web code and other processes which needs to be functional enough to read the files. The team has previously used Garrett, Lib Git to read the files on the file system. Every time, one reads the file, the whole file used to get pulled. This gave rise to another problem, disk i/o problems. Since, everybody tries to read the disparate set of files, the traffic increases.

“Okay so we have definitely found a scaling limit now we can only push the traditional methods of up and out so far before we realize that that's just not going to work because we don't have big enough pipes, end of line. So now we've got all of this and we've just got more of them and more of them and more of them. And all of a sudden we need to add 15 nodes to the fleet and another 15 nodes to the fleet and another 15 nodes to the fleet to keep up with sudden user demand. With every single time we have to double something the choke points do not grow - they get tighter and tighter” - Jason Plum

The team decided to take a second look at the problem and started working on a project called Gitaly. They took the API calls that the users would make to live Git. So the Git mechanics was sent over a GRPC and then Gitaly was put on the actual file servers. Further the users were asked to call for a diff on whatever they want and then Gitaly was asked for the response. There is no need of NFS now.

“I can send a 1k packet get a 4k response instead of NFS and reading 10,000 files. We centralized everything across and this gives us the ability to actually meet throughput because that pipe that's not getting any bigger suddenly has 1/10 of the traffic going through it.” - Jason Plum

This leaves more space for users to easily get to the file servers and further removes the need of NFS mounts for everything. Incase one node is lost then half of the fleet is not lost in an instant.

How is Gitaly useful?

With Gitaly the throughput requirement significantly reduced.

The service nodes no more need disk access.

It provides optimization for specific problems.

How to solve Git’s performance related issue?

For better optimization and performance it is important to treat it like a service or like a database. The file system is still in use and all of the accesses to the files are on the node where we have the best performance and best caching and there is no issue with regards to the network.

“To take the monolith and rip a chunk out make it something else and literally prop the thing up, but how long are we going to be able to do this?” - Jason Plum

If a user plans to upload something then he/she has to use a file system and which means that NFS hasn't gone away.

Do we really need to have NFS because somebody uploaded a cat picture? Come on guys we can do better than that right?- Jason Plum

The next solution was to take everything as a traditional file that does not get and move into object store as an option. This matters because there is no need to have a file system locally. The files can be handed over to a service that works well. And it could run on Prem in a cloud and can be handled by any number of men and service providers.

Pets cattle is a popular term by CERN which means anything that can be replaced easily is cattle and anything that you have to care and feed for on a regular basis is a pet. The pet could be the stateful information, for example, database.

The problem can be better explained with configuring the Omnibus at scale. If there are hundreds of the VM’s and they are getting installed, further which the entire package is getting installed. So now there are 20 gigabytes per VM. The package needs to be downloaded for all the VM’s which means almost 500 megabytes. All the individual components can be configured out of the Omnibus. But even the load gets spreaded, it will still remain this big. And each of the nodes will at least take two minutes to come up from. So to speed up this process, the massive stack needs to be broken down into chunks and containers so they can be treated as individualized services. Also, there is no need of NFS as the components are no longer bound to the NFS disk. And this process would now take just five seconds instead of two minutes.

A problem called legacy debt, a shared file system expectation which was a bugger. If there are separate containers and there is no shared disk then it could again give rise to a problem.

“I can't do a shared disk because if we do shared disk through rewrite many. What's the major provider that will do that for us on every platform, anybody remember another three-letter problem.” - Jason Plum

Then there came an interesting problem called workhorse, a smart proxy that talks to the UNIX sockets and not TCP. Though this problem got fixed.

Time constraints - another problem

“We can't break existing users and we can't have hiccups we have to think about everything ahead of time plan well and execute.” - Jason Plum

Time constraints is a serious problem for a project’s developers, the development resources milestones, roadmaps deliverables. The new features would keep on coming into the project. The project would keep on functioning in the background but the existing users can’t be kept waiting.

Is it possible to define individual component requirements?

“Do you know how much CPU you need when idle versus when there's 10 people versus literally some guy clicking around and if files because he's one to look at what the kernel would like in 2 6 2 ?”- Jason Plum

Monitoring helps to understand the component requirements. Metrics and performance data are few of the key elements for getting the exact component requirements. Other parameters like network, throughput, load balance, services etc also play an important role. But the problem is how to deal with throughput? How to balance the services? How to ensure that those services are always up?

Then the other question comes up regarding the providers and load balancers as everyone doesn’t want to use the same load balancers or the same services. The system must support all the load balancers from all the major cloud providers and which is difficult.

Issues with scaling

“Maybe 50 percent for the thing that needs a lot of memory is a bad idea. I thought 50 percent was okay because when I ran a QA test against it, it didn't ever use more than 50 percent of one CPU. Apparently when I ran three more it now used 115 percent and I had 16 pounds and it fell over again.” - Jason Plum

It's important to know what things needs to be scaled horizontally and which ones needs to be scaled vertically. To go automated or manual is also a crucial question. Also, it is equally important to understand which things should be configurable and how to tweak them as the use cases may vary from project to project. So, one should know how to go about a test and how to document a test.

Issues with resilience

“What happens to the application when a node, a whole node disappears off the cluster? Do you know how that behaves?” - Jason Plum

It is important to understand which things shouldn't be on the same nodes. But the problem is how to recover it. These things are not known and by the time one understands the problem and the solution, it is too late. We need new ways of examining these issues and for planning the solution.

Jason’s insightful talk on Monolith to Microservice gives a perfect end to the KubeCon + CloudNativeCon and is a must watch for everyone.

Kelsey Hightower on Serverless and Security on Kubernetes at KubeCon + CloudNative

RedHat contributes etcd, a distributed key-value store project, to the Cloud Native Computing Foundation at KubeCon + CloudNativeCon

Oracle introduces Oracle Cloud Native Framework at KubeCon+CloudNativeCon 2018