Despite considerable hype, chaos engineering doesn’t appear to have yet completely captured the imagination of the wider software engineering world. According to this year’s Skill Up survey, when asked, only 13% of developers said they were excited about it. But that doesn’t mean we should disregard – far from it. Like many of the best trends, it might blow up when we least expect. It might find its way onto your CTOs eyes in just a few months.
As site reliability engineering grows as a discipline, and as businesses start to put a value on downtime, chaos engineering is likely to become a big part of the reliability and resilience toolkit.
Gremlin, chaos engineering, and the end of the age of downtime
“People are expected to always be up” says Matt Fornaciari, co-founder and CTO of Gremlin, a product that offers “failure as a service” to businesses. I spoke to Fornaciari last month to get a deeper insight on Gremlin and the team and ideas behind it. He believes the world has changed in recent years, and the days of service windows when sites would just be taken down for an hour or two for an update or change is over: “that’s unacceptable to people these days.”
Fornaciari isn’t an unbiased observer, of course. The success of Gremlin depends on chaos engineering’s adoption and acceptance. However, he’s not going out on a limb; there’s clear VC interest in Gremlin. At the end of 2017 the company received their first round of funding – more than 7 million USD. It’s a cliche but money does talk – and in this instance it seems to be saying that this approach might change the way we think about building our software.
Arguably, chaos engineering – and by extension Gremlin – is a response to other trends in software. “I’ve seen a lot of signals that this is the way the world’s going”, Fornaciari says. He’s referring here to broader trends like cloud and microservices. He explains that because microservices is all about modularity, and breaking aspects of your software infrastructure into smaller pieces “you end up with nodes in this network” which “adds network complexity.” Consequently, this additional complexity means there is more that can go wrong – it becomes more unreliable.
Gremlin’s bid to democratize chaos engineering
It’s important to note here that chaos engineering has been around for some time – it’s not a radically new methodology. But it’s largely been locked away in some of the world’s biggest tech companies, like Netflix and Amazon. Many of Gremlin’s leaders actually worked at those companies – Fornaciari has worked at Salesforce and Amazon, for example. “The main goal was to democratize chaos engineering… we’ve [the Gremlin team] done it at the bigger companies and we’re like you know what, everyone can benefit from this”.
That is the essential point around chaos engineering. If it’s going to catch on in the mainstream tech world, it needs to be more accessible to different businesses. Fornaciari explains that many of Gremlin’s customers are larger organizations. These are companies for whom downtime is of utmost importance, where a site outage that lasts just an hour could cost thousands of dollars. That said, from a cultural perspective, many organizations find it difficult to adopt this sort of mindset. “Proving the value of something that doesn’t happen,” Fornaciari says, is one of the biggest challenges for Gremlin. This is particularly true when selling their tool.
Pager pain: How Gremlin sells chaos engineering to customers
This is how Gremlin does it: “We have three qualifying questions: do you measure your downtime? Do you have somebody who’s responsible for downtime? And do you actually have a dollar amount tied to it?”
Presumably, for many organizations at least one answer to these questions is “no”. That’s why customer support is so important for Gremlin. “Customer success and developer advocacy are two of our biggest initiatives… I’ve told people as we’re recruiting them that half of our goal as a company is to educate people.”
Gremlin’s challenges as a product and as a business reflect the wider difficulties of managing upwards. The tension between those ‘on the ground’ and those at a more senior and managerial level is one that Gremlin is acutely aware of. This is where a lot of push back comes from, Fornaciari explains:
What we’ve seen so far is just push back from top down – like, why do we need this? We use the term pager pain to define the engineer on call – the closer you are to the ground the closer you are to the on call rotation and the more you feel those pains and the more you believe in this but as you raise up a couple of levels you maybe don’t feel that as much… if you don’t have that measure on uptime – unless someone is on the hook for that at a higher level there’s oftentimes a why do we need this, why are we going to spend money on breaking things.
Pager pain is a nice concept – it captures the tension between different layers of management. It highlights the conflict between ‘what do we need?’ and ‘what can we do?’
Read next: Blockchain can solve tech’s trust issues
Safety, simplicity and security
To successfully sell Gremlin, the way the product is designed is everything. For that reason, the Gremlin team have three tenets built into their product: safety, security, simplicity. When you’ve got a “potentially dangerous tool,” as Fornaciari himself describes it, making sure things are safe and secure is absolutely essential. Arguably, the fact that chaos engineering is so hard to do well might be something that Gremlin can use to its advantage. “One thing we hear when we talk to companies about it is ‘well we’ll go build this ourselves’ and the fact is it’s a really hard thing to do, and a hard thing to do well.”
Gremlin is walking on a bit of a tightrope. On the one hand chaos engineering is for everyone, but on the other it’s difficult and dangerous. It should be accessible, but not too accessible. “One of the reasons we don’t have a free offering is because we are a little worried about protecting our customers not doing any harm to people… I mean, this is essentially giving somebody a potentially dangerous tool.. If they’re not given the proper education then that could be a problem, right?”
Gremlin aren’t the only chaos engineering product out there. As with any trend, there are plenty of software platforms and tools emerging for technologically forward thinking businesses. Fornaciari doesn’t see these as a threat – he’s confident, bullish even, about Gremlin’s place in the market. “There are a lot of tools out there that people can go and use but they really lack the safety and simplicity.” Alongside its philosophy of safety, security and simplicity, a big selling point, according to Fornaciari, is the experience and expertise that is built into Gremlin’s DNA. “We’ve got fifteen years of combined expertise in this space” he says. “Being the experts on it and having built it 3 or 4 times already in different big companies, it sort of gave us this leg up to go out there in the world.”
But while Fornaciari is eager to assert Gremlin’s knowledge, there’s no trace of elitism – sharing knowledge is a core part of the product offering. “We actually built out customer success tooling so we can see if particular attacks fail for them we can actually proactively reach out and be like ‘hey we saw you were trying to do this, maybe you meant to do this’” Fornaciari explains.
Controlled chaos: chaos engineering and the scientific method
Control is central to Gremlin’s philosophy – it’s a combination of the team’s commitment to safety, security and simplicity. In fact, this element of control that distinguishes chaos engineering today, from what went before. Central to Gremlin’s mission to make chaos engineering accessible, is also redefining how it’s done. “If you’re familiar with the netflix chaos monkey mentality of randomly terminating services, well that’s a good start, but safety is really lacking. We talked more about this controlled chaos… this idea that you start fairly small with this small blast radius and then as you become more confident you grow it out and grow it out as opposed to just like ‘cool, let’s just chuck a grenade in here and see what happens.’”
Fornaciari goes on to describe this ‘controlled chaos’ in a surprising way. “It’s much more like the scientific method actually. Applying that method to your infrastructure and your reliability in general.” This approach is essential if you’re going to do chaos engineering well.
How to do chaos engineering effectively
When I ask Fornaciari how engineering teams and businesses can do chaos engineering well he emphasizes the importance of starting with a hypothesis: “You need to have a hypothesis that you’re trying to prove.Throwing random chaos at something is fine – it’ll sort of surface some of the unknown unknowns for you. But really having a hypothesis that you’re trying to prove is the best way to get value out of this [chaos engineering].”
If you’re going to take a scientific approach to testing your infrastructure using ‘chaos experiments’, managing scale is also incredibly important. Don’t run before you can walk is the message. “Keep it very small initially, then you start to grow the blast radius. You definitely want to make sure that you’re starting off with the smallest modicum that you can.”
Given the potential dangers of throwing metaphorical gremlins into your system, starting where your comfortable makes a lot of sense. “Start in staging, start where your comfortable, build your confidence. Make sure your system behaves well in front of non-customer facing traffic before you go out to the world.”
That said, Gremlin have had “some pretty bold customers” who go straight ahead and start running chaos experiments in production. “That was cool. It’s a little scary, but they were confident and they’ve been using Gremlin as part of their system ever since.”
Chaos engineering requires confidence and control
Ultimately, if chaos engineering is going to take off – as Fornaciari believes it will – engineers will need to be incredibly confident. That’s true on a number of levels. You need confidence that you’ll be able to handle a range of experiments and deploy them wisely. But you’ll also need confidence that you can manage the expectations of those in senior management.
It’s not hard to see the value of chaos engineering. As Fornaciari says “if you prevent one outage one time, you’ve saved that money to pay for the tool to make sure it doesn’t happen again.” But it might be hard to find time for it. It might be hard to get buy in and investment in the tools you need to do it.
Gremlin are certainly going to play an important part in helping engineers do that. But one of its biggest challenges – and perhaps one of its most noble missions too – is transforming a culture where people don’t really appreciate ‘pager pain’. If Fornaciari and Gremlin can help solve that, good luck to them.
You can follow Matt Fornaciari on Twitter: @callmeforni