Chaos Conf 2018 Recap: Chaos engineering hits maturity as community moves towards controlled experimentation

Conferences can sometimes be confusing. Even at the most professional and well-planned conferences, you sometimes just take a minute and think what's actually the point of this? Am I learning anything? Am I meant to be networking? Will anyone notice if I steal extra food for the journey home?

chaos-conf-2018-recap-chaos-engineering-hits-maturity-as-community-moves-towards-controlled-experimentation-img-0

Chaos Conf 2018 was different, however. It had a clear purpose: to take the first step in properly forging a chaos engineering community.

After almost a decade somewhat hidden in the corners of particularly innovative teams at Netflix and Amazon, chaos engineering might feel that its time has come. As software infrastructure becomes more complex, less monolithic, and as business and consumer demands expect more of the software systems that have become integral to the very functioning of life, resiliency has never been more important but more challenging to achieve.

chaos-conf-2018-recap-chaos-engineering-hits-maturity-as-community-moves-towards-controlled-experimentation-img-1

But while it feels like the right time for chaos engineering, it hasn't quite established itself in the mainstream. This is something the conference host, Gremlin, a platform that offers chaos engineering as a service, is acutely aware of. On the hand it's actively helping push chaos engineering into the hands of businesses, but on the other its growth and success, backed by millions of VC cash (and faith), depends upon chaos engineering becoming a mainstream discipline in the DevOps and SRE worlds.

It's perhaps this reason that the conference felt so important. It was, according to Gremlin, the first ever public chaos engineering conference. And while it was relatively small in the grand scheme of many of today's festival-esque conferences attended by thousands of delegates (Dreamforce, the Salesforce conference, was also running in San Francisco in the same week), the fact that the conference had quickly sold out all 350 of its tickets - with more hoping on waiting lists - indicates that this was an event that had been eagerly awaited.

And with some big names from the industry - notably Adrian Cockcroft from AWS and Jessie Frazelle from Microsoft - Chaos Conf had the air of an event that had outgrown its insider status before it had even began. The renovated cinema and bar in San Francisco's Mission District, complete with pinball machines upstairs, was the perfect container for a passionate community that had grown out of the clean corporate environs of Silicon Valley to embrace the chaotic mess that resembles modern software engineering.

Kolton Andrus sets out a vision for the future of Gremlin and chaos engineering

Chaos Conf was quick to deliver big news. They keynote speech, by Gremlin co-founder Kolton Andrus launched Gremlin's brand new Application Level Fault Injection (ALFI) feature, which makes it possible to run chaos experiments at an application level.

Andrus broke the news by building towards it with a story of the progression of chaos engineering. Starting with Chaos Monkey, the tool first developed by Netflix, and moving from infrastructure to network, he showed how, as chaos engineering has evolved, it requires and faciliates different levels of control and insight on how your software works.

"As we're maturing, the host level failures and the network level failures are necessary to building a robust and resilient system, but not sufficient. We need more - we need a finer granularity," Andrus explains.

This is where ALFI comes in. By allowing Gremlin users to inject failure at an application level, it allows them to have more control over the 'blast radius' of their chaos experiments.

The narrative Andrus was setting was clear, and would ultimately inform the ethos of the day - chaos engineering isn't just about chaos, it's about controlled experimentation to ensure resiliency. To do that requires a level of intelligence - technical and organizational - about how the various components of your software work, and how humans interact with them.

Adrian Cockcroft on the importance of historical context and domain knowledge

Adrian Cockcroft (@adrianco) VP at AWS followed Andrus' talk. In it he took the opportunity to set the broader context of chaos engineering, highlighting how tackling system failures is often a question of culture - how we approach system failure and think about our software.

chaos-conf-2018-recap-chaos-engineering-hits-maturity-as-community-moves-towards-controlled-experimentation-img-2

Developers love to learn things from first principles" he said. "But some historical context and domain knowledge can help illuminate the path and obstacles."

If this sounds like Cockcroft was about to stray into theoretical territory, he certainly didn't. He offered plenty of practical frameworks for thinking through potential failure.

But the talk wasn't theoretical - Cockcroft offered a taxonomy of failure that provides a useful framework for thinking through potential failure at every level.

He also touched on how he sees the future of resiliency evolving, focusing on:

Observability of systems

Epidemic failure modes

Automation and continuous chaos

The crucial point Cockcroft makes is that cloud is the big driver for chaos engineering. "As datacenters migrate to the cloud, fragile and manual disaster recovery will be replaced by chaos engineering" read one of his slides. But more than that, the cloud also paves the way for the future of the discipline, one where 'chaos' is simply an automated part of the test and deployment pipeline.

Selling chaos engineering to your boss

Kriss Rochefolle, DevOps engineer and author of one of the best selling DevOps books in French, delivered a short talk on how engineers can sell chaos to their boss.

He takes on the assumption that a rational proposal, informed by ROI is the best way to sell chaos engineering. He suggests instead that engineers need to play into emotions, and presenting chaos engineer as a method for tackling and minimizing the fear of (inevitable failure.

Follow Kriss on Twitter: @crochefolle

Walmart and chaos engineering

Vilas Veraraghavan, the Director of Engineering was keen to clarify that Walmart doesn't practice chaos. Rather it practices resiliency - chaos engineering is simply a method the organization uses to achieve that.

It was particularly useful to note the entire process that Vilas' team adopts when it comes to resiliency, which has largely developed out of Vilas' own work building his team from scratch.

You can learn more about how Walmart is using chaos engineering for software resiliency in this post.

Twitter's Ronnie Chen on diving and planning for failure

Ronnie Chen (@rondoftw) is an engineering manager at Twitter. But she didn't talk about Twitter. In fact, she didn't even talk about engineering. Instead she spoke about her experience as a technical diver.

By talking about her experiences, Ronnie was able to make a number of vital points about how to manage and tackle failure as a team. With mortality rates so high in diving, it's a good example of the relationship between complexity and risk.

Chen made the point that things don't fail because of a single catalyst. Instead, failures - particularly fatal ones - happen because of a 'failure cascade'. Chen never makes the link explicit, but the comparison is clear - the ultimate outcome (ie. success or failure) is impacted by a whole range of situational and behavioral factors that we can't afford to ignore.

chaos-conf-2018-recap-chaos-engineering-hits-maturity-as-community-moves-towards-controlled-experimentation-img-3

Chen also made the point that, in diving, inexperienced people should be at the front of an expedition.

"If you're inexperienced people are leading, they're learning and growing, and being able to operate with a safety net... when you do this, all kinds of hidden dependencies reveal themselves... every undocumented assumption, every piece of ancient team lore that you didn't even know you were relying on, comes to light."

Charity Majors on the importance of observability

Charity Majors (@mipsytipsy), CEO of Honeycomb, talked in detail about the key differences between monitoring and observability. As with other talks, context was important: a world where architectural complexity has grown rapidly in the space of a decade.

Majors made the point that this increase in complexity has taken us from having known unknowns in our architectures, to many more unknown unknowns in a distributed system. This means that monitoring is dead - it simply isn't sophisticated enough to deal with the complexities and dependencies within a distributed system. Observability, meanwhile, allows you to to understand "what's happening in your systems just by observing it from the outside." Put simply, it lets you understand how your software is functioning from your perspective - almost turning it inside out.

Majors then linked the concept to observability to the broader philosophy of chaos engineering - echoing some of the points raised by Adrian Cockcroft in his keynote.

But this was her key takeaway:

"Software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world."

This leads to one conclusion - the importance of testing in production. "Accept no substitute."

Tammy Butow and Ana Medina on making an impact

Tammy Butow (@tammybutow) and Ana Medina (@Ana_M_Medina) from Gremlin took us through how to put chaos engineering into practice - from integrating it into your organizational culture to some practical tests you can run.

One of the best examples of putting chaos into practice is Gremlin's concept of 'Failure Fridays', in which chaos testing becomes a valuable step in the product development process, to dogfood it and test out how a customer experiences it.

Another way which Tammy and Ana suggested chaos engineering can be used was as a way of testing out new versions of technologies before you properly upgrade in production.

To end, their talk, they demo'd a chaos battle between EKS (Kubernetes on AWS) and AKS (Kubernetes on Azure), doing an app container attack, a packet loss attack and a region failover attack.

Jessie Frazelle on how containers can empower experimentation

Jessie Frazelle (@jessfraz) didn't actually talk that much about chaos engineering. However, like Ronnie Chen's talk, chaos engineering seeped through what she said about bugs and containers.

Bugs, for Frazelle, are a way of exploring how things work, and how different parts of a software infrastructure interact with each other:

"Bugs are like my favorite thing... some people really hate when they get one of those bugs that turns out to be a rabbit hole and your kind of debugging it until the end of time... while debugging those bugs I hate them but afterwards, I'm like, that was crazy!"

This was essentially an endorsement of the core concept of chaos engineering - injecting bugs into your software to understand how it reacts.

Jessie then went on to talk about containers, joking that they're NOT REAL. This is because they're made up of numerous different component parts, like Cgroups, namespaces, and LSMs. She contrasted containers with Virtual machines, zones and jails, which are 'first class concepts' - in other worlds, real things (Jessie wrote about this in detail last year in this blog post).

chaos-conf-2018-recap-chaos-engineering-hits-maturity-as-community-moves-towards-controlled-experimentation-img-5

In practice what this means is that whereas containers are like Lego pieces, VMs, zones, and jails are like a pre-assembled lego set that you don't need to play around with in the same way.

From this perspective, it's easy to see how containers are relevant to chaos engineering - they empower a level of experimentation that you simply don't have with other virtualization technologies. "The box says to build the death star. But you can build whatever you want."

The chaos ends...

Chaos Conf was undoubtedly a huge success, and a lot of credit has to go to Gremlin for organizing the conference. It's clear that the team care a lot about the chaos engineering community and want it to expand in a way that transcends the success of the Gremlin platform.

While chaos engineering might not feel relevant to a lot of people at the moment, it's only a matter of time that it's impact will be felt. That doesn't mean that everyone will suddenly become a chaos engineer by July 2019, but the cultural ripples will likely be felt across the software engineering landscape.

But without Chaos Conf, it would be difficult to see chaos engineering growing as a discipline or set of practices. By sharing ideas and learning how other people work, a more coherent picture of chaos engineering started to emerge, one that can quickly make an impact in ways people wouldn't have expected six months ago.

You can watch videos of all the talks from Chaos Conf 2018 on YouTube.