
As software systems become more distributed, reliability and resiliency have become increasingly important. This is one of the reasons why we’ve seen the emergence of chaos engineering: unreliability causes downtime, and downtime costs money.

The impact of downtime is particularly significant for huge organizations that depend on the resilience and reliability of their platforms and applications. Take Uber: not only does the simplicity of the user experience hide its astonishing complexity, but the company also has to ensure it can manage that complexity in a way that’s reliable. A ride-hailing app couldn’t be anywhere near as successful as Uber if it were unreliable, even with just 1% downtime.

Building resilient software is difficult

But actually building resilient systems is difficult. In the last podcast episode, with Yuri Shkuro, we saw how Uber uses distributed tracing to build more observable systems, which can help improve reliability and resiliency. In this week’s podcast we’re diving even deeper into resiliency with Vilas Veeraraghavan, who’s Director of Engineering at Walmart Labs.

Vilas has experience at Netflix, the company where chaos engineering originated, but at Walmart, he’s been playing a central role in bringing a more evolved version of chaos engineering – which Vilas calls resiliency engineering – to the organization.

In this episode we discuss:

  • Whether chaos engineering and resiliency engineering are for everyone
  • Cultural challenges
  • How to get buy-in
  • Getting tooling right

 

“You do not want to get up in the middle of the night, get on the call with the VP of engineering, and blurt out saying I have no idea what happened. Your answer should be I know exactly what happened because we have tested this exact scenario multiple times. We developed a recipe for it, and here is what we can do… that gives you, as an engineer, the power to be able to stand up and say I know exactly what’s going on, I’ll fix it, don’t worry, we’re not going to cause an outage.”

Co-editor of the Packt Hub. Interested in politics, tech culture, and how software and business are changing each other.