Listen: Walmart Labs Director of Engineering Vilas Veeraraghavan talks to us about building for resiliency at one of the biggest retailers on the planet [Podcast]

As software systems become more distributed, reliability and resiliency have become more and more important. This is one of the reasons why we’ve seen the emergence of chaos engineering – unreliability causes downtime, and downtime costs money.

The impact of downtime is particularly significant for huge organizations that depend on the resilience and reliability of their platforms and applications. Take Uber – not only does the simplicity of the user experience hide its astonishing complexity, but the company also has to manage that complexity in a way that’s reliable. A ride-hailing app couldn’t be anywhere near as successful as Uber if it were unreliable, even if it had just 1% downtime.

Building resilient software is difficult

But actually building resilient systems is difficult. In the last podcast episode, Yuri Shkuro explained how Uber uses distributed tracing to build more observable systems, which can help improve reliability and resiliency. In this week’s podcast we’re diving even deeper into resiliency with Vilas Veeraraghavan, Director of Engineering at Walmart Labs.

Vilas has experience at Netflix, the company where chaos engineering originated, but at Walmart, he’s been playing a central role in bringing a more evolved version of chaos engineering – which Vilas calls resiliency engineering – to the organization.

In this episode we discuss:

  • Whether chaos engineering and resiliency engineering are for everyone
  • Cultural challenges
  • How to get buy-in
  • Getting tooling right

 

“You do not want to get up in the middle of the night, get on the call with the VP of engineering, and blurt out, saying, ‘I have no idea what happened.’ Your answer should be, ‘I know exactly what happened because we have tested this exact scenario multiple times. We developed a recipe for it, and here is what we can do…’ That gives you, as an engineer, the power to be able to stand up and say, ‘I know exactly what’s going on, I’ll fix it, don’t worry, we’re not going to cause an outage.’”

Richard Gall

Co-editor of the Packt Hub. Interested in politics, tech culture, and how software and business are changing each other.

