Key trends in software infrastructure in 2019: observability, chaos, and cloud complexity

Software infrastructure has, over the last decade or so, become a key concern for developers of all stripes. Long gone are narrowly defined job roles; thanks to DevOps, accountability for how code is now shared between teams on both development and deployment sides. For anyone that’s ever been involved in the messy frustration of internal code wars, this has been a welcome change.

But as developers who have traditionally sat higher up the software stack dive deeper into the mechanics of deploying and maintaining software, for those of us working in system administration, DevOps, SRE, and security (the list is endless, apologies if I’ve forgotten you), the rise of distributed systems only brings further challenges. Increased complexity not only opens up new points of failure and potential vulnerability, at a really basic level it makes understanding what’s actually going on difficult.

And, essentially, this is what it will mean to work in software delivery and maintenance in 2019. Understanding what’s happening, minimizing downtime, taking steps to mitigate security threats - it’s a cliche, but finding strategies to become more responsive rather than reactive will be vital.

Indeed, many responses to these kind of questions have emerged this year. Chaos engineering and observability, for example, have both been gaining traction within the SRE world, and are slowly beginning to make an impact beyond that particular job role.
But let’s take a deeper look at what is really going to matter in the world of software infrastructure and architecture in 2019.

Observability and the rise of the service mesh

Before we decide what to actually do, it’s essential to know what’s actually going on. That seems obvious, but with increasing architectural complexity, that’s getting harder. Observability is a term that’s being widely thrown around as a response to this - but it has been met with some cynicism.

For some developers, observability is just a sexed up way of talking about good old fashioned monitoring. But although the two concepts have a lot in common, observability is more of an approach, a design pattern maybe, rather than a specific activity.

This post from The New Stack explains the difference between monitoring and observability incredibly well. Observability is “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” which means observability is more a property of a system, rather than an activity.

There are a range of tools available to help you move towards better observability. Application management and logging tools like Splunk, Datadog, New Relic and Honeycomb can all be put to good use and are a good first step towards developing a more observable system.

Want to learn how to put monitoring tools to work? Check out some of these titles:

key-trends-in-software-infrastructure-in-2019-img-0 AWS Application Architecture and Management [Video]

Hands on Microservices Monitoring and Testing

Software Architecture with Spring 5.0

As well as those tools, if you’re working with containers, Kubernetes has some really useful features that can help you more effectively monitor your container deployments. In May, Google announced StackDriver Kubernetes Monitoring, which has seen much popularity across the community.

Master monitoring with Kubernetes. Explore these titles:

key-trends-in-software-infrastructure-in-2019-img-3

Google Cloud Platform Administration

Mastering Kubernetes

Kubernetes in 7 Days [Video]

But there’s something else emerging alongside observability which only appears to confirm it’s importance: that thing is the notion of a service mesh. The service mesh is essentially a tool that allows you to monitor all the various facets of your software infrastructure helping you to manage everything from performance to security to reliability.

There are a number of different options out there when it comes to service meshes - Istio, Linkerd, Conduit and Tetrate being the 4 definitive tools out there at the moment.

Learn more about service meshes inside these titles:

key-trends-in-software-infrastructure-in-2019-img-6

Microservices Development Cookbook

The Ultimate Openshift Bootcamp [Video]

Cloud Native Application Development with Java EE [Video]

Why is observability important?

Observability is important because it sets the foundations for many aspects of software management and design in various domains. Whether you’re an SRE or security engineer, having visibility on the way in which your software is working will be essential in 2019.

Chaos engineering

Observability lays the groundwork for many interesting new developments, chaos engineering being one of them. Based on the principle that modern, distributed software is inherently unreliable, chaos engineering ‘stress tests’ software systems.

The word ‘chaos’ is a bit of a misnomer. All testing and experimentation on your software should follow a rigorous and almost scientific structure.

Using something called chaos experiments - adding something unexpected into your system, or pulling a piece of it out like a game of Jenga - chaos engineering helps you to better understand the way it will act in various situations. In turn, this allows you to make the necessary changes that can help ensure resiliency.

Chaos engineering is particularly important today simply because so many people, indeed, so many things, depend on software to actually work. From an eCommerce site to a self driving car, if something isn’t working properly there could be terrible consequences.
It’s not hard to see how chaos engineering fits alongside something like observability.

To a certain extent, it’s really another way of achieving observability. By running chaos experiments, you can draw out issues that may not be visible in usual scenarios.

However, the caveat is that chaos engineering isn’t an easy thing to do. It requires a lot of confidence and engineering intelligence. Running experiments shouldn’t be done carelessly - in many ways, the word ‘chaos’ is a bit of a misnomer. All testing and experimentation on your software should follow a rigorous and almost scientific structure.

While chaos engineering isn’t straightforward, there are tools and platforms available to make it more manageable. Gremlin is perhaps the best example, offering what they describe as ‘resiliency-as-a-service’. But if you’re not ready to go in for a fully fledged platform, it’s worth looking at open source tools like Chaos Monkey and ChaosToolkit.

Want to learn how to put the principles of chaos engineering into practice? Check out this title:

key-trends-in-software-infrastructure-in-2019-img-9

Microservice Patterns and Best Practices

Learn the principles behind resiliency with these SRE titles:

Real-World SRE

Practical Site Reliability Engineering

Better integrated security and code testing

Both chaos engineering and observability point towards more testing. And this shouldn’t be surprising: testing is to be expected in a world where people are accountable for unpredictable systems.

But what’s particularly important is how testing is integrated. Whether it’s for security or simply performance, we’re gradually moving towards a world where testing is part of the build and deploy process, not completely isolated from it.

There are a diverse range of tools that all hint at this move. Archery, for example, is a tool designed for both developers and security testers to better identify and assess security vulnerabilities at various stages of the development lifecycle. With a useful dashboard, it neatly ties into the wider trend of observability. ArchUnit (sounds similar but completely unrelated) is a Java testing library that allows you to test a variety of different architectural components.

Similarly on the testing front, headless browsers continue to dominate. We’ve seen some of the major browsers bringing out headless browsers, which will no doubt delight many developers. Headless browsers allow developers to run front end tests on their code as if it were live and running in the browser. If this sounds a lot like PhantomJS, that’s because it is actually quite a bit like PhantomJS. However, headless browsers do make the testing process much faster.

Smarter software purchasing and the move to hybrid cloud

The key trends we’ve seen in software architecture are about better understanding your software. But this level of insight and understanding doesn’t matter if there’s no alignment between key decision makers and purchasers.

Whatever cloud architecture you have, strong leadership and stakeholder management are essential.

This can manifest itself in various ways. Essentially, it’s a symptom of decision makers being disconnected from engineers buried deep in their software. This is by no means a new problem, cloud coming to define just about every aspect of software, it’s now much easier for confusion to take hold.

The best thing about cloud is also the worst thing - the huge scope of opportunities it opens up. It makes decision making a minefield - which provider should we use? What parts of it do we need? What’s going to be most cost effective? Of course, with hybrid cloud, there's a clear way of meeting those issues. But it's by no means a silver bullet. Whatever cloud architecture you have, strong leadership and stakeholder management are essential.

This is something that ThoughtWorks references in its most recent edition of Radar (November 2018). Identifying two trends they call ‘bounded buy’ and ‘risk commensurate vendor strategy’ ThoughtWorks highlights how organizations can find their SaaS of choice shaping their strategy in its own image (bounded buy) or look to outsource business critical applications, functions or services. T

ThoughtWorks explains:

“This trade-off has become apparent as the major cloud providers have expanded their range of service offerings. For example, using AWS Secret Management Service can speed up initial development and has the benefit of ecosystem integration, but it will also add more inertia if you ever need to migrate to a different cloud provider than it would if you had implemented, for example, Vault”.

Relatedly, ThoughtWorks also identifies a problem with how organizations manage cost. In the report they discuss what they call ‘run cost as architecture fitness function’ which is really an elaborate way of saying - make sure you look at how much things cost.

So, for example, don’t use serverless blindly. While it might look like a cheap option for smaller projects, your costs could quickly spiral and leave you spending more than you would if you ran it on a typical cloud server.

Get to grips with hybrid cloud:

key-trends-in-software-infrastructure-in-2019-img-12

Hybrid Cloud for Architects

Building Hybrid Clouds with Azure Stack

Become an effective software and solutions architect in 2019:

AWS Certified Solutions Architect - Associate Guide

key-trends-in-software-infrastructure-in-2019-img-15

Architecting Cloud Computing Solutions

Hands-On Cloud Solutions with Azure

Software complexity needs are best communicated in a simple language: money

In practice, this takes us all the way back to the beginning - it’s simply the financial underbelly of observability. Performance, visibility, resilience - these matter because they directly impact the bottom line.

That might sound obvious, but if you’re trying to make the case, say, for implementing chaos engineering, or using a any other particular facet of a SaaS offering, communicating to other stakeholders in financial terms can give you buy-in and help to guarantee alignment.

If 2019 should be about anything, it’s getting closer to this fantasy of alignment. In the end, it will keep everyone happy - engineers and businesses