Transparency is underrated in the tech industry. But as software systems grow in complexity and their relationship with the real world becomes increasingly fraught, it nevertheless remains a value worth fighting for.
But to effectively fight for it, it’s essential to remember that transparency is a technological issue, not just a communication one. Decisions about how software is built and why it’s built in the way that it is lie at the heart of what it means to work in software engineering. Indeed, the industry is in trouble if we can’t see just how important those questions are in relation to everything from system reliability to our collective mental health.
Observability, transparency, and humility
One term has recently emerged as a potential solution to these challenges: observability (or o11y as it’s known in the community). This is a word that has been around for some time, but it’s starting to find real purchase in the infrastructure engineering world. There are many reasons for this, but a good deal of credit needs to go to observability platform Honeycomb and its CEO Charity Majors.
Majors has been a passionate advocate for observability for years. You might even say Honeycomb evolved from that passion and her genuine belief that there is a better way for software engineers to work.
With a career history spanning Parse and Facebook (who acquired Parse in 2011), Majors is well placed to understand, diagnose, and solve the challenges the software industry faces in terms of managing and maintaining complex distributed systems designed to work at scale.
“It’s way easier to build a complex system than it is to run one or to understand one,” she told me when I spoke to her in January. “We’re unleashing all these poorly understood complex systems on the world, and later having to scramble to make sense of it.” Majors is talking primarily about her work as a systems engineer, but it’s clear (to me at least) that this is true in lots of ways across tech, from the reliability of mobile apps to the accuracy of algorithms.
And ultimately, impenetrable complexity can be damaging. Unreliable systems, after all, cost money.
The first step, Majors suggests, to counteracting the challenges of distributed systems, is an acceptance of a certain degree of impotence. We need humility.
She talks of “a shift from an era when you could feel like your systems were up and working to one where you have to be comfortable with the fact that it never is.” While this can be “uncomfortable and unsettling for people in the beginning,” in reality it’s a positive step. It moves us towards a world where we build better software with better processes. And, most importantly, it cultivates more respect for people on all sides – engineers and users.
Charity Majors’ (personal) history of observability
Observability is central to Charity Majors’ and Honeycomb’s purpose. But it isn’t a straightforward concept, and it’s also one that has drawn considerable debate in recent months.
Ironically, although the term is all about clarity, it has been mired in confusion, with the waters of its specific meaning being more than a little muddied.
“There are a lot of people in this space who are still invested in ‘oh observability is a generic synonym for telemetry,’” Majors complains. However, she believes that “engineers are hungry for more technical terminology,” because the feeling of having to deal with problems for which you are not equipped – quite literally – is not uncommon in today’s industry.
With all the debate around what observability is, and its importance to Honeycomb, Majors is keen to ensure its definition remains clear.
“When Honeycomb started up… observability was around as a term, but it was just being used as a generic synonym for telemetry… when we started… the hardest thing was trying to think about how to talk about it… because we knew what we were doing was different,” Majors explains.
Experimentation at Parse
The route to uncovering the very specific – but arguably more useful – definition of observability was through a period of sustained experimentation while at Parse. “Around the time we got acquired… I was coming to this horrifying realisation that we had built a system that was basically un-debuggable by some of the best engineers in the world.”
The key challenge for Parse was dealing with the scale of mobile applications. Parse customers would tell Majors and her team that the service was down for them, underlining Parse’s monitoring tools’ lack of capability to pick up these tiny pockets of failure (“Behold my wall of dashboards! They’re all green, everything is fine!” Majors would tell them).
Scuba: The “butt-ugly” tool that formed the foundations of Honeycomb
The monitoring tools Parse was using at the time weren’t that helpful because they couldn’t deal with high-cardinality dimensions. Put simply, if you wanted to look at things on a granular, user by user basis, you just couldn’t do it.
“I tried everything out there… the one thing that helped us get a handle on this problem was this butt-ugly tool inside Facebook that was aggressively hostile to users and seemed very limited in its functionality, but did one thing really well… it let you slice and dice in real time on dimensions of arbitrarily high cardinality.” Despite its shortcomings, this set it apart from other monitoring tools which are “geared towards low cardinality dimensions,” Majors explains.
So, when you’re looking for “needles in a haystack,” as Parse engineers often were, the level of cardinality is essential. “It was like night and day. It went from hours, days, or impossible, to seconds. Maybe a minute.”
Observability: more than just a platform problem
This experience was significant for Majors and set the tone for Honeycomb. Her experience of working with Scuba became a frame for how she would approach all software problems. “It’s not even just about, oh the site is down, debug it, it’s, like, how do I decide what to build?”
It had, she says, “become core to how I experienced the world.”
Over the course of developing Honeycomb, it became clear to Majors that the problems the product was trying to address were actually deep: “a pure function of complexity.”
“Modern infrastructure has become so ephemeral you may not even have servers, and all of our services are far flung and loosely coupled. Some of them are someone else’s service,” Majors says.
“So I realise that everyone is running into this problem and they just don’t have the language for it. All we have is the language of monitoring and metrics when… this is inherently a distributed systems problem, and the reason we can’t fix them is because we don’t have distributed systems tools.”
Towards a definition of observability
Looking over my notes, I realised that we didn’t actually talk that much about the definition of observability. At first I was annoyed, but in reality this is probably a good thing. Observability, I realised, is only important insofar as it produces real world effects on how people work.
From the tools they use to the way they work together, observability, like other tech terms such as DevOps, only really have value to the extent that they are applied and used by engineers.
“Every single term is overloaded in the data space – every term has been used – and I was reading the dictionary definition of the word ‘observability’ and… it’s from control systems and it’s about how much can you understand and reason about the inner workings of these systems just by observing them from the outside. I was like oh fuck, that’s what we need to talk about!”
In reality, then, observability is a pretty simple concept: how much can you understand and reason about the inner workings of these systems just by observing them from the outside.
Read next: How Gremlin is making chaos engineering accessible [Interview]
But things, as you might expect, get complicated when you try and actually apply the concept. It isn’t easy. Indeed, that’s one of the reasons Majors is so passionate about Honeycomb.
Putting observability into practice
Although Majors is a passionate advocate for Honeycomb, and arguably one of its most valuable salespeople, she warns against the tendency for tooling to be viewed as silver bullet solutions to problems.
“A lot of people have been sold this magic spell idea which is that you don’t have to think about instrumentation or explaining your code back to yourself” Majors says. Erroneously, some people will think they “can just buy this tool for millions of dollars that will do it for you… it’s like write code, buy tool, get magic… and it doesn’t actually work, it never has and it never will.”
This means that while observability is undoubtedly a tooling issue, it’s just as much a cultural issue too. With this in mind, you definitely shouldn’t make the mistake of viewing Honeycomb as magic. “It asks more of you up front,” Majors says.
“There is no magic. At no point in the future are you going to get to just write code and lob it over the wall for ops to deal with. Those days are over, and anyone who is telling you anything else is selling you some very expensive magic beans. The systems of the future do require more of developers. They ask you to care a little bit more up front, in terms of instrumentation and operability, but over the lifetime of your code you reap that investment back hundreds or thousands of times over. We’re asking you, and helping you, make the changes you need to deal with the coming Armageddon of complexity.”
Observability is important, but it’s a means to an end: the end goal is to empower software engineers to practice software ownership. They need to own the full lifecycle of their code.
How transparency can improve accountability
Because Honeycomb demands more ‘up front’ from its users, this requires engineering teams to be transparent (with one another) and fully aligned.
Think of it this way: if there’s no transparency about what’s happening and why, and little accountability for making sure things do or don’t happen inside your software, Honeycomb is going to be pretty impotent.
We can only really get to this world when everyone starts to care properly about their code, and more specifically, how their code runs in production. “Code isn’t even interesting on its own… code is interesting when users interact with it,” Majors says. “it has to be in production.”
That’s all well and good (if a little idealistic), but Majors recognises there’s another problem we still need to contend with. “We have a very underdeveloped set of tools and best practices for software ownership in production… we’ve leaned on ops to… be just this like repository of intuition… so you can’t put a software engineer on call immediately and have them be productive…”
Observability as a force for developer well-being
This is obviously a problem that Honeycomb isn’t going to fix. And yes, while it’s a problem the Honeycomb marketing team would love to fix, it’s not just about Honeycomb’s profits. It’s also about people’s well being.
“You should want to have ownership. Ownership is empowering. Ownership gives you the power to fix the thing you know you need to fix and the power to do a good job… People who find ownership is something to be avoided – that’s a terrible sign of a toxic culture.”
The impact of this ‘toxic culture’ manifests itself in a number of ways. The first is the all too common issue of developer burnout.
This is because a working environment that doesn’t actively promote code ownership and accountability, leads to people having to work on code they don’t understand. They might, for example, be working in production environments they haven’t been trained to adequately work with.
“You can’t just ship your code and go home for the night and let ops deal with it,” Majors asserts. “If you ship a change and it does something weird, the best person to find that problem is you. You understand your intent, you have all the context loaded in your head. It might take you 10 minutes to find a problem that would take anyone else hours and hours.”
The second issue is one that many developers will recognise: the concept of the ‘superhero hacker’.
Read next: Don’t call us ninjas or rockstars, say developers
“I remember the days of like… something isn’t working, and we’d sit around just trying random things or guessing… it turns out that is incredibly inefficient. It leads to all these cultural distortions like the superhero hacker who does the best guessing. When you have good tooling, you don’t have to guess. You just look and see.”
Majors continues on this idea: “the source of truth about your systems can’t live in one guy’s head. It has to live in a tool where everyone has access to the same information about the system, one single source of truth… Otherwise you’re gonna have that one guy who can’t go on vacation ever.”
While a cynic might say well she would say that – it’s a product pitch for Honeycomb, they’d ultimately be missing the point. This is undoubtedly a serious issue that’s having a severe impact on our working lives. It leads directly to mental health problems and can even facilitate discrimination based on gender, race, age, and sexuality.
At first glance, that might seem like a stretch. But when you’re not empowered – by the right tools and the right support – you quite literally have less power. That makes it much easier for you to be marginalized or discriminated against.
Complexity stops us from challenging the status quo
The problem really lies with complexity. Complexity has a habit of entrenching problems. It stops us from challenging the status quo by virtue of the fact that we simply don’t know how to.
This is something Majors takes aim at. In particular, she criticises “the incorrect application of complexity to the business problem it solves.” She goes on to say that “when this happens, humans end up plugging the dikes with their thumbs in a continuous state of emergency. And that is terrible for us as humans.”
How Honeycomb practices what it preaches
Majors’ passion for what she believes is evidenced in Honeycomb’s ethos and values. It’s an organization that is quite deliberately doing things differently from both a technical and cultural perspective.
Majors tells me that when Honeycomb started, the intention was to build a team that didn’t rely upon superstar engineers:
“We made the very specific intention to not build a team of just super-senior expert engineers – we could have, they wanted to come work with us, but we wanted to hire some kids out of bootcamp, we wanted to hire a very well rounded team of lots of juniors and intermediates… This was a decision that I made for moral reasons, but I honestly didn’t know if I believed that it would be better, full disclosure – I honestly didn’t have full confidence that it would become the kind of high powered team that I felt so proud to work on earlier in my career. And yet… I am humbled to say this has been the most consistent high-performing engineering team that I have ever had the honor to work with. Because we empower them to collaborate and own the full lifecycle of their own code.”
Breaking open the black boxes that sustain internal power structures
This kind of workplace, where “the team is the unit you care about” is one that creates a positive and empowering environment, which is a vital foundation for a product like Honeycomb. In fact, the relationship between the product and the way the team works behind it is almost mimetic, as if one reflects the other.
Majors says that “we’re baking” Honeycomb’s organizational culture “into the product in interesting ways.”
She says that what’s important isn’t just the question of “how do we teach people to use Honeycomb, but how do we teach people to feel safe and understand their giant sprawling distributed systems. How do we help them feel oriented? How do we even help them feel a sense of safety and security?”
Honeycomb is, according to Majors, like an “outsourced brain.” It’s a product that means you no longer need to worry about information about your software being locked in a single person’s brain, as that information should be available and accessible inside the product.
This gives individuals safety and security because it means that typical power structures, often based on experience or being “the guy who’s been there the longest” become weaker.
Black boxes might be mysterious but they’re also pretty powerful. With a product like Honeycomb, or, indeed, the principles of observability more broadly, that mystery begins to lift, and the black box becomes ineffective.
Honeycomb: building a better way of developing software and developing together
In this context, Liz Fong-Jones’ move to Honeycomb seems fitting.
Fong-Jones (who you can find on Twitter @lizthegrey) was a Staff SRE at Google and a high profile critic of the company over product ethics and discrimination. She announced her departure at the beginning of 2019 (in fact, Fong-Jones started at Honeycomb in the last week of February). By subsequently joining Honeycomb, she left an environment where power was being routinely exploited, for one where the redistribution of power is at the very center of the product vision.
Honeycomb is clearly a product and a company that offers solutions to problems far more extensive and important than it initially thought it would. Perhaps we’re now living in a world where the problems it’s trying to tackle are more profound than they first appear. You certainly wouldn’t want to bet against its success with Charity Majors at the helm.
Follow Charity Majors on Twitter: @mipsytipsy
Learn more about Honeycomb and observability at honeycomb.io. You can try Honeycomb for yourself with a free trial.