In an ideal world our software systems would work perfectly 100% of the time, but is it realistic to expect that?
We’re so used to technology these days that it can be easy to take things for granted, and feel surprised when things don’t work. But even things that seem simple, such as sending a tweet from your phone, rely on a series of complex systems and processes operating in perfect harmony in order for things to work properly. The bottom line is that software systems are complex, and where complexity is involved things are bound to go wrong despite our best efforts. The sad truth is that failures are inevitable.
So, what should we do? Should we just give up? Of course not. Realising and accepting that errors are bound to happen is the first step towards approaching software development with a view to building more resilient systems.
The aim of this blog series is to give you a broader awareness and understanding of the sorts of things that typically go wrong with software systems, some high-level strategies for building more robust applications, and some practical tools to take with you into your next software project.
In this first part we’ll shine a light on what can, and will, go wrong with software systems. Understanding the sorts of issues we can expect to face is an important first step, which the following quote from “The Art of War” sums up nicely:
“It is said that if you know your enemies and know yourself, you will not be imperiled in a hundred battles...”
If we’re to have any chance in the fight against failure, we need to know our enemy.
In order to achieve that, we’ll look at a few flavours of failure in the context of what they are and how they can potentially be dealt with. In particular, for a given failure it’s useful to assess whether it’s something we can prevent, handle, and/or mitigate. Before we dive in, it's worth clarifying what we mean by these terms.
I've already told you that failures are inevitable, so you may be slightly suspicious when I talk about preventing them. In practice it's impossible to stop failures altogether, but there are often steps you can take to greatly reduce the likelihood of them occurring - this is what we mean by prevention.
We may not be able to stop things from going wrong, but it's often possible for an application to continue functioning even when they do. In some cases, we can even design our systems to recover automatically. Handling is about catching errors and reacting to them sensibly, so that the whole thing doesn’t blow up at the first sign of failure.
Despite our best efforts, there will always be failures that we don’t manage to prevent or handle. In these situations the important thing is to act fast and minimise the damage done. Mitigation is about identifying problems that slip past us and doing as much as possible to put them right.
With these concepts in mind, let’s take a look at a few common types of failure. Mitigation is almost always an option, so we’ll focus mainly on whether prevention and/or handling are feasible.
If you’re familiar with the fallacies of distributed computing, you’ll know that you shouldn’t assume the network to be reliable. Short of inventing a new, infallible network communication technology and rolling it out to the whole world, there’s not a great deal we can do to prevent network errors. We can, however, build resiliency around any operation that involves a network call, to handle the failures when they inevitably occur.
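One common way to build that resiliency is to retry a failed call a few times, waiting a little longer between each attempt. The following is only a minimal sketch of the idea: `NetworkError`, `call_with_retries` and their parameters are hypothetical names invented for illustration, and in a real application you would catch whatever exception your network client actually raises.

```python
import time


class NetworkError(Exception):
    """Hypothetical stand-in for the error your network client raises."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on NetworkError with exponential backoff.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NetworkError:
            if attempt == max_attempts:
                # Out of attempts: re-raise and let the caller decide what to do.
                raise
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying only makes sense for operations that are safe to repeat. If a request might have succeeded on the server before the connection dropped, blindly sending it again could do more harm than good.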
No operation is instantaneous, and in many cases there will be a set duration of time after which an operation simply gives up and raises an error. It’s sometimes possible to prevent timeouts by upping timeout durations or resolving performance issues, but in many cases a timeout is just a special case of a network error. Regardless of the situation, timeouts can at least be anticipated and handled sensibly.
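To make "anticipated and handled sensibly" a little more concrete, here is a rough sketch that waits a limited time for an operation and falls back to a default value if it takes too long. The helper name `run_with_timeout` and its parameters are assumptions made up for this example. It uses Python's standard `concurrent.futures` module, and many client libraries offer their own timeout settings that would be a better fit in practice.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(operation, timeout_seconds, default=None):
    """Return operation(), or `default` if no result arrives in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # We stop waiting, but the operation itself is NOT cancelled:
        # it keeps running on its worker thread until it finishes.
        return default
    finally:
        # Do not block here waiting for a slow operation to complete.
        executor.shutdown(wait=False)
```

Note that giving up on the result is not the same as stopping the work. If the operation has side effects, they may still happen after the caller has moved on.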
As with the network, you should never assume an external service to be one hundred percent reliable. If you control the service in question, then there are obviously steps you can take to prevent failures from occurring, but more often than not it’ll be beyond your reach. Depending on how critical the external service is to your system, there are varying degrees to which it’s possible to gracefully handle outages and keep your application running.
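As an illustration of one such degree of graceful handling, the sketch below remembers the last good answer from an external service and serves it when the service is down. Everything here (`ServiceUnavailable`, `CachedFallback`, the `fetch` callable) is hypothetical, and serving stale data is only acceptable for some kinds of information, so treat it as one option rather than a recommendation.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when the external service cannot be reached."""


class CachedFallback:
    """Wrap a fetch function and fall back to the last known good value."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that talks to the external service
        self._last_good = {}     # most recent successful result per key

    def get(self, key, default=None):
        try:
            value = self._fetch(key)
        except ServiceUnavailable:
            # Service is down: serve stale data if we have any, else the default.
            return self._last_good.get(key, default)
        self._last_good[key] = value
        return value
```

For a less critical service, returning a default and carrying on may be perfectly fine. For a critical one, it may be better to fail loudly than to continue with stale or missing data.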
There are many situations involving data in which there’s a high chance of some failure occurring, such as when checking user input or parsing a response from an external system. In these situations it’s not the operation that should be treated as unreliable, but rather the data itself. Bad data can be prevented with appropriate validation at the point of creation, but if you don’t control the source of the data there’s not much you can do. By not trusting any data that isn’t generated by systems you control, you can at least aim to identify bad data and handle it gracefully in your own application.
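A small sketch of what not trusting external data might look like: parse the raw input, check every field you rely on, and raise one well-defined error that the rest of the application can handle. The `User` shape, the `InvalidPayload` error and the age limits are all assumptions invented for this example.

```python
import json
from dataclasses import dataclass


class InvalidPayload(ValueError):
    """Raised when incoming data does not have the shape we expect."""


@dataclass(frozen=True)
class User:
    name: str
    age: int


def parse_user(raw: str) -> User:
    """Turn an untrusted JSON string into a validated User, or raise InvalidPayload."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidPayload(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidPayload("expected a JSON object")

    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not name.strip():
        raise InvalidPayload("name must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 150:
        raise InvalidPayload("age must be an integer between 0 and 150")
    return User(name=name.strip(), age=age)
```

Because every problem surfaces as the same exception type, calling code needs only a single `except InvalidPayload` to reject the input and carry on.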
Code that compiles and runs perfectly without throwing any errors may still be broken, not in the sense that the code breaks, but in the sense that it doesn’t behave as intended. Developers are only human, and there’s always the chance they’ll misinterpret or misunderstand some business requirement. While these types of failure are almost impossible to avoid completely, there are still plenty of things you can do to prevent them from occurring as often. On the other hand, such failures are not well-defined and could crop up anywhere - as such they’re difficult to handle in any particularly sensible way.
Compilers do a good job of catching developer errors before any code even runs in production, but they’re not all-powerful, and some exceptions will only appear when the code actually gets executed. There are often overlooked corner cases or unusual circumstances that conspire to break code that works perfectly ninety-nine percent of the time. As with broken functionality, these types of failure are difficult to eradicate entirely, but appropriate quality checks can go a long way to preventing them. Again, these sorts of errors are by nature unpredictable, which makes handling them gracefully rather tricky.
As your code gets more complicated and the number of moving parts increases, it becomes more and more difficult to ensure that your system behaves in a consistent way, especially when things are happening concurrently. To some extent you can reason about your code and put safeguards in place (such as transactions) to prevent inconsistent states, but this won’t always be possible. Handling such failures in an automatic fashion is even harder, as they’re usually quite subtle and not easily corrected.
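To show how a transaction can act as a safeguard, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the helper functions are made up for illustration. The point is simply that the two updates either both take effect or neither does.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply, or neither does."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
```

If the transfer would overdraw the source account, the exception triggers a rollback and the balances are left exactly as they were, rather than in a half-updated state.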
An application that works well initially may over time begin to flag as the volume of data and/or usage scales up. Even if you anticipate such issues and work to prevent them, it’s often impossible to accurately predict the load a system will be subject to in production, and there are bound to be situations you can’t foresee. Performance failures are very general in nature, which makes it difficult to handle them with much precision.
In this post we familiarised ourselves with some of the many flavours of failure. While the list above is not exhaustive, it should give you a taste of the sorts of things that can go wrong, and get you thinking about the kinds of failure you need to plan for.
The key takeaway is that these sorts of failures can and will happen - sticking your head in the sand and hoping for the best is not going to change that. It may seem like a rather pessimistic viewpoint to take, but you should assume that everything that can go wrong will, at some point, go wrong.
So, now that we know our enemy, how can we prepare to meet them in battle? In the next few posts we’ll take a look at some tools and techniques you can apply to build resilient systems in practice, starting with what we can do to prevent failures.