What if it all goes wrong? Part 2 - Fighting failure

In the previous post we got to know our enemy by familiarising ourselves with the sorts of issues we can expect to come up against while developing software systems. In this post we’ll start to look at some practical tools and techniques we can use to fight back against failure.

By far the easiest way to deal with failures is to prevent them from happening in the first place. This involves making decisions that touch on all steps of the development process, from the code itself, to the tools you use, to the culture and processes you have in place.

The key is to put safeguards in place as early in your process as possible, as issues caught earlier are easier to fix. This concept is so important in the context of testing that it has its own name: shift-left testing.

Crucially, you want to catch problems before they hit your live environment. In fact, when we talk about preventing issues, what we really mean is preventing them from reaching production. So, how do you actually do that?

Implementation

Let’s start by looking at what you can be doing in terms of the code itself.

Choose an appropriate architecture

When architecting a codebase it’s important to think about maintainability and flexibility. If you fail to give these areas enough thought at the start of a project, you might end up with a codebase that is fragile, hard to understand, and difficult to extend. Over time it will likely become harder and harder to make changes or add new features without breaking things, resulting in more of those pesky issues you’re trying to prevent.

As with most things, there is no “silver bullet” when it comes to writing maintainable and flexible software. Perhaps the most impactful thing you can do is to choose an architectural style that lends itself nicely to these qualities.

Here at Ghyston we’ve had a lot of success with the so-called “Vertical Slices” architecture, where code is organised around features as opposed to application layers. This style of architecture naturally gives you low coupling and high cohesion, which are a magic combination. For a more in-depth exploration see our blog post from Charles Rea on Architecting for maintainability through Vertical Slices.
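To make the idea a little more concrete, here is a minimal sketch of what a single slice might look like in TypeScript. Everything in it (the folder layout, `registerUser`, `UserStore`, and the validation rules) is a hypothetical example invented for illustration rather than code from a real project. The point is only that the request type, validation, and handler for one feature live side by side.

```typescript
// Hypothetical layout, organised by feature rather than by layer:
//
//   features/
//     registerUser/   <- request type, validation, and handler together
//     resetPassword/
//
// Below is what a single slice might contain, collapsed into one file.

// --- features/registerUser ---
type RegisterUserRequest = { email: string; displayName: string };
type RegisterUserResult =
  | { ok: true; userId: number }
  | { ok: false; errors: string[] };

// The slice depends only on the narrow storage interface it actually needs.
interface UserStore {
  emailExists(email: string): boolean;
  insert(email: string, displayName: string): number;
}

function validateRegisterUser(request: RegisterUserRequest, store: UserStore): string[] {
  const errors: string[] = [];
  if (request.email.indexOf("@") === -1) errors.push("email is not valid");
  if (request.displayName.trim() === "") errors.push("displayName is required");
  if (store.emailExists(request.email)) errors.push("email is already registered");
  return errors;
}

function registerUser(request: RegisterUserRequest, store: UserStore): RegisterUserResult {
  const errors = validateRegisterUser(request, store);
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, userId: store.insert(request.email, request.displayName) };
}
```

Because the slice only talks to the outside world through `UserStore`, a change to how users are registered should rarely need to touch code belonging to other features.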

Use static type-checking

If possible, you should aim to write your code in a typed language, or a language that supports some form of static type-checking. When writing new code this will help to catch any silly mistakes you might make, and when refactoring existing code it gives you much greater confidence that everything still works. It can also serve to make your code more readable, as type annotations express the developer’s intent.

At Ghyston we make heavy use of TypeScript as an alternative to writing plain JavaScript, and have found it an invaluable ally on large and complex codebases. Other languages may not have fully-fledged typed equivalents, but there are often static type-checking tools you can plumb in, such as mypy for Python.
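As a small illustration of the kind of mistake a type checker can catch, consider the sketch below. The `PaymentStatus` type and `describeStatus` function are hypothetical names made up for this example. The `never` assignment in the default branch is a common TypeScript idiom for asking the compiler to confirm that every case has been handled.

```typescript
// Hypothetical example: the names below are illustrative only.
type PaymentStatus =
  | { kind: "pending" }
  | { kind: "paid"; paidAt: Date }
  | { kind: "failed"; reason: string };

function describeStatus(status: PaymentStatus): string {
  switch (status.kind) {
    case "pending":
      return "Awaiting payment";
    case "paid":
      // paidAt is only accessible here because the compiler has narrowed the type.
      return `Paid on ${status.paidAt.toISOString().slice(0, 10)}`;
    case "failed":
      return `Failed: ${status.reason}`;
    default: {
      // If a new variant is added to PaymentStatus and not handled above,
      // this assignment stops compiling, so the omission is caught before runtime.
      const unhandled: never = status;
      return unhandled;
    }
  }
}

// The compiler would reject this call, because "refunded" is not a known kind:
// describeStatus({ kind: "refunded" });
```

In plain JavaScript the equivalent omission would only show up at runtime, most likely as an unexpected `undefined` somewhere downstream.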

Don’t reinvent the wheel

The best way to avoid issues in your code is to not write any code at all, and instead make use of existing open source libraries where possible. Well-established, actively maintained libraries are generally less likely to contain issues than code you write yourself, as they have been exercised by many more users and are often maintained by people who are experts in the relevant problem space. More importantly, you can get into a lot of trouble if you don’t know what you’re doing, especially where things like security are concerned (the phrase “don’t roll your own cryptography” is often thrown around, with good reason).

Write automated tests

A key weapon in preventing issues is an extensive and thorough suite of automated tests. Without a decent set of unit and integration tests you rely solely on manual testing to catch regressions in functionality and prevent bugs. Manual testing is an incredibly important part of the process (as I’ll discuss later), but nobody has the time to perform a full regression test on a large system every time a change is made. With a full suite of tests you have a constant safety net in which to catch and prevent issues before they make their way to production.

For complicated code, writing tests as you go along (loosely following Test-Driven Development principles) might actually be the only practical way to design a feature properly and check all the different edge cases that could occur. Tests are also great at documenting what code is meant to do, which makes it far easier to maintain in the future. In general, code without tests is incredibly hard to maintain - it’s for this reason that Michael Feathers, in his book “Working Effectively with Legacy Code”, defines legacy code simply as code without tests.
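The sketch below shows how a handful of unit tests can pin down edge cases and double as documentation. The `applyDiscount` function is a hypothetical example, and the `expectEqual` and `expectThrows` helpers are minimal stand-ins written here to keep the sketch self-contained. In a real project you would most likely use an established test framework instead.

```typescript
// Hypothetical function under test: applies a percentage discount to a price in pence.
function applyDiscount(pricePence: number, percent: number): number {
  if (!Number.isInteger(pricePence) || pricePence < 0) {
    throw new RangeError("pricePence must be a non-negative integer");
  }
  if (percent < 0 || percent > 100) {
    throw new RangeError("percent must be between 0 and 100");
  }
  return Math.round(pricePence * (1 - percent / 100));
}

// Minimal stand-ins for a test framework's assertions.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

function expectThrows(fn: () => unknown, label: string): void {
  try {
    fn();
  } catch {
    return; // An error was thrown, which is what we wanted.
  }
  throw new Error(`${label}: expected an error to be thrown`);
}

// Each label reads as a statement about how the code is meant to behave.
function runApplyDiscountTests(): void {
  expectEqual(applyDiscount(1000, 0), 1000, "no discount leaves the price unchanged");
  expectEqual(applyDiscount(1000, 25), 750, "25% off 1000 is 750");
  expectEqual(applyDiscount(1000, 100), 0, "a 100% discount makes the item free");
  expectEqual(applyDiscount(999, 50), 500, "fractional pence are rounded");
  expectThrows(() => applyDiscount(1000, 101), "discounts above 100% are rejected");
  expectThrows(() => applyDiscount(-1, 10), "negative prices are rejected");
}

runApplyDiscountTests();
```

Read top to bottom, the test labels describe the intended behaviour, including the boundary cases that are easy to forget when reading the implementation alone.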

Tooling

Let’s move slightly away from the actual code itself, and look at how making use of appropriate tooling in the development process is an excellent way to prevent issues, particularly via automation.

Integrate code quality tools

Some IDEs have built-in capabilities to detect and flag up code smells and potential mistakes - others even provide you with an automated “quick-fix” mechanism to clear them up in a couple of keystrokes. Looking beyond IDEs, there are also plenty of command line tools you can use to analyse your code and flag up issues with code quality. One that we use on all our TypeScript codebases here at Ghyston is ESLint, which will automatically find and fix certain issues for you.

These sorts of tools can usually be integrated into your IDE to give you direct feedback and functionality as you’re writing the code, and are easy to incorporate into your continuous integration process to automatically ensure code quality.

Automate as much as possible

Speaking of automation, you should look to automate processes wherever it’s appropriate. With a manual process it’s hard to guarantee that there won’t be issues - in particular humans are prone to making mistakes, or missing steps out. On the other hand, a process that is automated by way of some sort of script is testable, repeatable, and not subject to human error.

For example, you should automate your build and release process to ensure that you don’t accidentally introduce bugs if you miss out a step. Here at Ghyston we often express our build process in code with something like NUKE Build - this allows us to leverage all the power of a fully featured programming language, debug build failures locally, and maintain a full history of the build process in source control.
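To illustrate the general idea of a process expressed in code, here is a minimal sketch of a release pipeline as an ordered list of steps. This is not NUKE Build's API (NUKE is a C# tool), and the step names and empty step bodies are hypothetical placeholders. Steps are synchronous here purely to keep the example short, whereas a real script would likely run external commands asynchronously.

```typescript
// Hypothetical sketch: a release process as an ordered list of named steps.
type Step = { name: string; run: () => void };

type PipelineResult =
  | { ok: true; completed: string[] }
  | { ok: false; completed: string[]; failedStep: string; error: unknown };

function runPipeline(steps: Step[]): PipelineResult {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step.name);
    } catch (error) {
      // Stop at the first failure so later steps (such as publishing)
      // never run against a broken build.
      return { ok: false, completed, failedStep: step.name, error };
    }
  }
  return { ok: true, completed };
}

// Placeholder steps: a real script would invoke the compiler, test runner, and so on.
const releaseSteps: Step[] = [
  { name: "restore dependencies", run: () => {} },
  { name: "compile", run: () => {} },
  { name: "run tests", run: () => {} },
  { name: "publish artefact", run: () => {} },
];

const result = runPipeline(releaseSteps);
console.log(result.ok ? "Release succeeded" : `Release failed at: ${result.failedStep}`);
```

Because the steps and their order are written down in code, nobody can forget one, and the script itself can be reviewed, tested, and versioned like any other part of the codebase.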

Culture

We can’t rely entirely on computers to prevent issues - it’s essential that humans play their part too. This means having the right culture, and the right processes in place.

Code reviews

Code reviews are of particular importance in helping us catch issues early. The tools and techniques we’ve mentioned above are great for thinning the herd of potential issues, but there are some things that can only be spotted by a human. Some issues won’t be in the type system, or the structure of the code, but instead in the business logic itself.

Manual testing and QA

Another important step in preventing errors is to perform manual testing on everything that passes from development to production. Again, machines aren’t always capable of validating that things work as expected - sometimes only humans can do that. Developers can and should test their own work manually before sending it on, but it’s vital that there’s also some sort of QA process in place to catch overlooked issues and/or fundamental misunderstandings.

Summary

In this post we looked at some practical tools and techniques you can use to fight back against failures by preventing them where possible.

It’s worth noting that, although these methods are useful by themselves, none are completely effective in isolation. A good analogy is to think of each approach as a slice of Swiss cheese - each has some holes all the way through, but when the slices are stacked together the holes rarely line up, so a problem that slips through one layer is likely to be caught by the next. In other words, the greatest level of prevention is achieved when you apply several techniques in unison.

Aiming to prevent failures is a good start, but what if that’s not feasible? In the next post we’ll take a look at some tools and techniques for gracefully handling the failures we can anticipate but have no sensible way of stopping.