
10x Engineer: Design software for fault tolerance

Saturday Feb 27, 2021 | Series 10x Engineer

The focus is often on resiliency around dependencies or exception handling; in other words, on what happens when someone else's code is not working as it should. It is just as important to look at how to deal with unexpected issues in your own code, and how to make sure you can recover from breaking scenarios that you did not foresee. We'll focus on some patterns that help you reduce the impact of those errors and allow you to focus on a forward fix instead of rollbacks.

Those who have experienced a couple of production issues can probably tell you that the most stressful part is not knowing whether you can recover from the error or contain the fallout. So if you can put your application in a state where you don't have to worry about things getting worse, you are already a step ahead.

So how do we go about thinking this way in our code? It all revolves around the 'fail fast' principle and making sure you can replay those messages or requests. Ideally you'll also already have the logging and/or metrics in place to pinpoint the issue. Still, if you don't, being able to replay what went wrong without causing issues in the rest of the system ensures that you can add logging after the fact.

Feel calmer already? Good. Let's have a closer look at some patterns we can apply to make sure we're in good shape for our next issue.


Queues

Queuing systems can make things a lot easier. They often offer, or allow you to set up, a failure queue (or topic; terminology depends on the framework) or something similar. On the failure queue you park any message or event that cannot be processed and needs manual intervention or arbitration. If we add a mechanism to requeue those messages, we can act on any failures that may happen in the future.

Now we just need to make sure our code also reacts to those failures. If the framework of your choice does not provide this already, it can be as simple as the following.

public void queueListener(Event event) {
    try {
        // handle event
    } catch (Exception e) {
        // log - always log exceptions!
        // put message on failure queue/topic
    }
}
With this we have our safety net, and it is time to leverage it in the rest of our flow. We're 'automagically' safe from a lot of big issues: database and REST call exceptions that we did not anticipate will already end up in our failure queue.

Note: I try to only handle exceptions at my application boundaries, e.g. REST endpoints and calls to libraries or other services. That leaves me free to throw exceptions when something exceptional happens and I want to stop processing.

The danger now lies in the choices we make within the message processing. Problems often arise when we make assumptions about allowed default values, or when we choose a fallback once we leave the happy flow. It might be better to stop processing and throw an exception (e.g. IllegalArgumentException). That way we're sure we end up in our safety net and will be able to repair things later, instead of ending up with corrupt data in some downstream system.

In contrast, if we choose default values or fallbacks, we may end up with corrupt data that we won't be alerted about until months after the fact, while our systems have looked stable all that time.
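To make the point concrete, here is a minimal sketch; the PaymentProcessor and Payment names are made up for illustration. Rather than guessing a default currency, we throw an IllegalArgumentException so the message lands on the failure queue instead of corrupting data downstream.

```java
public class PaymentProcessor {
    // Hypothetical message type, just for illustration.
    public static class Payment {
        final String currency;
        final long amountCents;
        public Payment(String currency, long amountCents) {
            this.currency = currency;
            this.amountCents = amountCents;
        }
    }

    public String process(Payment p) {
        // Refuse to guess a default currency: throwing here routes the
        // message to our failure queue, where we can repair and requeue it.
        if (p.currency == null || p.currency.isEmpty()) {
            throw new IllegalArgumentException("payment without currency");
        }
        return p.amountCents + " " + p.currency;
    }
}
```

The key choice is that the exceptional case is loud and recoverable, rather than silently producing a "valid looking" record.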


Endpoints

Endpoints, REST or otherwise, require a bit more thought. If you can make them idempotent, your consumers can simply repeat the call without issues.
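One common way to get idempotency is a client-supplied request id. A small sketch of the idea, with an in-memory map standing in for whatever persistence you'd really use (all names here are hypothetical): repeats of the same id return the stored response without re-running the side effect.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent handler keyed on a request id.
// The store and method names are assumptions, not a framework API.
public class IdempotentEndpoint {
    private final Map<String, String> responses = new ConcurrentHashMap<>();
    private int sideEffects = 0; // stands in for e.g. a database write

    public String handle(String requestId, String payload) {
        // computeIfAbsent runs the side effect at most once per request id,
        // so consumers can safely retry the same call after a failure.
        return responses.computeIfAbsent(requestId, id -> {
            sideEffects++;
            return "processed:" + payload;
        });
    }

    public int sideEffectCount() { return sideEffects; }
}
```

In a real system the response store would need to be durable and shared across instances, but the retry contract for the consumer stays the same.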

A lot of endpoints, however, will have side effects, or your consumers won't be able to resend their requests. Think of the poor end user! So we'll have to engineer our way out of it.


Transactions

If possible, the simplest way to be fault tolerant is to use a transaction and roll everything back once we hit an exception. As with queues, we then have to make sure we throw our own exceptions and stop processing instead of making assumptions in our code.

Do note that if there is a lot of processing, it can prolong the transaction and hurt performance once you try to scale up.
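A simplified in-memory illustration of the rollback idea; in a real service this would be a database transaction (e.g. JDBC commit/rollback), and the class and method names are made up. Any exception restores the previous state before propagating, so a retry starts clean.

```java
import java.util.ArrayList;
import java.util.List;

// All-or-nothing handling: either every update is applied, or none are.
public class TransactionalHandler {
    private final List<String> store = new ArrayList<>();

    public void handle(List<String> updates) {
        List<String> snapshot = new ArrayList<>(store);
        try {
            for (String u : updates) {
                if (u == null) {
                    // Fail fast instead of writing a partial result.
                    throw new IllegalArgumentException("null update");
                }
                store.add(u);
            }
        } catch (RuntimeException e) {
            // Roll back to the snapshot, then rethrow so the caller
            // (or our failure queue) sees the error.
            store.clear();
            store.addAll(snapshot);
            throw e;
        }
    }

    public List<String> store() { return store; }
}
```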

Internal queue

If transactions are not an option and we can't let the consumer deal with it, I try to see if I can offload the request to a queue. If we can make the processing asynchronous, it gives us an option to deal with any issues later and leverage the benefits we looked at earlier.
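A minimal sketch of that offloading with a bounded in-memory queue; the names are illustrative, and in production you'd typically use a real broker so messages survive a crash. The endpoint only enqueues and acknowledges, and a worker drains the queue asynchronously.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The endpoint accepts quickly; actual processing happens later.
public class OffloadingEndpoint {
    private final BlockingQueue<String> work = new ArrayBlockingQueue<>(100);

    public String submit(String request) {
        // offer() applies back-pressure instead of blocking the caller.
        if (!work.offer(request)) {
            return "rejected";
        }
        return "accepted";
    }

    // A worker thread would normally block with take(); poll() keeps
    // this sketch simple (it returns null when the queue is empty).
    public String takeNext() {
        return work.poll();
    }
}
```

Failures during the asynchronous processing can then flow into the failure queue from the earlier section instead of surfacing to the end user.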

Event sourcing

A variant of the internal queue is event sourcing. Storing the (raw) events separately from the views on that data means we can recalculate those views later if we find we made mistakes in our assumptions.

Both require that we can live with eventual consistency in our data. In my opinion, though, getting to that state has many upsides, ranging from easier management of downtime to operational excellence to scalability. That makes the extra engineering complexity worth it to me.

Wrapping it up

Now all we need to do is be careful about assumptions in processing: instead of defaulting, simply throw errors. Then we log and alert on those to make sure we find out as soon as possible. Any if or switch statement is a point where things can go wrong, not to mention data issues from your dependencies.

In a sufficiently complex system it becomes hard, or even impossible, to predict everything coming into your service. At that point you want to be able to rely on the safety net you put in place. With the right alerts and/or dashboards, you'll spot non-happy-flow issues as soon as they happen and can start on a forward fix soon after. Compare that to having to do damage control after an issue has gone unnoticed because failures were either masked by defaults or simply dropped, causing issues in your dependent services.

Enjoyed this? Read more in the 10x Engineer series.