Posted on 13 January 2021, updated on 6 August 2021.
Error management and program resilience is a common concept in software engineering. When a function returns an error, you have to properly handle it to prevent your app from crashing and your service from being disrupted.
However, errors do not only happen inside of a program and their causes can be very diverse, going from network issues to hardware failure. At Padok, we manage highly available Kubernetes cluster for our clients, so handling these issues is part of our daily hassle. This article will show you a way to avoid such problems.
Errors in an IT system happen all the time, to everyone. Yes, everyone, even tech giants such as Google, you can see there were plenty incidents on the status summary page. There are countless different root causes to these problems, and countless ways to fix them or to prevent them from happening.
This article will not tell you how to build an IT that magically never crashes. It will merely explain a specific concept, that is Leader Election, which is used as a preventive measure to allow systems to heal from failures. It is an introduction to building highly available resilient systems.
Leader Election is a technique that is used to allow distributed systems to come to a global agreement, and for each component of the system to work towards achieving the same goal.
To illustrate this, let's say you have a database running on a dedicated server. If a client makes a query to this database, it replies with the value it has in store. There is a single source of truth to guarantee the value returned to the client is correct, and that is the single database. However, in this scenario, the system does not fault-tolerant at all. If for example a hard drive breaks down, or a network cable gets ripped out (seriously, it can happen quite easily by accident), the client will completely lose access to the database until the problem is fixed.
In a professional environment, downtime has a direct impact on business revenue as it deteriorates the quality of service you provide your customers. To prevent that from happening, you can spin up 2 other servers on which your database also runs, that way if one is lost, the two others are still available. Of course, there is still a risk that all the servers crash simultaneously, but the risk is much lower than with a single server.
Now imagine a client make a request to your database. Which of the three instances should answer? Compared to the single-server setup, you don't have a single source of truth anymore. That is the problem of "Distributed Consensus".
That is where Leader Election comes in. The idea is that one of our nodes will be elected as Leader - or primary node - of our system, and the others are Followers - or secondary nodes.
The Leader will act as the single source of truth in our distributed system. The Followers will be on standby, ready to take over as Leaders themselves in case the current Leader is lost. They perform periodic health checks over the network to check if the other nodes are still up, and if the followers cannot connect to the master anymore they communicate with each other to decide on a new leader.
What if half the nodes in a system lose connection to the other half? In this case, if both halves trigger a leader election, we would end up with two leaders, right? This is called a "split brain" situation. If such a situation were to occur, our system would not have a single source of truth anymore, which could lead to unexpected behavior and even corrupted data. To prevent this from happening, elections require more than half of the cluster to participate. This means that in a situation where the nodes are separated groups of exactly half the cluster, the current leader is kept until the network partition is healed. No election is triggered, as it would lead to a split-brain.
Great theory, how about practice?
In practice, Leader Election is used in a number of IT systems. At Padok, we use Kubernetes a lot in our projects. A good example can be found in a highly available Kubernetes cluster, where the components of the Kubernetes Control Plane (
etcd...) are replicated and use leader election to ensure they function consistently.
Setting up a resilient system is a great engineering challenge, but testing it is a whole different one. If you're interested in what can be done, we have a couple of articles about Chaos Engineering and its implementation in Kubernetes, check them out! Also, if you want to find out about the inner workings of a popular Leader Election algorithm, which is used in highly available etcd clusters, for example, check out The Secret Lives Of Data, a visual explanation of the algorithm.