Every system eventually breaks. Every system needs fixing.
That doesn’t mean it was badly built in the first place. It is simply built from parts that are themselves fallible.
So, how do we go from there?
Do we pretend everything will run fine and act surprised every time it doesn’t?
Or do we take that fact into account and integrate that uncertainty into our engineering process?
In 2011, Netflix engineering teams chose the second option by releasing a piece of software named Chaos Monkey.
When run against your infrastructure, Chaos Monkey intentionally disables some of its components to test the reliability of the system and how it responds to the outage.
Netflix famously runs Chaos Monkey on its production environment.
Kubernetes pods were designed to be disposable. Replication controllers, liveness probes, update strategies, etc. are all mechanisms that take advantage of that fact.
That pod doesn’t respond to the HTTP health check request?
Just kill it.
That pod isn’t deployed with the latest app version?
Just kill it.
But things can run smoothly for a long time without any need to kill and replace pods.
Then the question arises: what would happen if that pod were to be killed?
It’s been there for so long that you’re not sure of the answer, and since nothing is wrong with it, you would rather not take any chances and just leave it alone.
This is where the Kubernetes tool named Kube-monkey comes in. It makes it possible to practice chaos engineering on a Kubernetes cluster.
Just like Chaos Monkey does with servers, Kube-monkey randomly kills pods in your cluster to test its resilience.
The idea behind Kube-monkey is to never end up in the situation described above. Pods are killed at random on a regular basis, and your system MUST be able to cope with their sudden disappearance if it is to run reliably in the long term.
I chose to install Kube-monkey via its Helm chart, but you can find instructions for installing it with Kubernetes manifests on the project's GitHub page.
values.yaml contains all the parameters you can tune for your deployment. Be sure to give it a glance before you run the installation.
You can always tune some values after installation with the helm upgrade command.
Once Kube-monkey is running on your cluster, you have to label your deployments to opt them in as (willing) victims of the monkey.
Here is the meaning of some of these labels:
kube-monkey/mtbf (mean time between failures): Specifies the mean number of days between the termination of two pods in that deployment. With a value of 2, for example, a pod is killed every two days on average.
NB: The termination of pods is only scheduled on weekdays. This is a common behaviour in chaos engineering. The idea is to have people on hand to fix any issues caused by the termination of infrastructure components.
kube-monkey/kill-mode + kube-monkey/kill-value: Together, these specify how many pods should be killed in the deployment. Here, only one pod will be killed. It is also possible to set a percentage of pods or even to kill them all.
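To make the opt-in settings described above more concrete, here is a rough Python sketch that builds a Deployment manifest carrying them and prints it as JSON. The keys kube-monkey/enabled and kube-monkey/identifier are not covered above; they reflect my reading of the project README and should be checked against your Kube-monkey version. The function names, the my-app deployment, and the nginx:1.25 image are purely hypothetical.

```python
import json


def monkey_labels(identifier, mtbf_days, kill_mode="fixed", kill_value=1):
    """Build the opt-in labels for one deployment.

    Key names are assumed from the kube-monkey README; label values
    must be strings in Kubernetes, hence the str() conversions.
    """
    return {
        "kube-monkey/enabled": "enabled",
        "kube-monkey/identifier": identifier,
        "kube-monkey/mtbf": str(mtbf_days),
        "kube-monkey/kill-mode": kill_mode,
        "kube-monkey/kill-value": str(kill_value),
    }


def deployment_manifest(name, image, replicas, labels):
    """Return a minimal apps/v1 Deployment as a plain dict."""
    # Assumption: the pod template only needs the enabled/identifier pair,
    # while the full set of labels lives on the Deployment itself.
    pod_labels = {
        "app": name,
        "kube-monkey/enabled": labels["kube-monkey/enabled"],
        "kube-monkey/identifier": labels["kube-monkey/identifier"],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name, **labels}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": pod_labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # One pod killed roughly every two days, matching the values discussed above.
    labels = monkey_labels("my-app", mtbf_days=2, kill_mode="fixed", kill_value=1)
    print(json.dumps(deployment_manifest("my-app", "nginx:1.25", 3, labels), indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to kubectl apply -f - on a test cluster.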
Kube-monkey in action
Every weekday during scheduling, Kube-monkey will:
- List the deployments that can be victims of the monkey
- Flip a biased coin to determine whether or not a pod should be killed today. The bias values come from the
- Calculate a random time for the termination to happen
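The daily scheduling steps above can be sketched as a small simulation. This is not Kube-monkey's actual code. It assumes the coin comes up "kill" with probability 1/mtbf and that the termination time is drawn uniformly from a 10:00 to 16:00 window; both are assumptions made for illustration, as are all the function names.

```python
import random
from datetime import date, datetime, time, timedelta


def should_kill_today(mtbf_days, rng):
    """Biased coin flip: True with probability 1/mtbf (assumed bias)."""
    return rng.random() < 1.0 / mtbf_days


def random_kill_time(day, start_hour, end_hour, rng):
    """Pick a uniformly random second in [start_hour, end_hour) on `day`."""
    window_seconds = (end_hour - start_hour) * 3600
    offset = rng.randrange(window_seconds)
    return datetime.combine(day, time(hour=start_hour)) + timedelta(seconds=offset)


def build_schedule(deployments, day, rng, start_hour=10, end_hour=16):
    """Return {deployment name: termination time} for the given day.

    `deployments` maps a deployment name to its labels.
    """
    if day.weekday() >= 5:  # Saturday (5) and Sunday (6): nothing is scheduled
        return {}
    schedule = {}
    for name, labels in deployments.items():
        # Step 1: only opted-in deployments are eligible.
        if labels.get("kube-monkey/enabled") != "enabled":
            continue
        # Step 2: flip the biased coin.
        mtbf = int(labels["kube-monkey/mtbf"])
        if should_kill_today(mtbf, rng):
            # Step 3: pick a random time for the termination.
            schedule[name] = random_kill_time(day, start_hour, end_hour, rng)
    return schedule


if __name__ == "__main__":
    deployments = {
        "frontend": {"kube-monkey/enabled": "enabled", "kube-monkey/mtbf": "2"},
        "database": {},  # not opted in, so never scheduled
    }
    for name, when in build_schedule(deployments, date.today(), random.Random()).items():
        print(f"{name}: termination scheduled at {when:%H:%M:%S}")
```

With an mtbf of 2, the frontend deployment in this sketch would be picked on about half of the weekdays, which is what makes the value a mean rather than a fixed interval.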
While it is not the ultimate tool to test the resilience of your cluster, Kube-monkey helps you fully embrace the Kubernetes way of life: pods should be disposable and should be disposed of.
One limitation is that it only tests the consequences of pod termination, whereas problems can come from a much wider range of causes, such as configuration errors or network outages.
Kube-monkey is still super easy to set up and I would definitely recommend giving it a try on your cluster.
Check out this article if you’re interested in other tools that can help you build a more resilient and production-ready Kubernetes cluster.