How to define a chaos engineering strategy?
In practice, chaos engineering means targeting one or more parts of the infrastructure under test and making them unavailable, for example by:
- Shutting down or restarting a VM
- Dropping TCP packets
- Deleting Kubernetes pods
The infrastructure must react to this loss and quickly return to its stable state. This requires automating the re-creation of resources, as Kubernetes does natively. It is this ability to regenerate itself, called resilience, that chaos engineering tests. As explained in the principles of chaos engineering, the approach rests on four principles:
Define a stable state
The stable state is the state in which the infrastructure delivers the expected service in the expected time. These expectation levels are defined by the SRE in the form of SLOs, for Service Level Objectives, and above all, they are measured by SLIs, for Service Level Indicators.
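To make the relationship between an SLO and an SLI concrete, here is a minimal sketch. The names (`Slo`, `availability_sli`, `is_steady`) and the figures are hypothetical and only illustrate the idea: the SLO is the target, the SLI is the measurement, and the stable state holds while the measurement meets the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target the service is expected to meet (hypothetical example values)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def is_steady(sli: float, slo: Slo) -> bool:
    """The system is in its stable state when the measured SLI meets the SLO."""
    return sli >= slo.target


# Hypothetical figures, as they might be read from a monitoring system.
slo = Slo(name="checkout availability", target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4f} steady={is_steady(sli, slo)}")
```

In a real setup the request counts would come from the monitoring stack rather than from constants.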
Monitoring is the means of defining the stable state of the infrastructure, then reaching and maintaining it. Several high-performance open-source monitoring tools are available. Below you can find an example of a Grafana dashboard representing the activity of an application.
This step is crucial before getting down to chaos engineering itself. Fortunately, this work is already the preserve of Site Reliability Engineers, who know how to identify the key indicators of an application's lifecycle.
Issue one or more hypotheses
Trust does not exclude control.
Defining hypotheses is what marks the entry into chaos engineering thinking. For example: "My infrastructure keeps its stable state after the loss of one Kubernetes node, of two nodes, or after a disk failure".
Phrased as a statement, the hypothesis establishes the desired level of confidence. As with risk assessment in cybersecurity, the chaos engineer ranks hypotheses according to the severity of the malfunction and the probability of it occurring.
Engineers have identified the quality levels they must reach to gain confidence in their products. The goal of the exercise is to build a platform that can withstand failure conditions occurring more often than they would in production.
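One possible way to rank hypotheses by severity and probability is a simple risk score, as in the sketch below. The 1-to-5 scales, the multiplication, and the scores given to each hypothesis are assumptions made for illustration, not a method prescribed by the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    severity: int     # 1 (minor) to 5 (critical), hypothetical scale
    probability: int  # 1 (rare) to 5 (frequent), hypothetical scale

    @property
    def risk(self) -> int:
        # Classic risk matrix score: severity multiplied by probability.
        return self.severity * self.probability


def prioritize(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Highest risk first; ties keep their original order (sorted is stable)."""
    return sorted(hypotheses, key=lambda h: h.risk, reverse=True)


# Illustrative scores only.
backlog = [
    Hypothesis("Stable state survives the loss of one Kubernetes node", 3, 4),
    Hypothesis("Stable state survives the loss of two Kubernetes nodes", 4, 2),
    Hypothesis("Stable state survives a disk failure", 5, 2),
]

for h in prioritize(backlog):
    print(h.risk, h.statement)
```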
Define experimental conditions
Once the hypotheses have been specified, experimental conditions must be defined in accordance with the infrastructure components to be tested. For example, the loss of a Kubernetes node translates into a planned shutdown of the VM in the VMware hypervisor, and a disk failure can be simulated by an I/O overload or by simply disconnecting the disk.
To apply these conditions, one must then turn to the appropriate chaos engineering tools. Any deployment can be considered fallible: from hardware failure to network saturation to malformed HTTP responses. Even AWS Lambda functions can be targeted (see the open-source chaos-lambda tool).
Study the deviation
What must be observed in the deviation is the restoration of the stable state and the time elapsed between the appearance of the malfunction and the return to normal.
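The time between the malfunction and the return to normal can be computed from the SLI samples collected by monitoring. The sketch below is one way to do it, under the assumption that samples are `(timestamp, value)` pairs sorted by time; the function name and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, SLI value)


def recovery_time(
    samples: Sequence[Sample], fault_at: float, slo_target: float
) -> Optional[float]:
    """Seconds from fault injection to the lasting return to the stable state.

    Returns 0.0 if the SLI never dropped below the target after the fault,
    and None if it is still below the target at the last sample.
    """
    after = [(t, v) for t, v in samples if t >= fault_at]
    last_bad = None
    for i, (_, value) in enumerate(after):
        if value < slo_target:
            last_bad = i
    if last_bad is None:
        return 0.0   # the fault never pushed the SLI below the target
    if last_bad == len(after) - 1:
        return None  # still degraded at the end of the observation window
    # First healthy sample after the last degraded one marks the recovery.
    return after[last_bad + 1][0] - fault_at


# Hypothetical samples: a fault injected at t=15s degrades the SLI for a while.
samples = [(0, 1.0), (10, 1.0), (20, 0.4), (30, 0.7), (40, 0.999), (50, 1.0)]
print(recovery_time(samples, fault_at=15, slo_target=0.99))
```

The precision of the result is bounded by the sampling interval, so a 10-second scrape period cannot tell a 2-second recovery from an 8-second one.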
In this chaos-mesh demo (a chaos engineering tool that I detail below), the Grafana graphs clearly show two quick self-healing events, but also a very slow return to a stable state. It is the sound monitoring put in place when defining the steady state that lets the indicators reveal relevant deviations.
The next step is to categorize these gaps, keeping in mind the following question: "How much do 1, 2, or 5 minutes of denial of service cost me?" The objective is to establish which ones are to be treated as areas for improvement and which ones can be accepted.
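The question about the cost of 1, 2, or 5 minutes of outage can be turned into a simple triage rule. In this sketch the cost per minute and the acceptable threshold are invented figures, and a linear cost model is an assumption; real costs often grow faster than linearly with the duration.

```python
def outage_cost(minutes: float, cost_per_minute: float) -> float:
    """Estimated cost of an outage, assuming a linear cost model."""
    return minutes * cost_per_minute


def categorize(minutes: float, cost_per_minute: float, acceptable_cost: float) -> str:
    """Return 'accept' when the outage stays within the cost the business tolerates."""
    if outage_cost(minutes, cost_per_minute) <= acceptable_cost:
        return "accept"
    return "improve"


# Hypothetical figures: 200 per minute of downtime, 500 tolerated per incident.
for minutes in (1, 2, 5):
    print(minutes, "min ->", categorize(minutes, cost_per_minute=200, acceptable_cost=500))
```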
"Okay, but how do I implement this?"
The range of chaos engineering tools has expanded in recent years. Most of them are created to be deployed on a Kubernetes cluster (the 'kubes' we talked about in the intro 😉 ). They implement the experimental conditions that chaos engineers have conceptualized.
All the tools I present below are listed on the CNCF website.
There are those that attack Kubernetes
chaoskube, which does not claim to be multi-purpose but focuses on the random deletion of Kubernetes pods, and does it perfectly.
chaos-mesh, whose scenarios are very simple but effective:
- pod/container kill or failure
- CPU/memory burn
- Kernel/Time chaos
- IO/Network chaos
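To picture the random pod deletion described for chaoskube above, here is a purely illustrative simulation of the selection step. It is not chaoskube's code and does not talk to a cluster: the pod dictionaries, the `protected` flag, and `pick_victim` are hypothetical, and nothing is actually deleted.

```python
import random
from typing import Optional, Sequence


def pick_victim(
    pods: Sequence[dict], namespace: str, rng: random.Random
) -> Optional[dict]:
    """Return one random eligible pod from the namespace, or None if there is none."""
    candidates = [
        pod for pod in pods
        if pod["namespace"] == namespace and not pod.get("protected", False)
    ]
    return rng.choice(candidates) if candidates else None


# Hypothetical inventory; a real tool would list pods through the Kubernetes API.
pods = [
    {"name": "web-0", "namespace": "demo"},
    {"name": "web-1", "namespace": "demo"},
    {"name": "db-0", "namespace": "demo", "protected": True},
    {"name": "coredns-abc", "namespace": "kube-system"},
]

victim = pick_victim(pods, namespace="demo", rng=random.Random())
print("would delete:", victim["name"] if victim else "nothing")
```

Restricting the candidates by namespace and by an opt-out flag is what keeps the blast radius of such an experiment under control.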
There are the Swiss Army knives
Litmus, which offers a catalog of scenarios called ChaosHub. It ships some essential scenarios and lets the community contribute by publishing their own, which quickly enlarges the catalog.
chaosblade, which capitalizes on the practices of the Chinese cloud provider Alibaba. It can attack basic resources such as CPU, memory, and disks, as well as Java or C++ applications, Docker containers, and Kubernetes objects.
There are guided tools
PowerfulSeal, which tackles Kubernetes, OpenStack, AWS, Azure, and GCP, connects easily to Prometheus and Datadog to analyze its activity, and above all can be launched in standalone mode! By relying on pre-defined experimental conditions, this lets you focus on analyzing the results and searching for fallible parts.
ChaosToolkit, an open-source API designed to be simple enough to ease the adoption of chaos engineering, with experimental conditions written in JSON format.
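As an illustration of experimental conditions written in JSON, the sketch below builds a small experiment as a Python dictionary and serializes it. The field names follow the Chaos Toolkit experiment format as I understand it, and should be checked against the official reference; the URL, the pod name, the namespace, and the `to_json` helper are placeholders.

```python
import json

# Hypothetical experiment: the service must keep answering while one pod is deleted.
experiment = {
    "version": "1.0.0",
    "title": "The service stays available when one pod is deleted",
    "description": "Illustrative experiment; names and URLs are placeholders.",
    "steady-state-hypothesis": {
        "title": "The health endpoint answers 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://my-app.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "delete-one-pod",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": "delete pod my-app-0 --namespace demo",
            },
            "pauses": {"after": 10},  # seconds to wait before re-checking
        }
    ],
    "rollbacks": [],
}


def to_json(exp: dict) -> str:
    """Serialize the experiment so it can be saved as experiment.json."""
    return json.dumps(exp, indent=2)


print(to_json(experiment))
```

Saved to a file, such an experiment would typically be executed with `chaos run experiment.json`, which checks the steady-state hypothesis before and after the method.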
Pros and Cons
The advantages of chaos engineering are numerous, and all revolve around the confidence that this practice brings. Indeed, it makes it possible to establish a measured and anticipated margin of failure within which the infrastructure remains stable and continues to deliver the service as expected.
Once you have taken your first steps in this model, if you feel you have gained maturity, it becomes possible to apply the experimental conditions of chaos engineering to your production environment. In agreement with your SRE, and depending on the "error budget", a denial of service is acceptable if it is controlled, monitored, and contained, and it can help identify new areas for improvement.
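The "error budget" can be derived directly from the SLO: it is the amount of unavailability the SLO tolerates over a given window. The sketch below shows the arithmetic; the 30-day window, the function names, and the figures are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def can_run_experiment(
    slo_target: float,
    downtime_so_far_min: float,
    expected_impact_min: float,
    window_days: int = 30,
) -> bool:
    """Allow a production experiment only if its worst case fits the remaining budget."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_so_far_min
    return expected_impact_min <= remaining


# A 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
print(can_run_experiment(0.999, downtime_so_far_min=30, expected_impact_min=5))
```

When most of the budget has already been consumed by real incidents, the same check argues for postponing the experiment or running it outside production.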
What is not necessarily a disadvantage, but can slow down the adoption of chaos engineering on an infrastructure, is the self-questioning it implies. It takes a virtuous humility, which is a necessary step toward more resilience.
Then there is the sheer diversity of the malfunctions brought to light: it is easy to overlook some of them, and those can prove relatively serious and probable if they are not categorized rigorously.
Chaos Engineering is for any IT production team concerned about the quality of its infrastructure delivery and its resilience: its ability to withstand shocks and to regenerate itself.
Technical barriers to entry are relatively low. However, it takes some design thinking to implement these practices on existing infrastructure, or to integrate them into the deployment of a future platform.
Examples include the GAFAM companies, which implemented chaos engineering, along with others such as Netflix, Dailymotion, LinkedIn, Under Armour, Expedia, Target, and Walmart, which are among the most robust platforms today.
Will you give it a try?
"But what about the monkeys in this story?" Well, I'll let you read this article about Kube-monkey.