Resilient

Resilience is the ability to maintain a high level of service, even in the event of disruptions or outages.

Adopt

1

CaaS

"Container as a Service" services enable you to get closer to "Serverless" without fundamentally impacting the way you organize or develop your applications.

 

Today, most Cloud Providers offer CaaS (e.g., CloudRun on GCP, AWS Lambda/ECS on AWS, ACS on Azure). These are ideal for teams that do not require extensive customization. Some of their features (e.g., scale to 0) enable significant maintenance and cost savings. 

 

Using a CaaS service today can be a great first step in modernizing your applications or creating a new one. 

 

CaaS is a real alternative to:

  • Kubernetes, which is complex to set up and maintain
  • Serverless, because of the architectural transformation it implies

 

Another advantage of CaaS is that it is, by definition, less prone to vendor lock-in than other hosting services. Your application is packaged using a market standard (an OCI image) and can be deployed on any service that supports OCI images, such as a future Kubernetes cluster.

 

We've observed two distinct outcomes when teams use these services: either they are fully adopted and fit our customers' needs perfectly, or they lack customization options and our customers naturally move towards Kubernetes. For example, a year ago Cloud Run didn't support keeping the CPU "always allocated," so we couldn't use it for applications with background processes; this limitation has since been lifted.


These services have become indispensable, and we recommend them. If using a Cloud Provider isn't an option, building your own CaaS using open-source technologies like Knative is always possible.
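
To give an idea of the model, here is a minimal sketch of a Knative Service (the image name is a placeholder): a single manifest describes the whole deployment, and Knative scales it to zero when no traffic arrives.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: hello
    spec:
      template:
        spec:
          containers:
            - image: registry.example.com/hello:1.0.0  # any OCI image
              ports:
                - containerPort: 8080                  # port the app listens on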

3

Kubernetes

Kubernetes is the current community standard for scalable containerized application deployments.


Deploying an application in production involves solving a number of technological challenges, such as:


  • Ensuring the reproducibility of deployments
  • Addressing multiple replicas of your application to ensure resilience
  • Adapting processing capacity to the load
  • Incorporating external components into your deployment at a lower cost (i.e., without maintaining a whole series of additional deployment scripts)

Before Kubernetes and the container era, we would have used an army of Bash or Python scripts, Ansible playbooks, or other custom setups to ensure application deployment. Today, all of this is elegantly replaced by a CI pipeline that produces container images and Helm charts for deployment.
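
As a sketch of this declarative approach (names and image are illustrative), a single Kubernetes manifest covers both reproducibility and replication:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api
    spec:
      replicas: 3                  # several replicas for resilience
      selector:
        matchLabels:
          app: api
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
            - name: api
              image: registry.example.com/api:1.2.3  # built by the CI pipeline
              ports:
                - containerPort: 8080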


We would also have used a separate VIP system for load balancing. This is now supported:


  • Natively by Kubernetes for cluster-internal load balancing
  • Via integrations with Cloud Providers for external load balancing
  • Via extensions such as MetalLB on on-premises infrastructure

The management of processing capacity and the installation of external tools are also facilitated by Kubernetes' YAML interface and the various tools that have grown around it, such as Kustomize and Helm.
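
Adapting capacity to the load, for example, boils down to one more declarative resource; a minimal sketch of a HorizontalPodAutoscaler (target values are illustrative):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api
      minReplicas: 3
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # add replicas above 70% average CPU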


Even so, Kubernetes maintenance does entail a certain burden, even when using a Cloud Provider's managed service. A CaaS is lighter to maintain but less extensible: it is often impossible to install additional controllers on a CaaS.


We therefore recommend Kubernetes to all our customers who can maintain it and who want a scalable platform for deploying large-scale applications with ease.

4

BCP

A business continuity plan, or BCP, is the most pragmatic cloud architecture pattern for ensuring high resilience, exploiting the strengths of cloud providers.


A business continuity plan (BCP) guarantees infrastructure availability in the event of a disaster. It is essential for any infrastructure wishing to maintain a high level of availability and deliver an uninterrupted user experience. Before designing an architecture and considering implementing a BCP, you need to estimate a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO) for your application.


  • The Recovery Time Objective (RTO) is the maximum acceptable delay between service interruption and service restoration.
  • The Recovery Point Objective (RPO) is the maximum acceptable data loss, expressed as the time since the last recovery point.

Public Cloud Providers offer several ways of guaranteeing a BCP. For example: 

  • AWS regions are geographic areas; a country can host several, and each region is subdivided into availability zones.
  • An availability zone groups together several data centers, which host the Cloud Provider's resources.

Depending on the objectives, the application will be hosted on:

  • A single availability zone within a region
  • Several availability zones within a region, which is the most common approach (sketched below)
  • Multiple availability zones within multiple regions. This guarantees the highest availability but is also the hardest to maintain, as you need to ensure data consistency across several regions while keeping response times low.
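
On Kubernetes, for instance, spreading replicas across availability zones is a short, declarative addition to a pod template; a minimal sketch (the app label is illustrative):

    # Fragment of a Deployment's pod template spec
    topologySpreadConstraints:
      - maxSkew: 1                                # at most one replica of imbalance
        topologyKey: topology.kubernetes.io/zone  # one domain per availability zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api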

At Padok, we recommend implementing BCPs rather than DRPs (Disaster Recovery Plans): DRPs are costly to test and rarely produce levers you can actually activate. We are also convinced that it is vital for these plans to be implemented at the Cloud Provider level.


The more resilient and partition-tolerant an infrastructure becomes, the more complicated data consistency becomes. This is the CAP theorem.

2

KEDA

Open source, KEDA enables resources deployed in Kubernetes to be scaled based on external events.

One of our main objectives as DevOps engineers is to ensure that infrastructures can absorb load quickly. However, scaling resources in anticipation is difficult, as we generally rely on CPU and RAM consumption, which only react once the load has already arrived. This is where KEDA (Kubernetes Event-Driven Autoscaling) makes the task much easier.

KEDA is a component that extends Kubernetes' event-based autoscaling capabilities. 

It monitors:

  • queues 
  • data flows
  • messaging systems

By monitoring these, KEDA can trigger application scaling based on the event load of numerous services such as Kafka, RabbitMQ, Azure Service Bus, AWS SQS, and Pub/Sub. It is therefore possible to scale a resource up when many messages accumulate upstream, for example, or down to 0 when no messages are present.
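
A minimal sketch of a ScaledObject scaling a worker on a RabbitMQ queue (names and thresholds are illustrative):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: orders-worker
    spec:
      scaleTargetRef:
        name: orders-worker        # the Deployment to scale
      minReplicaCount: 0           # scale to 0 when the queue is empty
      maxReplicaCount: 20
      triggers:
        - type: rabbitmq
          metadata:
            queueName: orders
            mode: QueueLength
            value: "50"                 # target 50 messages per replica
            hostFromEnv: RABBITMQ_HOST  # connection string from the pod's env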

KEDA is an excellent choice for event-driven autoscaling. There are alternatives, such as proprietary solutions offered by Cloud Providers like AWS Lambda, Azure Functions, or Google Cloud Functions. However, KEDA stands out for its open-source approach and compatibility with Kubernetes.

KEDA is, therefore, a powerful and flexible tool for managing event-driven autoscaling in Kubernetes. We recommend it if you're looking for an efficient way to manage the event-driven scalability of your containerized applications.

5

Synthetic Monitoring

A technique for monitoring your applications that involves simulating a real user with robots to detect malfunctions.


Synthetic Monitoring is a technique for monitoring your applications that involves simulating a real user with robots to detect malfunctions. It contrasts with the classic but now outdated technique of checking infrastructure availability, which no longer makes sense in highly distributed cloud architectures, where self-healing is present by design.


Generally speaking, priority is given to testing the critical paths of an application: the user journeys that represent the greatest business value. For an e-commerce site, for example, this would be the checkout funnel:


  • product search
  • add to basket
  • payment

This test will ensure that your customers can actually buy on your site. It also validates that your backend services, such as search, session storage, etc., are all operational during the test.


From a simple call to a backend API to a multi-step scenario, many tools are available for Synthetic Monitoring. The choice is vast, from paid tools such as Datadog, NewRelic, and Dynatrace to open-source tools like Blackbox Exporter (Prometheus stack). Note, however, that it's best to run these scenarios from outside the infrastructure, to position yourself as a "real" client of your application and thus detect network malfunctions (outages, latency).
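
With the Prometheus stack, for example, a scrape job can probe a critical endpoint through Blackbox Exporter; a minimal sketch (the URL and exporter address are placeholders):

    scrape_configs:
      - job_name: blackbox-http
        metrics_path: /probe
        params:
          module: [http_2xx]          # module defined in the exporter's config
        static_configs:
          - targets:
              - https://shop.example.com/checkout   # critical path to probe
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target            # the URL becomes the probe target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115     # actually scrape the exporter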


Beware of external dependencies, however: they can generate alerts you cannot act on. It is still important to know when one of your suppliers is unavailable, so we recommend setting up specific procedures for such events and implementing a circuit breaker system in your applications. Your application can then automatically switch to maintenance mode, and these events can be handled separately during your on-call periods; you could, for instance, be alerted only if the service is inaccessible for more than an hour.


Application monitoring should be a standard part of your development cycle. As with security and performance, you need a monitoring phase to ensure your services run smoothly, and it shouldn't be neglected during maintenance and further development. You should never go into production without probes to warn you of malfunctions: nothing is more frustrating than being alerted by your customers before noticing a problem yourself.


We recommend an "automatic" approach, creating probes for each deployed application. This is what we do on our projects, thanks to the flexibility of the Kubernetes API and tools such as Blackbox Exporter or Datadog via Crossplane. As a result, none of our projects goes into production without adequate monitoring to validate that the service is being delivered.


Synthetic Monitoring is essential to ensure that your application is up and running. You shouldn't consider going into production without this kind of monitoring.

Trial

7

Karpenter

Karpenter is a node autoscaler for Kubernetes. It differentiates itself from its peers by letting you define its autoscaling configuration within the cluster itself.


In the world of node autoscaling for Kubernetes, two significant solutions exist today:


  • Autoscaling managed by the Cloud Provider, such as GKE with its Autopilot mode
  • The Kubernetes Cluster Autoscaler (KCA), deployed inside the cluster itself and still widely used on AWS EKS

AWS is the last major Cloud Provider not to offer managed node autoscaling; its answer to this gap was to develop Karpenter.


Unlike KCA, Karpenter does not rely on the Cloud Provider to create node groups; it manages node provisioning itself using CRDs. Combined with a GitOps tool such as ArgoCD, Karpenter brings a new level of flexibility to node configuration.


As soon as at least one pod is in the Pending state and the Kubernetes scheduler cannot assign it to a node, Karpenter will provide the right capacity to accommodate this workload. This may involve one or more nodes with sufficient characteristics.
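
At the time of writing, this configuration lives in a Provisioner CRD; a minimal sketch for EKS (values are illustrative, and the API is evolving quickly):

    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: default
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # allow both spot and on-demand capacity
      limits:
        resources:
          cpu: "100"                     # cap the total CPU Karpenter may provision
      ttlSecondsAfterEmpty: 30           # reclaim empty nodes quickly
      providerRef:
        name: default                    # references an AWSNodeTemplate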


In addition, Karpenter comes with some exciting features:


  • TTL for nodes: lets you renew the nodes in your infrastructure regularly, mainly to keep them continuously up to date
  • Consolidation: Karpenter regularly evaluates whether deleting or replacing a node would shrink the cluster, and applies the change when it does
  • Drift detection: dynamically flags nodes whose configuration is no longer aligned with their associated provisioner

Karpenter is therefore a very interesting alternative to KCA when using AWS EKS, and it should become compatible with other Cloud Providers in the future. However, the tool is young and evolving very rapidly compared with what already exists, so we recommend researching it thoroughly before running it in production.

6

k6

k6 is an extensible, Kubernetes-native load-testing framework.


k6 is an extensible load-testing framework developed by Grafana Labs. It enables you to test the resilience of your infrastructure to peak loads on critical routes.


k6 uses JavaScript as its scripting language, which makes it easy for developers to own the scripts run against their applications. k6 itself is written in Go, so there's no need to worry about performance.
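
A first test script might look like this (the URL, stages, and threshold are illustrative):

    import http from 'k6/http';
    import { check, sleep } from 'k6';

    export const options = {
      stages: [
        { duration: '2m', target: 100 }, // ramp up to 100 virtual users
        { duration: '5m', target: 100 }, // hold the plateau
        { duration: '1m', target: 0 },   // ramp down
      ],
      thresholds: {
        http_req_duration: ['p(95)<500'], // fail if p95 latency exceeds 500ms
      },
    };

    export default function () {
      const res = http.get('https://api.example.com/products');
      check(res, { 'status is 200': (r) => r.status === 200 });
      sleep(1); // think time between iterations
    }

    // run with: k6 run script.js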


k6 ships with an InfluxDB output for storing metrics, but sending them to Prometheus is also possible thanks to xk6, the tool's extension system. This is what won us over at Padok: you can easily add functionality with an MQTT module for IoT, RabbitMQ, event-driven workloads, browser tests, and much more!


k6 also comes with a Kubernetes operator, which is still experimental and does not include a master/worker aggregation system by default: each k6 replica sends raw metrics to your storage system without any added logic, which caused us problems when sending metrics to Prometheus. It is also up to you to master Grafana to extract the data you want. What's more, to use extensions with the operator, you'll need to build your own runner images that include them.


However, running k6 locally or from a VM for simple HTTP tests is sufficient for sporadic testing. There is also a GitHub Action to integrate k6 directly into your CI pipelines. Using the operator is possible too, but will require additional development work.


k6 seems to us today to be the most promising load-testing solution for all your cloud and Kubernetes infrastructures. But using it to its full potential requires a bit of work.

8

Locust

Locust is a tool for measuring the performance of your web application.


Locust belongs to the family of load-testing tools: it lets you describe usage scenarios for your web applications and then replay them with many virtual users.


Performing these tests allows you to:


  • Gain confidence in your application's performance evolution
  • Check that your current infrastructure is sufficient to handle a heavier load
  • Identify the various bottlenecks in your application

Locust's strength lies in its ease of use and its ability to scale: its central-agent-and-workers model lets you generate almost unlimited load to reach the performance threshold you want to verify.


The scenarios are very simple to write, and if you're familiar with Python, you'll have very little trouble writing your first tests.
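
For example, a first scenario might look like this (routes and task weights are illustrative):

    from locust import HttpUser, task, between

    class ShopUser(HttpUser):
        wait_time = between(1, 3)  # think time between tasks

        @task(3)  # three times more frequent than the basket task
        def browse_products(self):
            self.client.get("/products")

        @task(1)
        def add_to_basket(self):
            self.client.post("/basket", json={"product_id": 42})

    # run with: locust -f locustfile.py --host https://shop.example.com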


Locust also provides a UI featuring real-time performance dashboards and control over running tests (start/stop).


There are many load-testing tools on the market, but Locust is one of the simplest, and we recommend you try it along with k6.

Assess

9

Nomad

Nomad is a task orchestrator.


The task orchestrator has become one of the pillars of modern infrastructure. With Kubernetes being the best known, it is easy to convince ourselves that it's the only viable option and should be adopted by default.


It's a choice that many companies are making, as container management with Kubernetes is very mature. However, the infrastructure of many companies is heterogeneous (containers, VMs, web services...) and would be very costly to containerize. They can therefore turn to Nomad.


Nomad is a general-purpose task orchestrator created by HashiCorp. In addition to managing containers, Nomad features drivers that support tasks in virtual machines, simple scripts, Java applications, and more. Distributed as a single binary, Nomad is lightweight and easy to install.
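
A minimal job specification gives an idea of the format (names are illustrative; docker is only one driver among exec, java, qemu, and others):

    job "api" {
      datacenters = ["dc1"]

      group "api" {
        count = 2  # two instances of the task

        task "server" {
          driver = "docker"  # could also be exec, java, qemu...

          config {
            image = "registry.example.com/api:1.0.0"
          }

          resources {
            cpu    = 500  # MHz
            memory = 256  # MB
          }
        }
      }
    }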


Backed by Consul (HashiCorp's Service Mesh solution), it is possible to federate applications completely, whatever form they take. This gives Nomad a flexibility and an ability to adapt to existing systems that Kubernetes lacks: migrating to an orchestrator becomes much less costly than containerizing all your applications first.

Our reservations relate mainly to its lack of integration with existing Cloud Providers, as well as its lower community adoption compared with Kubernetes. It is nevertheless worth considering if you have a large on-premises infrastructure or limited resources to devote to orchestrating your tasks.

10

Database Operators

Kubernetes operators facilitating database deployment, maintenance and backup.


Deploying a database in Kubernetes using traditional means (generally, a StatefulSet associated with a volume) quickly poses problems that managed services have addressed:


  • How do you ensure regular data backup?
  • How do you manage scaling in the event of a load increase?
  • How do you ensure that updates go smoothly?

Without integrated tooling, all these problems have to be solved by external programs, and maintenance becomes costly and error-prone. Given the impact an incident on this type of service can have, the tendency is often to "set it up and forget about it," which brings its own share of problems.


Kubernetes database operators address these issues. Like all operators, they generally take the form of deployments associated with CRDs; in this case, the CRDs are typically "Cluster," "Database," "Backup," or "User" objects.


They make database configuration declarative, unlike the scripts generally used to initialize databases. They can also automate operations such as updates, backup scheduling, restoring data from backups, connection pooling, and monitoring.


At Padok, we've already used the MySQL operator created by Bitpoke to create environments on the fly in Kubernetes easily. The operator's function was to create a temporary database from a backup, which could be used to run integration tests. The performance of this solution is interesting because the DB starts up quickly, unlike a managed service, and is as close as possible to the application using it.
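
As a sketch of what this declarative interface looks like with the Bitpoke operator (the bucket URL and secret names are placeholders):

    apiVersion: mysql.presslabs.org/v1alpha1
    kind: MysqlCluster
    metadata:
      name: integration-tests
    spec:
      replicas: 1
      secretName: integration-tests-credentials    # holds the DB credentials
      initBucketURL: s3://db-backups/latest.sql.gz # seed the database from a backup
      initBucketSecretName: backup-credentials     # credentials to read the bucket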

 

We recommend using a database operator for this use case. For production use, check the operator's maturity and make a cost estimate (a managed service can become very expensive compared to a good operator). If you have no particular constraints, we always recommend using a managed service from your Cloud Provider.

Hold

11

Gatling

Gatling is a load-testing tool available in an open-source version and an enterprise plan.

Gatling offers a hands-on experience with extensive customization through load-testing scenarios written in Scala (with Java, Kotlin, and no-code options also available).

One of Gatling's advantages is its graphical interface, which displays active and ongoing load tests along with key metrics about them, for example:

  • RAM and CPU consumption
  • Server response latency
  • Percentile calculations

Setting up Gatling Enterprise via the marketplace is straightforward, and the open-source version is easy to install on any machine.

The open-source version is extensive; its main limitation is scalability when you want to run distributed load tests.

For these reasons, we now recommend k6: a somewhat less mature product, but fully open source and far better integrated with our clients' technical stacks.