Cloud Outsourcing: Our Monitoring Approach

Posted on 15 February 2024.

At Padok, outsourcing means taking charge of managing and optimizing our clients' IT infrastructure. Padok offers flexible solutions tailored to various needs, ranging from daily maintenance to incident response and strategic evolution.

Why outsource this management? Four main reasons encourage companies to seek this expertise:

Focusing technical teams on core business activities.
Keeping the infrastructure stable without interruption.
Sparing employees on-call responsibilities.
Gaining access to high-level external expertise.

At Padok, we handle a wide range of clients and strive to simplify this process, for example by avoiding the implementation of a specific monitoring system for each client. Historically, the industry has tended to focus on machine health. At Padok, we concentrate on the health of the user experience.

In this article, we'll begin by introducing two existing monitoring types and the approach we've chosen at Padok. We'll then present the metrics we monitor to oversee our clients' infrastructures and how this monitoring is implemented in tools such as Datadog. Finally, we'll discuss key indicators of infrastructure health: indicators serving our clients and their end-users.

Black-box monitoring vs white-box

Monitoring involves collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as the number and types of requests or errors, processing times, and server lifetimes.

The book "Site Reliability Engineering: How Google Runs Production Systems" distinguishes two types of monitoring: white-box monitoring and black-box monitoring.

White-box monitoring relies on the ability to inspect the system's internals, like logs or HTTP endpoints. It allows the detection of imminent issues, failures masked by retry attempts, and so on. An example of white-box monitoring would be monitoring the performance of a server's CPU or RAM.

Black-box monitoring, on the other hand, focuses on user journeys (web or mobile applications, for example) rather than server components. It concentrates on symptoms and reveals issues that are active and were not foreseen. An example is monitoring the number of HTTP requests that a client sends to the web server and that either receive no response or come back with abnormal latency.
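To make the black-box example concrete, here is a minimal sketch of such a check in Python, using only the standard library. It is an illustration, not the tooling used at Padok: the names `probe`, `ProbeResult`, and `http_get_status`, the 1-second latency threshold, and the choice to treat only 5xx statuses as failures are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool
    reason: str
    latency_s: float
    status: Optional[int]


def http_get_status(url: str, timeout_s: float) -> int:
    """Send a GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the server did answer: keep the code.
        return err.code


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_get_status,
    max_latency_s: float = 1.0,  # assumed threshold for "abnormal latency"
    timeout_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Judge an endpoint only from the outside, as a user would see it."""
    start = clock()
    try:
        status = fetch(url, timeout_s)
    except OSError as err:
        # URLError, timeouts and connection errors all derive from OSError.
        return ProbeResult(False, f"no response: {err}", clock() - start, None)
    latency = clock() - start
    if status >= 500:
        return ProbeResult(False, "server error", latency, status)
    if latency > max_latency_s:
        return ProbeResult(False, "abnormal latency", latency, status)
    return ProbeResult(True, "ok", latency, status)
```

The `fetch` and `clock` parameters exist only so the logic can be exercised without a network; in real use, `probe("https://...")` with the defaults would be called on a schedule and its failures counted.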

Historically, the industry has concentrated on white-box monitoring, favoring a deep internal understanding of systems. This choice was mainly due to the need to anticipate resource consumption on the hardware side to avoid impacting application performance.

Nowadays, with the emergence of virtualization and subsequently the cloud, hardware capacity is far less of a constraint. We can now focus on issues directly affecting users through black-box monitoring. It provides results-based visibility, enabling quick responsiveness and better adaptation to complex and heterogeneous systems. The emphasis is now on user experience and overall system performance.

At Padok, our focus on metrics directly impacting end-users guided our choice to implement black-box alerts: alerts based on measurable indicators that directly affect our clients' business.

To learn more about black-box and white-box monitoring, you can check out this article on our blog (in French).

The Four Golden Signals

Padok's outsourcing team monitors four key indicators to assess the performance and reliability of its clients' infrastructures:

Latency: the time taken by the system to respond to requests.
Error rate: the percentage of failed requests.
Traffic rate: the amount of traffic passing through the system.
Saturation: the degree of system resource utilization (CPU, memory, disk).

These indicators are inspired by the Four Golden Signals from the reference book "Site Reliability Engineering: How Google Runs Production Systems."
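As a rough illustration of how the four indicators relate to raw data, the sketch below derives them from a window of request records. Everything in it is an assumption made for the example: the `Request` record, the use of the 95th percentile for latency, counting only 5xx statuses as errors, and passing CPU utilization in as a number, since saturation has to come from a white-box source.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(
    requests: Sequence[Request],
    window_s: float,
    cpu_utilization: float,  # fraction between 0 and 1, measured elsewhere
) -> Dict[str, float]:
    """Summarize one observation window into the four indicators."""
    if not requests:
        raise ValueError("at least one request is needed to compute signals")

    # Latency: nearest-rank 95th percentile, computed with integer math.
    durations = sorted(r.duration_ms for r in requests)
    rank = (95 * len(durations) + 99) // 100
    latency_p95 = durations[rank - 1]

    # Error rate: share of requests answered with a 5xx status.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": latency_p95,
        "error_rate_pct": 100.0 * errors / len(requests),
        "traffic_rps": len(requests) / window_s,
        "saturation_pct": 100.0 * cpu_utilization,
    }
```

For instance, 20 requests observed over 10 seconds, one of which failed with a 500, give an error rate of 5% and a traffic rate of 2 requests per second.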

Google developed these principles to address its rapid growth and the increasing complexity of its systems. The signals arose from direct experience with performance challenges; they are aligned with user experience and anchored in a culture of innovation and operational excellence. They offer a simple yet powerful framework to maintain service quality.

At Padok, we implement black-box probes by running synthetic tests in Datadog that monitor latency and errors directly on public endpoints. This provides a precise view of the user experience.

Three types of probes are used: Uptime check, Certificate check, and Browser & API tests. They monitor the availability of a system or critical user paths by regularly checking if they are online and accessible.
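To show what a certificate check verifies in principle, here is a small standard-library sketch. It is not how Datadog implements its probe: the function names and the 14-day warning threshold are assumptions chosen for illustration.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Open a TLS connection and return the certificate's 'notAfter' field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now_s: float) -> float:
    """Convert a 'notAfter' string such as 'Jan  5 09:34:43 2018 GMT' to days left."""
    expires_s = ssl.cert_time_to_seconds(not_after)
    return (expires_s - now_s) / 86400


def certificate_ok(not_after: str, now_s: float, min_days: float = 14.0) -> bool:
    """Fail the check when the certificate expires within min_days (assumed value)."""
    return days_until_expiry(not_after, now_s) >= min_days
```

A scheduled job could call `certificate_ok(fetch_not_after("example.com"), time.time())` (after importing `time`) and raise an alert when it returns False, well before users see a browser warning.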

We chose Datadog as our observability platform for its robust features, especially for implementing comprehensive black-box tests. Alternatives such as Blackbox Exporter and Pingdom, to name a few, do not allow for multiple checks, whereas Datadog provides the flexibility we need.

Saturation and traffic can be measured directly on machines in white-box mode using managed services or the Prometheus Stack. In the cloud, saturation issues are often handled by autoscaling, which makes alerting on them less common. Still, one cannot scale infinitely while maintaining a cost-optimization culture.

By measuring these four indicators and alerting a human when a signal indicates a problem, we make sure our clients' services are properly monitored. In addition, clients can transparently follow the availability of their platform through the SLO derived from black-box monitoring.

Service Level Objectives (SLO)

Service Level Objectives (SLOs) are defined by operational teams in consultation with product stakeholders. They represent promises made to customers regarding service availability and quality. In the case of outsourcing at Padok, SLOs are measured as a percentage of availability (uptime) and are used to assess service performance.

For example, a 99% uptime SLO means the service must be available 99% of the time. SLOs serve as a basis for measuring actual performance against expectations and for taking corrective measures if objectives are not met.
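The arithmetic behind such a target is easy to work out. The sketch below assumes a 30-day measurement period and an availability computed from the share of successful black-box checks; both are illustrative assumptions, as are the function names.

```python
def allowed_downtime_minutes(slo_pct: float, period_days: int = 30) -> float:
    """Downtime budget implied by an uptime SLO over the period (30 days assumed)."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_pct) / 100


def availability_pct(successful_checks: int, total_checks: int) -> float:
    """Measured availability as the share of black-box checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100 * successful_checks / total_checks


def slo_met(successful_checks: int, total_checks: int, slo_pct: float) -> bool:
    """Compare measured availability against the objective."""
    return availability_pct(successful_checks, total_checks) >= slo_pct


print(allowed_downtime_minutes(99))  # 432.0 minutes, i.e. 7.2 hours per 30 days
```

Under the same assumption, a 99.9% objective leaves about 43 minutes of downtime per 30 days, which shows how quickly each additional "nine" tightens the budget.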

To ensure total transparency, we set up dashboards in Datadog that our clients can access. This allows them to track the availability of their platform.

We also integrate Datadog with Slack to report alerts in dedicated channels. These alerts reference runbooks, which are collections of documented procedures explaining how to handle a particular process. They facilitate communication and incident resolution.

In conclusion, our user-centered monitoring approach, coupled with the implementation of black-box probes and SLOs, accelerates the recovery of our outsourcing clients' infrastructures. This helps maintain a stable, high-quality infrastructure.

In case of an incident, we act promptly through a well-defined process of alerts, reactions, and post-mortem analysis, limiting the recurrence of similar incidents.

Furthermore, our approach allows us to identify infrastructure weaknesses, investigate issues, and propose continuous improvements. For our clients, this results in optimal responsiveness and an enhanced user experience, reinforcing trust in our outsourcing services.