Prometheus monitoring stack set up with Ansible

Posted on 9 February 2023.

Do you have a virtual machine running Linux that you want to monitor? Do you have several Docker containers that need monitoring, but you don’t know where to start? In this article, we will use Ansible to set up a full monitoring stack based on Prometheus and Grafana, to monitor VMs and/or Docker containers.

Our objectives

Let’s say that you have a machine running virtual machines (VMs), which themselves run Docker. You may also have one or more physical machines that you want to monitor. This stack is really flexible, and that is the whole point: all that matters is that you have something to monitor, whether it is VMs, physical machines, or both.

For today, we will use three VMs on a single machine, and we will monitor only the VMs.

[Figure: what we have, three VMs on a single machine]

OK, so now that we all agree on what we have, let’s talk about what we want. In one simple word: monitoring. For once on this blog, we will not use Kubernetes. Why? Because we want a simple way to start, and Kubernetes is not that simple for beginners.

So what do we want? We want all our VMs and real machines monitored, with a simple dashboard to visualize the data.

We also want it to meet these criteria:

I can easily add a new machine or VM to monitor
I can store my configuration as code on GitHub
I can safely store my secrets on GitHub without revealing them to everyone who has access to the repo

I have the perfect solution to achieve these goals: Ansible.

How to achieve our objectives?

First of all, I will not cover what Ansible is or how to install it here. Others in the community have done so many times.

However, I will talk about the pattern that we will use to monitor our infrastructure: the observer and the targets.

It is actually quite simple: one machine monitors all the others. On this observer, we will install Prometheus and Grafana (plus Alertmanager). On the targets, we will install the agents that report metrics from our VMs and their Docker containers: node-exporter and cAdvisor.

[Diagram: the observer monitoring the targets (first version, without self-monitoring)]

node-exporter reports metrics about the hardware and the OS (CPU usage, for instance), whereas cAdvisor focuses solely on Docker-related metrics. Each can work without the other, as on VM 2 in our diagram: there is no point in installing cAdvisor on a VM that does not run Docker.

Now that you have the theory, let’s dive deeper into our architecture. Our folders are structured like this:

├── inventories            -> hosts inventory files
│  └── hosts.yml           -> describes the different hosts
├── playbooks              -> ansible playbooks
├── roles                  -> ansible roles

And in our hosts.yml, you will find… the hosts! Every machine of the monitoring stack, whether observer or target, is referenced here:

all:
  children:
    observer:
      hosts:
        padok-observer:
          ansible_host: 192.168.0.1
    target:
      hosts:
        padok-observer:
          ansible_host: 192.168.0.1
        padok-target-1:
          ansible_host: 192.168.0.10
        padok-target-2:
          ansible_host: 192.168.0.11

You may have noticed something odd in this file: padok-observer is listed twice. This is because I lied to you earlier! The diagram I gave you is missing something: self-monitoring.

[Diagram: the same architecture, with the observer also monitoring itself]

This one is much better. Now we also collect metrics from the first VM, the observer itself. In fact, the observer is also a target. That’s why it appears twice: once in the observer group, and once in the target group.

Now that everything is crystal-clear, we can start working on the observer.

Let’s set up the observer

For the observer, we will use three pieces of open-source software:

Prometheus is a free software application used for event monitoring. It collects real-time metrics by periodically scraping HTTP endpoints over the network and records them in a time series database.
Grafana is an interactive visualization web application. It provides charts and graphs when connected to supported data sources such as the Prometheus server.
Prometheus Alertmanager is a web application that handles alerts sent by client applications such as the Prometheus server. It can forward these alerts to a wide range of tools, such as Slack, OpsGenie, or email.

Together, these three form a full monitoring stack.

Configuring Prometheus, Grafana, and Alertmanager is not the main topic of this tutorial, but you will find the entire codebase in our GitHub repository.
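Still, to give an idea, here is a minimal sketch of what the prometheus_alerts_rules.yml file copied by the observer role might contain. The rule name, the 5-minute window, and the labels are illustrative assumptions, not the contents of our repository. Prometheus only loads this file if prometheus.yml lists it under rule_files. Since the role deploys it with copy rather than template, Prometheus's own {{ $labels... }} templating is left untouched by Ansible.

```yaml
# roles/observer/files/prometheus_alerts_rules.yml (illustrative sketch)
groups:
  - name: instances
    rules:
      # Fires when a scrape target (node-exporter, cAdvisor, Prometheus itself)
      # has been unreachable for 5 minutes in a row
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} is down"
```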

Back to Ansible. To make everything work, our observer role needs to:

Create the configuration folders
Create the configuration files
Create the application containers

This gives us the following roles/observer/tasks/main.yml:

- name: Create Folder /srv/prometheus if not exist
  file:
    path: /srv/prometheus
    mode: 0755
    state: directory

- name: Create Folder /srv/grafana if not exist
  file:
    path: /srv/grafana
    mode: 0755
    state: directory

- name: Create Folder /srv/alertmanager if not exist
  file:
    path: /srv/alertmanager
    mode: 0755
    state: directory

- name: Create prometheus configuration file
  copy:
    dest: /srv/prometheus/prometheus.yml
    src: prometheus_main.yml
    mode: 0644

- name: Create prometheus alert configuration file
  copy:
    dest: /srv/prometheus/prometheus_alerts_rules.yml
    src: prometheus_alerts_rules.yml
    mode: 0644

- name: Create grafana configuration files
  copy:
    dest: /srv/
    src: grafana
    mode: 0644

- name: Create alertmanager configuration file
  template:
    dest: /srv/alertmanager/alertmanager.yml
    src: alertmanager/alertmanager.j2
    mode: 0644

- name: Create Prometheus container
  docker_container:
    name: prometheus
    restart_policy: always
    # The image versions are set in roles/observer/defaults/main.yml
    image: "prom/prometheus:{{ prometheus_version }}"
    volumes:
      - /srv/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /srv/prometheus/prometheus_alerts_rules.yml:/etc/prometheus/prometheus_alerts_rules.yml
      - prometheus_main_data:/prometheus
    command: >
      --config.file=/etc/prometheus/prometheus.yml
      --storage.tsdb.path=/prometheus
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    published_ports:
      - "9090:9090"

- name: Create Grafana container
  docker_container:
    name: grafana
    restart_policy: always
    image: "grafana/grafana:{{ grafana_version }}"
    volumes:
      - grafana-data:/var/lib/grafana
      - /srv/grafana/provisioning:/etc/grafana/provisioning
      - /srv/grafana/dashboards:/var/lib/grafana/dashboards
    env:
      # Anonymous access with the Admin role: no login needed to use Grafana
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    published_ports:
      - "3000:3000"

- name: Create Alertmanager container
  docker_container:
    name: alertmanager
    restart_policy: always
    image: "prom/alertmanager:{{ alertmanager_version }}"
    volumes:
      - alertmanager-data:/data
      - /srv/alertmanager:/config
    command: >
      --config.file=/config/alertmanager.yml
      --log.level=debug
    published_ports:
      - "9093:9093"
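The Grafana container reads its provisioning files from /srv/grafana/provisioning, which the "Create grafana configuration files" task fills by copying the role's files/grafana folder to /srv/. As a sketch, assuming a file at roles/observer/files/grafana/provisioning/datasources/prometheus.yml (a hypothetical name), a datasource declaration pointing Grafana at Prometheus could look like this. The containers run on Docker's default bridge network, where they cannot resolve each other by name, so the URL uses the observer's IP from hosts.yml:

```yaml
# roles/observer/files/grafana/provisioning/datasources/prometheus.yml
# (hypothetical file name; Grafana loads the YAML files in this folder at startup)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana's backend queries Prometheus, not the browser
    url: http://192.168.0.1:9090
    isDefault: true
```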

One particularity of this setup deserves a closer look: how to handle a secret.

Using Ansible Vault and Jinja2 to handle a secret

In our monitoring configuration, we often have passwords or tokens. We call them secrets because, well, that’s what they are: secrets. Remember our objectives: we want to store our code as well as our secrets safely on GitHub (or any other repo), without making the secrets readable by anyone who can see the repo. Worry no more: let me introduce Ansible Vault.

With a single command, you can encrypt your secret. It asks for an encryption password; do not forget it, as you will need it at every deployment:

ansible-vault encrypt_string "password" --ask-vault-pass

It will give you something like:

!vault |
          $ANSIBLE_VAULT;1.1;AES256
          64306663363562356132323065396635636630373031303739323666373262663961393132316333
          6135653763363566303331313639633030623530646239310a353236343035643132646230333466
          36336439376131333630346563323833313164353265313264643232373465633561663331396133
          3163303166373166390a396131303239356139653063616437363933333130393563646338663933
          3966

And voilà! Your secret is now encrypted. Store it as a variable in roles/observer/defaults/main.yml, next to the image versions, and use it in a Jinja2 template rendered by the template module (as we saw before):

---
prometheus_version: v2.40.1
grafana_version: "9.2.5"
alertmanager_version: v0.24.0
alertmanager_smtp_password: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          64306663363562356132323065396635636630373031303739323666373262663961393132316333
          6135653763363566303331313639633030623530646239310a353236343035643132646230333466
          36336439376131333630346563323833313164353265313264643232373465633561663331396133
          3163303166373166390a396131303239356139653063616437363933333130393563646338663933
          3966

In the template (roles/observer/templates/alertmanager/alertmanager.j2, the src of the template task above), reference alertmanager_smtp_password; Ansible decrypts it when it renders the file:

route:
  receiver: "mail"
  repeat_interval: 4h
  group_by: [ alertname ]

receivers:
  - name: "mail"
    email_configs:
      - smarthost: "outlook.office365.com:587"
        auth_username: "test@padok.fr"
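        # Rendered by Ansible: the vaulted variable is decrypted at deploy time
        auth_password: "{{ alertmanager_smtp_password }}"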
        from: "test@padok.fr"
        to: "test@padok.fr"
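One caveat: Ansible decrypts the secret before rendering, so running the playbook with --diff would print the rendered alertmanager.yml, password included. A sketch of the same template task hardened with the standard diff and no_log task keywords:

```yaml
- name: Create alertmanager configuration file
  template:
    dest: /srv/alertmanager/alertmanager.yml
    src: alertmanager/alertmanager.j2
    mode: "0644"
  # Never return a diff for this task, even when --diff is passed,
  # so the decrypted SMTP password is not printed
  diff: false
  # Censor this task's result in the output and logs
  no_log: true
```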

I want to deploy!

Enough talking, let’s deploy our observer! Thanks to the tags in our playbook, we can deploy the observer on its own.

This is our playbook in playbooks/monitoring.yml:

- name: Install Observability stack (targets)
  hosts: target
  tags:
    - monitoring
    - target
  roles:
    - ../roles/target

- name: Install Observability stack (observer)
  hosts: observer
  tags:
    - monitoring
    - observer
  roles:
    - ../roles/observer

In the ansible-playbook command, we pass -t observer to run only the play tagged observer:

ansible-playbook -i ansible/inventories/hosts.yml -u TheUserToExecuteWith ansible/playbooks/monitoring.yml -t observer --ask-vault-pass

That’s it! You now have a fully functional observer. You can access Prometheus on port 9090, Grafana on port 3000, and Alertmanager on port 9093. The only things missing now are the targets; once they are deployed, your monitoring stack will be complete.

Let’s set up the targets

For the targets, the setup depends on what runs on each machine. We will use two main agents: node-exporter and cAdvisor. As I said before, node-exporter focuses on hardware and OS metrics, whereas cAdvisor reports Docker-related metrics.

Here, we will install both. This gives us the following Ansible role, roles/target/tasks/main.yml:

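- name: Create NodeExporter
  docker_container:
    name: node-exporter
    restart_policy: always
    # node_exporter_version is an illustrative variable name: define it in
    # roles/target/defaults/main.yml, as done for the observer images
    image: "prom/node-exporter:{{ node_exporter_version }}"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command: >
      --path.procfs=/host/proc
      --path.rootfs=/rootfs
      --path.sysfs=/host/sys
      --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
    published_ports:
      - "9100:9100"

- name: Create cAdvisor
  docker_container:
    name: cadvisor
    restart_policy: always
    # cadvisor_version is an illustrative variable name, same as above
    image: "gcr.io/cadvisor/cadvisor:{{ cadvisor_version }}"
    # Privileged mode and /dev/kmsg are part of the docker run command
    # recommended in the cAdvisor README
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    published_ports:
      - "9101:8080"   # cAdvisor listens on 8080 inside the container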
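As written, the role installs cAdvisor on every target, including VMs that, like VM 2 in our diagram, run no Docker workloads of their own. A sketch of a per-host switch, assuming a hypothetical inventory variable cadvisor_enabled that defaults to true (cadvisor_version is likewise a hypothetical name for the image tag); this is a version of the cAdvisor task with a when condition added:

```yaml
# roles/target/tasks/main.yml: the cAdvisor task with a when condition.
# cadvisor_enabled is a hypothetical host variable; to opt a host out, set it
# in hosts.yml, e.g. under padok-target-1:
#   cadvisor_enabled: false
- name: Create cAdvisor
  docker_container:
    name: cadvisor
    restart_policy: always
    # cadvisor_version is a hypothetical variable for the image tag
    image: "gcr.io/cadvisor/cadvisor:{{ cadvisor_version }}"
    # As in the docker run command from the cAdvisor README
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    published_ports:
      - "9101:8080"
  # Skip the task when the host opts out; the bool filter also accepts
  # string values such as "false"
  when: cadvisor_enabled | default(true) | bool
```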

These two expose their metrics over HTTP on a /metrics endpoint for Prometheus to scrape: node-exporter on port 9100, and cAdvisor on port 8080 inside its container, published as port 9101 on the host.

There is only one thing missing for our monitoring stack to work: Prometheus needs to know about the targets. For that, we add them to the scrape_configs of the Prometheus configuration file (roles/observer/files/prometheus_main.yml):

scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
    - targets: ["localhost:9090"]

  - job_name: node-exporter
    scrape_interval: 30s
    static_configs:
    - targets: ["192.168.0.1:9100", "192.168.0.10:9100", "192.168.0.11:9100"]

  - job_name: cadvisor
    scrape_interval: 30s
    static_configs:
    - targets: ["192.168.0.1:9101", "192.168.0.11:9101"]
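Scraping is only half of it: for Prometheus to evaluate the alert rules file copied by the observer role and send alerts to Alertmanager, the same file also needs rule_files and alerting sections. A minimal sketch, assuming Alertmanager is reached through the port it publishes on the observer (192.168.0.1:9093 in our inventory); the 30s evaluation interval simply mirrors the scrape intervals above:

```yaml
# Other top-level sections of roles/observer/files/prometheus_main.yml (sketch)
global:
  evaluation_interval: 30s   # how often alerting rules are evaluated

# Path as seen inside the Prometheus container (see the volume mounts)
rule_files:
  - /etc/prometheus/prometheus_alerts_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        # Alertmanager's published port on the observer; the containers sit on
        # Docker's default bridge network, so they cannot reach each other by name
        - targets: ["192.168.0.1:9093"]
```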

You can now deploy the targets. As before, use the tags:

ansible-playbook -i ansible/inventories/hosts.yml -u TheUserToExecuteWith ansible/playbooks/monitoring.yml -t target --ask-vault-pass

Be careful: if you have modified the Prometheus configuration file, you also need to redeploy the observer and restart Prometheus for the new configuration to apply.
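That restart can be automated with a handler, so that it only happens when the file actually changes. A sketch (the handler name is illustrative): it uses a plain docker restart rather than Prometheus's /-/reload endpoint because prometheus.yml is bind-mounted as a single file and Ansible's copy module replaces files atomically (a new inode), so a running container keeps reading the old version until it is restarted.

```yaml
# roles/observer/tasks/main.yml: the existing copy task, now notifying a handler
- name: Create prometheus configuration file
  copy:
    dest: /srv/prometheus/prometheus.yml
    src: prometheus_main.yml
    mode: "0644"
  notify: Restart Prometheus
---
# roles/observer/handlers/main.yml
# Handlers run at the end of the play, and only if a task notified them
- name: Restart Prometheus
  # Restarting re-creates the single-file bind mount, so Prometheus
  # starts with the freshly copied prometheus.yml
  command: docker restart prometheus
```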

Add a target machine

Adding a target machine or VM only takes a few steps:

Make sure that you can connect to it through SSH
Add its hostname and IP under target in the inventory (hosts.yml)
Add its IP to the targets of the node-exporter job (and of the cadvisor job, on port 9101, if you also want to monitor its containers) in the observer's Prometheus configuration (ansible/roles/observer/files/prometheus_main.yml)
Run the ansible-playbook command for both the observer and the targets
That’s it!
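For example, adding a third VM, padok-target-3 at 192.168.0.12 (both hypothetical, for illustration only), touches two files, after which you run the ansible-playbook command with -t observer and with -t target:

```yaml
# inventories/hosts.yml, with the hypothetical padok-target-3 added
all:
  children:
    observer:
      hosts:
        padok-observer:
          ansible_host: 192.168.0.1
    target:
      hosts:
        padok-observer:
          ansible_host: 192.168.0.1
        padok-target-1:
          ansible_host: 192.168.0.10
        padok-target-2:
          ansible_host: 192.168.0.11
        padok-target-3:
          ansible_host: 192.168.0.12
---
# roles/observer/files/prometheus_main.yml, node-exporter job only
scrape_configs:
  - job_name: node-exporter
    scrape_interval: 30s
    static_configs:
    - targets: ["192.168.0.1:9100", "192.168.0.10:9100", "192.168.0.11:9100", "192.168.0.12:9100"]
```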

We have now covered everything. The entire codebase is publicly available.

Here is what it looks like (this is the node-exporter dashboard):

[Screenshot: the node-exporter dashboard in Grafana]

The end?

In fact, it might not be the end.

Templating the Prometheus configuration with Jinja2 could let you generate its scrape targets automatically from hosts.yml, so that adding a machine only means editing the inventory.
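As a rough sketch of that idea: if prometheus_main.yml became a template (say roles/observer/templates/prometheus_main.j2, a hypothetical name) rendered with the template module instead of copy, the node-exporter targets could be built from the target group of the inventory using Ansible's groups and hostvars magic variables. The template module enables Jinja2's trim_blocks by default, so the {% %} lines leave no blank lines in the rendered YAML.

```yaml
# roles/observer/templates/prometheus_main.j2 (hypothetical file)
scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
    - targets: ["localhost:9090"]

  - job_name: node-exporter
    scrape_interval: 30s
    static_configs:
    - targets:
{% for host in groups['target'] %}
      - "{{ hostvars[host]['ansible_host'] }}:9100"
{% endfor %}
```

With the current hosts.yml, this renders the same three node-exporter targets as the static file, since padok-observer is part of the target group.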

Also, remember when I told you that we would not use Kubernetes because it wasn’t simple enough? Well, I might have been wrong. Why? Because with Kube, the Prometheus Operator automates much of what I’ve shown you. The initial setup takes longer, but depending on your case, it may be the better solution. It is up to you!