
Posted on 28 December 2020, updated on 21 December 2023.

What happens when you need a shared filesystem, but all you have at your disposal is a simple disk mounted on a VM? That was my issue with a self-hosted Sentry on AWS EKS and only EBS disks in the cluster. Luckily, Kubernetes orchestration offers very fine-grained management of pod scheduling with node and pod affinity!

The issue: the need to share a filesystem

A few weeks ago, developers came to me with a problem: a Sentry feature, source maps, was not working correctly. Sentry is a great tool to monitor end users' errors in your frontend and report them to a unified platform. It gives you a single place to assess and resolve bugs in your application, with user/project management, alerting, and more!

Sentry is available as a SaaS, but it actively supports open source and you can easily deploy the solution on your own machines. Here at Padok, we are fully engaged in the Kubernetes revolution, and therefore naturally deployed it quite a while ago with Helm on our own AWS EKS cluster.

For this specific issue, we first suspected our Sentry version, which is lagging quite far behind (we are still using the now-deprecated old Helm chart). However, after a few minutes of research, we found the following issue on GitHub, which described exactly our problem!

[Screenshot of the GitHub issue]

Sentry has quite a particular software architecture: a main container and a few workers, with Celery as a job queue in between. But it also needs to share data through the filesystem, at the path /var/lib/sentry/files. I find this quite an anti-pattern in the stateless, disposable, cattle-not-pets world of Kubernetes applications, but since it is a requirement, I had to make it work!

So my main mission was clear: share a filesystem between pods on Kubernetes with ReadWriteMany access, i.e. read and write access for multiple applications on the same volume.

About filesystem on Kubernetes

In this blog, we have already covered a bit about how Kubernetes storage works, for example with the setup of an NFS server and volume provisioner for ReadWriteMany volumes.

[Schema: how persistent volumes work in Kubernetes]

This schema is a good summary of the elements required to have a persistent volume (i.e. storage that remains accessible after the container shuts down) on Kubernetes.

  1. First, install a Provisioner, which handles the abstraction between the host node or the Cloud Provider services and your Kubernetes cluster. A provisioner must follow the Container Storage Interface (CSI).
  2. A pod that needs a volume asks for it using a PersistentVolumeClaim (PVC), in which you specify (see the example manifest after this list):
    - a name to retrieve or share it
    - an access mode: ReadWriteOnce, ReadOnlyMany or ReadWriteMany
    - a storage request: 10Gi for example
    - a StorageClass name, which represents the type of volume you need. A Provisioner is responsible for one or more StorageClasses.
  3. The Provisioner reads the PVC and creates the corresponding PersistentVolume for the pod, handling all the complex logic of asking the cloud provider for a disk and attaching it to the node.
  4. This PersistentVolume is then mounted at a specific place in the pod's filesystem, which the pod can now use to persist (or share) its data.
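
As a minimal sketch (the claim name and StorageClass name here are illustrative), a PVC covering these fields could look like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files               # name used to retrieve or share the volume
spec:
  accessModes:
    - ReadWriteMany                # read and write access for multiple pods
  resources:
    requests:
      storage: 10Gi                # the storage request
  storageClassName: some-rwx-class # must be backed by a provisioner that supports ReadWriteMany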

If you simply specify a PVC with ReadWriteMany access, you should be good to go, no? Well, in my specific case, I was using the AWS EBS CSI driver, which comes with specific constraints.

AWS volumes for EKS: EBS vs EFS

In another article, we've covered the setup of the EFS CSI driver on AWS EKS, but what is the difference between these two?

Amazon EBS is the old, reliable way of provisioning a disk for an EC2 virtual machine on AWS. The volume is tied to an Availability Zone (AZ) and can only be attached to one EC2 instance at a time. Fortunately, it can be detached and mounted on another machine.

Amazon EFS, on the other hand, offers a Network File System (NFS) that can be shared among several machines across AZs. The performance might be a bit worse than EBS because of the overhead of the NFS layer.

The main difference, though, is the price (in the Europe (Paris) region):

  • For EBS, you pay $0.116 per GB-month of provisioned storage for classic gp2 SSD. With this pricing model, you'll pay for more than your actual usage, since you have to provision a certain capacity up front. (Note that gp3 volumes are now available at $0.0928 per GB-month, though still billed on provisioned capacity.)
  • For EFS, it's roughly 3 times more expensive at $0.33 per GB-month, but you only pay for what you actually use. However, you won't get full SSD performance.

If you need to choose between these two alternatives, here is some advice (with example StorageClasses after this list):

  • If your app is stateful and needs a lot of storage, go for EBS: you'll save a lot of money. However, moving your application between nodes will be difficult.
  • If you have a stateless app with several replicas that need to share data or regularly change nodes, EFS will save you a lot of trouble.
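
To make the comparison concrete, here is a sketch of what the two StorageClasses can look like, assuming the EBS and EFS CSI drivers are installed (the names are illustrative and fs-xxxx is a placeholder for an existing EFS file system ID):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com    # EBS: attached to a single node, ReadWriteOnce only
parameters:
  type: gp3                     # the cheaper SSD volume type
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com    # EFS: NFS shared across nodes and AZs, supports ReadWriteMany
parameters:
  provisioningMode: efs-ap      # dynamic provisioning through EFS access points
  fileSystemId: fs-xxxx         # placeholder: ID of an existing EFS file system
  directoryPerms: "700"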

For my specific case, in my EKS cluster, I didn't have a choice and stayed on EBS. But this gave me a major constraint: since an EBS volume can only be mounted on a single EC2 node, all pods sharing the volume need to run on the same Kubernetes node (i.e. the same EC2 virtual machine).

Gather pods with PodAffinity

This was a use case I had never quite encountered before. I could have created a dedicated node pool with a single node just for Sentry and used a nodeSelector to force all pods to run on it; however, that would waste computing resources and complicate my infrastructure.

Thankfully, you can go further than nodeSelector with node affinity and pod affinity. They allow you to give the scheduler precise conditions for placing your pods on specific nodes. For example, you could:

  • Run pods on a specific set of nodes, just as with a nodeSelector, but using nodeAffinity.
  • Prevent two pods of the same application from running on the same node (running them together is bad practice for high availability), using podAntiAffinity.
  • With the same rule but a different topology key, you can enforce that pods avoid being in the same Availability Zone (AZ), as shown in the sketch after this list.
  • On the contrary, you can enforce that two pods stay close together (same node or same AZ, depending on the topology key), using podAffinity. This is exactly what I needed.
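
For instance, here is a sketch of a podAntiAffinity rule, placed in a pod template's spec, that spreads the replicas of a hypothetical my-app deployment across AZs:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app                           # other pods of the same application
        topologyKey: topology.kubernetes.io/zone  # do not share an Availability Zone with them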

For my specific use case, I want all 4 pods of the Sentry stack to share the same node so they can use the same EBS volume, but I don't care which node it is, since the volume can move within its zone (if the cluster is multi-AZ, make sure with nodeAffinity that you stay in a specific AZ). Therefore, I'll use podAffinity.

My main pod, sentry-web, can be identified by its labels app=sentry and role=web. For the two other deployments sharing its filesystem, I'll force them to stay on the same node with podAffinity.

Here is an extract of my values.yaml:
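
The exact value paths depend on the chart version (typically something like worker.affinity and cron.affinity in the old chart), but the important part is a podAffinity rule targeting the sentry-web labels:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: sentry
            role: web                        # stick to the sentry-web pod
        topologyKey: kubernetes.io/hostname  # same node, hence same EBS volume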

I redeployed my app with Helm; let's check with a one-liner whether all my pods ended up on the same node:

$ kubectl get pods -l app=sentry -o 'jsonpath={.items[*].spec.nodeName}' | tr " " "\n" | sort -n | uniq -c

4 ip-***-***-***-***.eu-west-3.compute.internal

All 4 pods are on the same Node, and can therefore use the same EBS disk!

As you can see, I used requiredDuringSchedulingIgnoredDuringExecution, so if my main web pod moves for one reason or another without the other pods being restarted, I'm in trouble. A requiredDuringSchedulingRequiredDuringExecution mode, which would solve this issue by rebalancing pods dynamically when such an event occurs, doesn't exist (yet) in Kubernetes!

Final thoughts: was it a good idea after all?

If you are familiar enough with how scheduling works in Kubernetes, you can see that this technique can fail. My main web pod could be scheduled on a busy node with few resources left. The other pods would then stay in Pending, since they MUST be on the same node, but they also MUST not overload it (the sum of resource requests must not exceed the node's capacity).

You could work around this with eviction and PodDisruptionBudgets, but that is another subject, and I won't go down that rabbit hole!

If I had the opportunity to refactor this part, I would use EFS storage (or even AWS S3 if Sentry supports it), paying the (small) extra price for more stability and less complexity! I would get easy scheduling, real horizontal scalability, and multi-AZ availability.

That’s also the job of a Cloud Engineer: make the right compromise between cost, complexity, and delivery. Here at Padok, we try to find the best solution for our clients!