Advanced Kubernetes pod scheduling

Posted on 7 December 2023.

Kubernetes offers a range of powerful features that enable you to fine-tune how pods are scheduled onto nodes. In this article, I show you different Kubernetes scheduling options that may appear similar but serve distinct purposes, such as PodAffinity/PodAntiAffinity and TopologySpreadConstraints.

What are PodAffinity & PodAntiAffinity?

Pod affinity is a feature that allows you to specify rules for how pods should be scheduled onto topologies (e.g. a node or an availability zone) within a Kubernetes cluster based on the presence or characteristics of other pods. Similarly, the PodAntiAffinity property allows you to schedule pods into topologies based on the absence of other pods.

Common parameters


PodAffinity & PodAntiAffinity require 3 parameters:

  • A scheduling mode that allows you to define how strictly the rule is enforced
    • requiredDuringSchedulingIgnoredDuringExecution: Pods must be scheduled in a way that satisfies the defined rule. If no topologies that meet the rule's requirements are available, the pod will not be scheduled at all. It will remain in a pending state until a suitable node becomes available.
    • preferredDuringSchedulingIgnoredDuringExecution: This rule type is more flexible. It expresses a preference for scheduling pods based on the defined rule but doesn't enforce a strict requirement. If topologies that meet the preference criteria are available, Kubernetes will try to schedule the pod there. However, if no such topologies are available, the pod can still be scheduled on other nodes that do not violate the preference. When using this parameter, you also have to pass a weight parameter (a number between 1 and 100) that defines the rule's priority when you specify multiple rules.
  • A Label Selector used to target specific pods for which the affinity will be applied.
  • A Topology Key, which is the node label whose value defines a topology: nodes that share the same value for this label belong to the same topology. You can use any node label for this parameter. Common examples of topology keys are:
    • kubernetes.io/hostname - Pod scheduling is based on node hostnames.
    • kubernetes.io/arch - Pod scheduling is based on node CPU architectures.
    • topology.kubernetes.io/zone - Pod scheduling is based on availability zones.
    • topology.kubernetes.io/region - Pod scheduling is based on node regions.

You might wonder why there are no requiredDuringExecution-type parameters. The reason is that, as of now, Kubernetes does not offer native support for descheduling running pods, so parameters of this type do not exist. However, you can still enforce the eviction of pods that violate affinity rules or spread constraints using external tools like the Kubernetes descheduler.
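
For illustration, here is a minimal sketch of a descheduler policy that evicts pods violating inter-pod anti-affinity or spread constraints. It uses the descheduler's older v1alpha1 policy format as an assumption; check the descheduler documentation for the schema that matches your version.

# Sketch only: v1alpha1 policy format, strategy names from the kubernetes-sigs/descheduler project
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true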

PodAffinity behavior


Let’s dive into a practical example. The following defines a Pod that has strictly enforced affinity to itself, the topology being a single node. We'll now examine how multiple replicas of this Pod are scheduled within a cluster.

apiVersion: v1
kind: Pod
metadata:
  name: self-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - self-affinity-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0

Initially, there are no replicas of our pod on the cluster. Since the Pod has an affinity requirement to itself, its placement will be random (on topologies that match any tolerations & node selectors). If it lacked self-affinity and instead had a required affinity to a Pod that does not exist yet, it would remain in a "pending" state.

Subsequently, additional replicas are scheduled on the same node, adhering to the specified affinity rule.

As soon as the designated topology reaches its capacity (e.g., when there is insufficient memory available on the node to accommodate a new Pod, highlighted in red below), the next replica will remain in a "pending" state. This is due to the strict enforcement of the affinity rule using requiredDuringSchedulingIgnoredDuringExecution.

Let's explore the scenario where the affinity rule is not strictly enforced. Initially, the scheduling process will proceed as it did previously. However, when the topology reaches its capacity, the next Pod in line for scheduling will be assigned to a random topology, and the affinity rule will once again be applied until the topology reaches its maximum capacity.

apiVersion: v1
kind: Pod
metadata:
  name: self-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - self-affinity-app
          topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0

You can also use both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution for the same pod to strictly enforce some affinity rules and apply other rules when possible.
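
As a minimal sketch of what combining both could look like, the following Pod strictly requires co-location with its app's pods within an availability zone while only preferring the same node (the pod name combined-affinity-pod and the weight of 50 are arbitrary choices for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: combined-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      # Strictly enforced: must land in a zone that already hosts a matching pod
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: self-affinity-app
        topologyKey: topology.kubernetes.io/zone
      # Best effort: prefer the exact same node when possible
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: self-affinity-app
          topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0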

PodAntiAffinity behavior


PodAntiAffinity is used to prevent the simultaneous scheduling of pods on the same topology. Let’s dive into an example that is analogous to the previous one, where the pods are scheduled with an anti-affinity rule to themselves.
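
A minimal sketch of such a Pod could look like the following, reusing the structure of the previous manifests with podAntiAffinity instead of podAffinity (the name and label are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: self-anti-affinity-pod
  labels:
    app: self-anti-affinity-app
spec:
  affinity:
    podAntiAffinity:
      # Strictly enforced: refuse nodes that already host a matching pod
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - self-anti-affinity-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0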

Initially, the first pod is scheduled onto a topology at random. When the rule is strictly enforced using requiredDuringSchedulingIgnoredDuringExecution, pods will continue to be allocated to different topologies until each topology houses one pod. After this point, pods will remain in a "pending" status, awaiting available topologies to accommodate them.

If the affinity rule is not strictly enforced (i.e. with a preferredDuringScheduling... statement), pods will be scheduled as shown above. But once every topology already hosts one pod, the remaining pods will continue to be scheduled following the default scheduler rules and fill all topologies.

What are TopologySpreadConstraints?

As their name implies, TopologySpreadConstraints are scheduling constraints that allow you to evenly spread pods across topologies. More often than not, self-anti-affinity is misused where a spread constraint would be the better fit. TopologySpreadConstraints require 4 parameters:

  • LabelSelector: A selector used to target the pods to which the constraint applies.
  • TopologyKey: The node label whose value defines a topology, as for affinities.
  • maxSkew: The maximum allowed skew, i.e. the maximum difference in the number of matching pods between any two topologies.
  • whenUnsatisfiable: An instruction telling the scheduler what to do when the constraint cannot be satisfied. Similarly to the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution parameters for pod affinities, it can be set to DoNotSchedule or ScheduleAnyway.

When spread constraints are in place, pods are free to be scheduled on any topology as long as the following rules are respected:

  • The difference in the number of matching pods between any two topologies never exceeds the maxSkew parameter
  • If whenUnsatisfiable is set to DoNotSchedule, pods will remain in a pending state if the rule cannot be respected

For instance, the following spread constraint will force pods to be spread across different nodes, with a maximum skew of 2:

apiVersion: v1
kind: Pod
metadata:
  name: spread-pod
  labels:
    app: spread-app
spec:
  topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - spread-app
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0

Note that you can define cluster-level default spread constraints when you manage your own control plane, and that the following default constraints are applied by the Kubernetes scheduler as of Kubernetes 1.24.

defaultConstraints:
  - maxSkew: 3
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway
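
For reference, such defaults are configured through the PodTopologySpread plugin arguments of a KubeSchedulerConfiguration passed to kube-scheduler. The following is a minimal sketch, assuming you can supply that configuration file; the apiVersion and values shown are illustrative and may differ between Kubernetes releases.

# Sketch only: adapt the apiVersion and constraints to your cluster
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List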

How to choose between PodAntiAffinity & TopologySpreadConstraints?

TopologySpreadConstraints, often set with a maxSkew of 1, serve as an alternative to strict self-anti-affinity rules to prevent pods from landing on the same node. This works well in larger clusters, but if there are fewer nodes than pods, spread constraints will not prevent pods from scheduling on the same node.
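
For example, a sketch of the constraint typically used as a substitute for strict self-anti-affinity, reusing the spread-app label from the earlier example:

# Spread matching pods across nodes, allowing a difference of at most 1 pod between nodes
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: spread-app

Unlike strict anti-affinity, once every node hosts one matching pod, the next pod can still be scheduled: adding a second pod to a node keeps the skew at 1, which is within maxSkew.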

As their name implies, TopologySpreadConstraints should be used to evenly spread pods across topologies, not as a way to keep apart pods that cannot coexist on the same topology.

Practical use cases

Here are a few practical use cases of affinity rules and spread constraints to better understand when to use them:

  • Anti-affinity rules must be strictly enforced when two pods access the same folder on the host node's filesystem and their simultaneous presence would lead to a crash.
  • Most of the time, TopologySpreadConstraints are used to spread pods onto different nodes or availability zones to achieve High Availability
  • In a database replication setup, you want to ensure that primary and secondary database pods do not run on the same node to prevent a single point of failure. You can use TopologySpreadConstraints to ensure that replicas are placed on different nodes.
  • In a large-scale batch processing system, you may have multiple jobs sharing common resources like GPUs or specialized hardware. PodAffinity can be used to group pods requiring the same resource type on nodes equipped with that resource to make a more efficient use of it.
  • In a microservices architecture such as an e-commerce website, different microservices (e.g., catalog, cart, and payment) may need to communicate frequently. You can use PodAffinity to schedule pods of related microservices on nodes that are close to each other to reduce network latency (see the sketch after this list).
  • Anti-affinity can also be beneficial in scenarios where you want to prevent placing a pod on the same node if it could disrupt the performance of another existing pod. For instance, pods that frequently require high CPU bursts.
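
To illustrate the microservices case above, here is a minimal sketch of a preferred affinity that a cart pod could declare towards catalog pods. The app: catalog label and the weight of 80 are hypothetical values used only for illustration.

# Hypothetical labels and weight, for illustration only
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: catalog
        # Prefer landing in a zone that already hosts catalog pods
        topologyKey: topology.kubernetes.io/zone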

To go further

If you want to experiment with advanced pod scheduling by yourself, a great way to do so is to install Minikube and start a local Kubernetes cluster in which the number of pods per node is limited. Other solutions are also available to run local Kubernetes clusters.

# This will start a Kubernetes cluster with 4 nodes that can host a maximum of 8 pods each
minikube start --nodes 4 --extra-config=kubelet.max-pods=8

Use the following boilerplate Kubernetes deployment manifest and update it to experiment with affinities and spread constraints:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: self-affinity-required
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:2.0
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname

Finally, to master pod scheduling within your Kubernetes cluster, it's essential to have a solid understanding of concepts related to Taints and Tolerations.