Kubernetes Pod Scheduling: From Placement to Troubleshooting

Understanding how Kubernetes places pods — and why it sometimes doesn't — is one of the most practical skills you can build as a platform engineer. This article walks through the scheduler pipeline, the key mechanisms that control placement, and a step-by-step debugging approach when a pod gets stuck.

How the Kubernetes Scheduler Places a Pod

When a pod is created, it starts in the Pending phase. The kube-scheduler watches for unscheduled pods and evaluates candidate nodes in two phases, filtering and scoring, then binds the pod to the winning node so the kubelet can start it:

1. Filtering: remove nodes that cannot run the pod
2. Scoring: rank the remaining nodes by preference
3. Binding: assign the pod to the highest-scoring node
4. Running: the kubelet pulls the image and starts the containers

Filtering is a hard constraint: if a node fails any filter, the pod will not run there, regardless of how much spare capacity the node has. The most common filters are:

  • Resource fit — node has enough CPU and memory for the pod's requests
  • Taints — node rejects pods that don't tolerate its taints
  • Node affinity (required) — pod has a hard rule about which nodes it will run on
  • Pod anti-affinity (required) — pod must not land on a node where a conflicting pod already runs
  • Volume availability — node is in the right zone for the pod's PersistentVolume

Scoring is soft — it ranks nodes based on preferences like spreading pods across zones, balancing resource utilisation, or honouring preferred affinity rules. If only one node passes filtering, the pod goes there regardless of score.

Key insight: A pod stuck in Pending almost always means every node failed filtering. Scoring is never reached. Your troubleshooting should focus on identifying which filter is eliminating all nodes.
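
The two phases can be illustrated with a toy model. This is a minimal sketch, not the real scheduler: the Node and Pod classes, the string form of taints, and the "most free CPU wins" score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                              # unrequested CPU in millicores
    taints: set = field(default_factory=set)     # hypothetical string form of taints


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    tolerations: set = field(default_factory=set)


def filter_nodes(pod, nodes):
    """Hard phase: keep only the nodes that pass every filter."""
    return [
        n for n in nodes
        if n.free_cpu_m >= pod.cpu_request_m     # resource fit
        and n.taints <= pod.tolerations          # every taint is tolerated
    ]


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod stays Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None                              # scoring is never reached
    # Soft phase: a toy score that prefers the node with the most free CPU.
    return max(feasible, key=lambda n: n.free_cpu_m).name


nodes = [
    Node("node-a", free_cpu_m=200),
    Node("node-b", free_cpu_m=1500, taints={"workload=gpu:NoSchedule"}),
    Node("node-c", free_cpu_m=800),
]
print(schedule(Pod("web", cpu_request_m=500), nodes))    # node-c
print(schedule(Pod("big", cpu_request_m=1000), nodes))   # None
```

The second pod stays unscheduled even though node-b has plenty of free CPU, because the taint filter removes node-b before any scoring happens.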

Taints and Tolerations

A taint is applied to a node and repels pods from being scheduled there. A toleration is applied to a pod and allows it to be scheduled on a tainted node. Think of it as a lock (taint) and key (toleration).

Taint Effects

Every taint has an effect that controls how strictly it is enforced:

NoSchedule

New pods without a matching toleration will not be scheduled on this node. Existing pods are not affected.

PreferNoSchedule

Scheduler will try to avoid placing pods here, but will if no other node is available. A soft version of NoSchedule.

NoExecute

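New pods without a matching toleration are not scheduled, and existing pods without one are evicted. The node controller uses this effect for unhealthy nodes (for example node.kubernetes.io/not-ready and node.kubernetes.io/unreachable).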

Example: Dedicated GPU Nodes

Taint a node so only GPU workloads land on it:

# Taint the node
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule

Add a toleration to pods that should run on GPU nodes:

tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
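
The matching rule between a toleration and a taint can be written down in a few lines. The sketch below follows the documented semantics (key and effect must match, Exists ignores the value, an empty effect matches every effect); the function names and the dict representation are assumptions for illustration.

```python
def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if this toleration tolerates this taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Return the NoSchedule/NoExecute taints that no toleration matches."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(toleration_matches(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "workload", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "workload", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}

print(untolerated([gpu_taint], [gpu_toleration]))   # []
print(untolerated([gpu_taint], []))                 # the GPU taint blocks the pod
```

A node passes the taint filter only when the list returned by untolerated is empty.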

Common Use Cases

  • Dedicated infrastructure nodes: taint nodes reserved for monitoring, ingress controllers, or system components
  • Spot / preemptible nodes: taint spot nodes so only batch or fault-tolerant workloads land there
  • Node drain: kubectl drain cordons the node, which adds the node.kubernetes.io/unschedulable:NoSchedule taint, and then removes pods through the Eviction API rather than a NoExecute taint
  • GPU / specialised hardware: restrict GPU nodes to ML workloads only

Common mistake: Forgetting to add tolerations to DaemonSets. The DaemonSet controller automatically adds tolerations for built-in taints such as node.kubernetes.io/not-ready, but not for taints you define yourself. System DaemonSets (e.g. logging agents, CNI plugins) need tolerations for your custom taints or they won't run on those nodes, which can break cluster networking or observability.

Scenario: Two Node Pools — One Tainted, One Not

This is one of the most common sources of confusion when working with dedicated node pools. Consider this setup:

# Node Pool A — GPU / dedicated workload nodes
kubectl taint nodes -l nodepool=gpu workload=gpu:NoSchedule

# Node Pool B — general purpose nodes (no taint)
# (no taint applied)

Now a pod is created with a toleration for workload=gpu:NoSchedule. Where does it land?

❌ Common misconception

"The pod has a toleration for the GPU taint, so it will go to Node Pool A."

Wrong. A toleration only removes the repulsion from a tainted node. It does not attract the pod to it. The scheduler will happily place the pod on Node Pool B (no taint, no barrier) if it scores better.

✅ What actually happens

The pod is eligible for both node pools. The scheduler places it on whichever eligible node scores highest, and nothing in the pod spec favours Node Pool A. If Pool B has more free capacity, the pod will most likely land there.

To guarantee the pod lands on Node Pool A, you need both a toleration and a node affinity (or nodeSelector):

spec:
  # 1. Toleration — removes the taint barrier (required to enter Node Pool A)
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

  # 2. Node affinity — actively pulls the pod to Node Pool A
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nodepool
                operator: In
                values:
                  - gpu

  containers:
    - name: my-app
      image: my-app:latest

The same logic applies in reverse. If you want a pod to run only on general nodes and never on GPU nodes, it does not need a toleration: the taint on Pool A repels it automatically. Add a node affinity for Node Pool B only if other untainted pools exist that the pod should also avoid.

Rule of thumb for dedicated node pools:
  • Taint the dedicated nodes → repels all general workloads
  • Add toleration to dedicated pods → allows entry to tainted nodes
  • Add node affinity to dedicated pods → ensures they actually land there
Toleration alone is never sufficient for dedicated scheduling. Always pair it with node affinity.
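
The rule of thumb can be checked with a small eligibility model. This is a sketch under simplifying assumptions: three hypothetical nodes, a single taint represented as a boolean, and a required affinity represented as an exact pool name.

```python
NODES = [
    {"name": "gpu-1", "nodepool": "gpu", "tainted": True},
    {"name": "gen-1", "nodepool": "general", "tainted": False},
    {"name": "gen-2", "nodepool": "general", "tainted": False},
]


def eligible(nodes, tolerates_gpu_taint, required_pool=None):
    """Return the names of nodes that survive the taint and affinity filters."""
    result = []
    for node in nodes:
        if node["tainted"] and not tolerates_gpu_taint:
            continue                      # the taint repels the pod
        if required_pool is not None and node["nodepool"] != required_pool:
            continue                      # required node affinity does not match
        result.append(node["name"])
    return result


print(eligible(NODES, tolerates_gpu_taint=True))                        # all three nodes
print(eligible(NODES, tolerates_gpu_taint=True, required_pool="gpu"))   # only gpu-1
print(eligible(NODES, tolerates_gpu_taint=False, required_pool="gpu"))  # nothing: Pending
```

Only the combination of toleration and required affinity narrows the candidates to the dedicated pool. Affinity without the toleration leaves no eligible node at all.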

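Managed Kubernetes providers follow the same pattern for specialised node pools, although not all of them apply the taint for you. On AKS, the usual approach is to add a taint such as sku=gpu:NoSchedule yourself when you create the GPU node pool. GKE taints GPU nodes automatically with nvidia.com/gpu=present:NoSchedule and adds the matching toleration to pods that request nvidia.com/gpu. Either way, the workload ends up with a toleration plus a node selector or affinity: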

# AKS GPU node pool example
tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agentpool
              operator: In
              values:
                - gpunodepool

# GKE GPU node pool example
tolerations:
  - key: "nvidia.com/gpu"        # taint GKE places on GPU nodes (nvidia.com/gpu=present:NoSchedule)
    operator: "Exists"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-nodepool
              operator: In
              values:
                - gpu-node-pool

Provider-specific Node Taints

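Some taints are applied automatically by Kubernetes or the provider, while others are conventions you apply yourself:

  • AKS Spot nodes: kubernetes.azure.com/scalesetpriority=spot:NoSchedule (automatic)
  • EKS Spot nodes: labelled eks.amazonaws.com/capacity-type=SPOT; managed node groups do not taint them by default, so add your own taint if you want to repel general workloads
  • GKE Spot nodes: labelled cloud.google.com/gke-spot=true; standard node pools are not tainted by default, and a common convention is to add cloud.google.com/gke-spot=true:NoSchedule to the pool
  • K8s not-ready nodes: node.kubernetes.io/not-ready:NoExecute (automatic)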

Node Affinity

Node affinity lets pods express rules about which nodes they require or prefer, based on node labels. It extends the simpler nodeSelector field, which is still supported, with more expressive operators (In, NotIn, Exists, DoesNotExist, Gt, Lt).

Required vs Preferred

requiredDuringSchedulingIgnoredDuringExecution

Hard rule. Pod will not schedule if no node matches. Treated as a filter — eliminates non-matching nodes entirely.

preferredDuringSchedulingIgnoredDuringExecution

Soft rule. Scheduler will prefer matching nodes but will place the pod elsewhere if needed. Used in scoring phase.

Note on "IgnoredDuringExecution": Both types ignore affinity rules for pods that are already running. If you change a node label, existing pods are not evicted. A requiredDuringSchedulingRequiredDuringExecution variant that would evict pods when labels change has long been proposed, but it is not implemented in current Kubernetes releases.

Example: Schedule in a Specific Zone

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a
                - us-east-1b
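
How a required rule like this is evaluated against node labels can be sketched as follows. The sketch assumes the documented semantics (terms are ORed, expressions inside a term are ANDed, an empty term matches nothing); the function names are illustrative, not part of any Kubernetes client library.

```python
def expr_matches(expr: dict, labels: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return labels.get(key) not in values      # also true when the label is absent
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def node_matches(node_selector_terms: list, labels: dict) -> bool:
    """Terms are ORed; expressions within a term are ANDed."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        if exprs and all(expr_matches(e, labels) for e in exprs):
            return True
    return False


zone_rule = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]},
]}]

print(node_matches(zone_rule, {"topology.kubernetes.io/zone": "us-east-1a"}))  # True
print(node_matches(zone_rule, {"topology.kubernetes.io/zone": "us-east-1c"}))  # False
```

A node with no zone label at all also fails the rule, which is a common cause of the "didn't match Pod's node affinity/selector" event.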

Example: Prefer High-Memory Nodes

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - r5.2xlarge
                - r5.4xlarge

Common Node Labels by Provider

These labels are automatically applied to nodes and are commonly used in affinity rules:

# Standard K8s labels (all providers)
topology.kubernetes.io/zone          # e.g. us-east-1a, eastus-1
topology.kubernetes.io/region        # e.g. us-east-1, eastus
kubernetes.io/os                     # linux, windows
kubernetes.io/arch                   # amd64, arm64
node.kubernetes.io/instance-type     # VM size/type

# AWS EKS
eks.amazonaws.com/nodegroup          # node group name
eks.amazonaws.com/capacity-type      # ON_DEMAND or SPOT

# Azure AKS
kubernetes.azure.com/agentpool       # node pool name
kubernetes.azure.com/node-image-version

# GCP GKE
cloud.google.com/gke-nodepool        # node pool name
cloud.google.com/machine-family      # e.g. n2, c2, t2d

Pod Anti-Affinity

While node affinity selects nodes based on node labels, pod anti-affinity makes scheduling decisions based on what pods are already running on a node or in a topology zone. It is the primary mechanism for spreading replicas to avoid a single point of failure.

Example: Spread Replicas Across Nodes

Ensure no two replicas of the same app land on the same node:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - my-api
        topologyKey: kubernetes.io/hostname

Example: Spread Across Availability Zones (Soft)

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - my-api
          topologyKey: topology.kubernetes.io/zone

Topology Spread Constraints — the modern alternative

For spreading across zones or nodes, Topology Spread Constraints (stable since K8s 1.19) are now generally preferred over pod anti-affinity. They give more control over how evenly pods are distributed:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-api
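
The maxSkew check can be worked through by hand. The sketch below assumes the documented definition of skew (matching pods in the target domain after placement, minus the current minimum across domains); the zone names and counts are made up for illustration.

```python
def can_place(counts: dict, domain: str, max_skew: int = 1) -> bool:
    """counts maps each topology domain to its number of matching pods."""
    after_placement = counts[domain] + 1
    skew = after_placement - min(counts.values())
    return skew <= max_skew


counts = {"zone-a": 1, "zone-b": 1, "zone-c": 0}
print(can_place(counts, "zone-c"))   # True: skew would be 1
print(can_place(counts, "zone-a"))   # False: skew would be 2
```

With whenUnsatisfiable: DoNotSchedule a failed check acts as a filter on the nodes in that domain; with ScheduleAnyway it only lowers their score.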

Hard anti-affinity trap: Using requiredDuringSchedulingIgnoredDuringExecution with pod anti-affinity and topologyKey: kubernetes.io/hostname means your deployment can never have more running replicas than eligible nodes. If you scale to 5 replicas but only have 3 nodes, 2 pods will stay Pending until you add nodes. Use preferred anti-affinity or Topology Spread Constraints with whenUnsatisfiable: ScheduleAnyway instead.
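
The replica ceiling is easy to reproduce with a toy placement loop. This sketch assumes every node is otherwise eligible and models only the one-replica-per-hostname rule; the pod and node names are hypothetical.

```python
def place_with_hard_anti_affinity(replicas: int, node_names: list):
    """Place at most one replica per node; return (placements, pending_count)."""
    placements = {}
    free_nodes = list(node_names)
    pending = 0
    for i in range(replicas):
        if free_nodes:
            placements[f"my-api-{i}"] = free_nodes.pop(0)
        else:
            pending += 1                 # no node without a conflicting replica
    return placements, pending


placements, pending = place_with_hard_anti_affinity(5, ["node-1", "node-2", "node-3"])
print(placements)   # three replicas, one per node
print(pending)      # 2
```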

Pod Disruption Budgets (PDB)

A PDB is not a scheduling rule — it is an eviction budget. It tells Kubernetes how many pods of a given workload can be voluntarily disrupted at the same time. It applies during:

  • kubectl drain — draining a node for maintenance or upgrade
  • Cluster autoscaler scale-down — removing an underutilised node
  • Rolling upgrades — managed K8s node pool upgrades (EKS, AKS, GKE)

PDB does not affect scheduling. A PDB will not prevent a pod from being scheduled. It only blocks voluntary eviction. Involuntary disruptions (node failure, OOMKill) ignore PDBs entirely.

minAvailable vs maxUnavailable

minAvailable

At least this many pods must remain running during disruption. Can be a number or percentage. Eviction is blocked if it would drop below this.

maxUnavailable

At most this many pods can be unavailable at once. Kubernetes will wait for pods to recover before evicting more.

Example: Allow one pod down at a time

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-api

Example: Always keep 80% of pods running

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: my-api

The node drain deadlock: If you set minAvailable: 100% (or maxUnavailable: 0) and have only one replica, kubectl drain will hang indefinitely, because it cannot evict the pod without violating the PDB. Always ensure your PDB allows at least one pod to be disrupted, or scale replicas above the minimum before draining.
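
The arithmetic behind the ALLOWED DISRUPTIONS value can be sketched as below. It assumes the documented behaviour that percentages are resolved against the expected pod count and rounded up; the function names are illustrative.

```python
def _resolve(value, expected_pods: int) -> int:
    """Turn 3 or "80%" into an absolute pod count (percentages round up)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * expected_pods + 99) // 100   # integer ceiling
    return int(value)


def allowed_disruptions(expected_pods, healthy_pods, min_available=None, max_unavailable=None):
    """Return how many pods may be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _resolve(min_available, expected_pods)
    else:
        desired_healthy = expected_pods - _resolve(max_unavailable, expected_pods)
    return max(0, healthy_pods - desired_healthy)


print(allowed_disruptions(1, 1, min_available="100%"))   # 0: drain hangs
print(allowed_disruptions(5, 5, min_available="80%"))    # 1
print(allowed_disruptions(3, 3, max_unavailable=1))      # 1
```

Whenever the result is 0, every eviction request for that workload is refused and kubectl drain keeps retrying.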

How They Work Together

These mechanisms operate at different stages and can interact in non-obvious ways:

Pod created (Pending)
       │
       ▼
 ┌─────────────────────────────────────────────────┐
 │              FILTERING (hard rules)             │
 │  1. Resource requests fit node capacity?        │
 │  2. Pod tolerates the node's taints?  ◄──────── │── Taints & Tolerations
 │  3. Node matches required node affinity? ◄───── │── Node Affinity (required)
 │  4. No conflicting pod on node/zone? ◄───────── │── Pod Anti-Affinity (required)
 │  5. Volume available in node's zone?            │
 └──────────────────────┬──────────────────────────┘
                        │ Nodes remaining after filtering
                        ▼
 ┌─────────────────────────────────────────────────┐
 │               SCORING (soft rules)              │
 │  - Preferred node affinity weight               │
 │  - Preferred pod anti-affinity weight           │
 │  - Resource balancing across nodes              │
 │  - Topology spread (ScheduleAnyway)             │
 └──────────────────────┬──────────────────────────┘
                        │ Highest scoring node
                        ▼
                  Pod Bound → Running
                        │
              (Later, during drain/upgrade)
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │            PDB CHECK (eviction gate)            │
 │  Can this pod be evicted without violating      │
 │  minAvailable / maxUnavailable?                 │
 └─────────────────────────────────────────────────┘

A common real-world interaction:

  • You have 3 nodes and a deployment with required pod anti-affinity per node
  • You scale to 4 replicas → one pod stuck Pending (no eligible node)
  • You try to drain a node for an upgrade → the drain hangs: with a PDB of minAvailable: 3, only 3 pods are healthy, so the allowed disruptions count is already 0
  • Fix: Use preferred anti-affinity or add a fourth node before scaling

Replicas and Horizontal Pod Autoscaling (HPA)

Every pod that is created — whether manually via replicas or automatically via HPA — goes through the same scheduler pipeline. Understanding how replica count interacts with scheduling rules, PDB, and node capacity is critical for running reliable workloads.

Setting Replicas Manually

The replicas field in a Deployment, StatefulSet, or ReplicaSet defines how many pod instances Kubernetes should maintain at all times. The scheduler places each replica independently — each one must pass all filters.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3          # desired number of pods
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api    # must match spec.selector.matchLabels
    spec:
      containers:
        - name: my-api
          image: my-api:latest
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

Always set resource requests. Without requests, the scheduler treats the pod as needing almost nothing when it checks resource fit, so nodes become overcommitted and pods are throttled, OOMKilled, or evicted under node pressure at runtime.

Horizontal Pod Autoscaler (HPA)

HPA automatically adjusts the replicas field of a workload based on observed metrics — scaling up when load increases and scaling down when load drops. It runs as a control loop, checking metrics every 15 seconds by default.

Scale Up

HPA increases replicas when current metric value exceeds the target. New pods are created and go through the scheduler. If nodes are full, pods go Pending.

Scale Down

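HPA decreases replicas when load drops. The workload controller deletes the surplus pods directly rather than through the Eviction API, so PDBs are not consulted. Scale-down is paced by the HPA stabilization window (5 minutes by default), not by the PDB.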

Example: CPU-based HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale up when avg CPU > 70%
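
The replica count HPA asks for follows a documented formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that formula and clamps to minReplicas and maxReplicas; it deliberately ignores stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for one utilisation metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current                     # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 70, 2, 10))    # 6: scale up
print(desired_replicas(4, 72, 70, 2, 10))    # 4: within tolerance
print(desired_replicas(4, 20, 70, 2, 10))    # 2: scale down
print(desired_replicas(8, 140, 70, 2, 10))   # 10: capped at maxReplicas
```

When several metrics are configured, HPA computes a desired count for each and uses the largest.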

Example: Memory + Custom Metric HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
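
HPA requires a metrics source. CPU- and memory-based HPA depends on metrics-server (or another implementation of the resource metrics API) being installed in the cluster. AKS and GKE ship it by default; on EKS you may need to install it yourself, for example as an add-on. For custom metrics you need a custom metrics adapter such as Prometheus Adapter, or KEDA.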

KEDA — Event-driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale based on external event sources such as queue depth, database row count, and Prometheus metrics. It can also scale a deployment to zero when there are no events, which native HPA cannot do by default (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled).

# KEDA ScaledObject — scale based on Azure Service Bus queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-worker-scaler
spec:
  scaleTargetRef:
    name: my-worker
  minReplicaCount: 0       # scale to zero when queue is empty
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: my-queue
        queueLength: "5"   # one replica per 5 messages in queue
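
The effect of queueLength: "5" can be approximated with a one-line calculation. This is a steady-state sketch only: KEDA itself handles activation between 0 and 1 replicas and delegates scaling between 1 and N to an HPA, so real behaviour also depends on polling intervals and cooldown. The function name is hypothetical.

```python
def worker_replicas(queue_messages, messages_per_replica=5,
                    min_replicas=0, max_replicas=20):
    """Approximate replica count for a given queue depth."""
    if queue_messages <= 0:
        return min_replicas                                 # empty queue: scale to the floor
    wanted = -(-queue_messages // messages_per_replica)     # integer ceiling
    return max(min_replicas, min(max_replicas, max(1, wanted)))


print(worker_replicas(0))     # 0
print(worker_replicas(12))    # 3
print(worker_replicas(500))   # 20: capped at maxReplicaCount
```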

How HPA Interacts with Scheduling Rules

When HPA scales up, the new pods must pass all the same scheduling filters. This creates several common failure patterns:

HPA scales up → pods Pending

Node capacity is exhausted or anti-affinity prevents placement. HPA has already raised the replica count, so the new pods sit Pending until capacity appears. Fix: enable Cluster Autoscaler to add nodes, or relax anti-affinity from required to preferred.

HPA + hard anti-affinity = replica ceiling

With required pod anti-affinity per hostname, maxReplicas is effectively capped at node count. Set maxReplicas no higher than your node count, or use preferred.

HPA scales down → PDB is not consulted

HPA scale-down removes pods through the workload controller, not the Eviction API, so a PDB does not slow it down or block it. If scale-down seems slow, look at the HPA stabilization window (5 minutes by default) rather than the PDB.

HPA fights with minReplicas vs PDB minAvailable

If HPA minReplicas is 2 and PDB minAvailable is 2, a node drain blocks whenever the workload is sitting at 2 replicas: the PDB needs both pods healthy, so the allowed disruptions count is 0. Set PDB maxUnavailable: 1 instead.

Recommended: Align HPA, PDB, and Anti-Affinity

# Deployment
replicas: 2            # starting point, HPA will manage this

# HPA
minReplicas: 2         # never go below 2 (HA baseline)
maxReplicas: 10        # headroom for scale-up

# PDB — use maxUnavailable, not minAvailable, to avoid conflicts
maxUnavailable: 1      # allow 1 pod down at a time during drain

# Anti-affinity: use preferred so HPA isn't capped at node count
preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app: my-api
      topologyKey: kubernetes.io/hostname

Use maxUnavailable in PDB rather than minAvailable when you have HPA. maxUnavailable: 1 allows one disruption at any replica count. minAvailable: 2 becomes a problem if HPA ever scales down to exactly 2, because zero disruptions are then allowed.

Troubleshooting: Pod Stuck in Pending

Follow this sequence when a pod won't schedule or a drain won't complete.

1. Describe the pod and read the Events section

The scheduler writes the reason for failure directly into pod events. This is always your first stop.

kubectl describe pod <pod-name> -n <namespace>

Look for messages like:

  • 0/3 nodes are available: 3 node(s) had untolerated taint → taint/toleration mismatch
  • 0/3 nodes are available: 3 Insufficient cpu → resource requests too high
  • 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector → affinity misconfiguration
  • 0/3 nodes are available: 3 node(s) didn't satisfy existing pods anti-affinity rules → too many replicas for available nodes
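
When many pods are Pending, it can help to classify these messages automatically. The sketch below maps the message fragments listed above to a likely cause. The pattern table is an assumption built from those examples only, and the exact event wording varies between Kubernetes versions.

```python
import re

# Message fragment -> likely cause (based on the examples in this article).
CAUSES = [
    (r"untolerated taint", "taint/toleration mismatch"),
    (r"Insufficient (cpu|memory)", "resource requests exceed free capacity"),
    (r"didn't match Pod's node affinity/selector", "node affinity or nodeSelector mismatch"),
    (r"anti-affinity rules", "pod anti-affinity leaves no eligible node"),
]


def classify(message: str) -> list:
    """Return every likely cause whose pattern appears in the event message."""
    return [cause for pattern, cause in CAUSES if re.search(pattern, message)]


event = "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint"
print(classify(event))
```

A single event often lists several reasons, one per group of nodes, so the function returns a list rather than a single cause.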

2. Check node capacity and conditions

Verify nodes are Ready and have available resources.

# Node status and conditions
kubectl get nodes
kubectl describe node <node-name>

# Resource usage across all nodes
kubectl top nodes

# See allocatable vs requested resources
kubectl describe nodes | grep -A 5 "Allocated resources"

3. Check taints on nodes

List all taints across your nodes to spot mismatches with your pod's tolerations.

# All taints on all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check a specific node
kubectl describe node <node-name> | grep Taint

4. Check node labels vs affinity rules

Verify the labels your affinity rules reference actually exist on nodes.

# All labels on all nodes
kubectl get nodes --show-labels

# Filter nodes matching a specific label
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a

5. Check PDB status when drain hangs

If kubectl drain is stuck, check whether a PDB is blocking eviction.

# List all PDBs and their current status
kubectl get pdb -A

# Detailed view — check ALLOWED DISRUPTIONS column
kubectl get pdb -A -o wide

# If ALLOWED DISRUPTIONS is 0, the drain is blocked
# Check how many pods are currently available
kubectl get pods -l app=<app-name> -o wide

If a PDB shows 0 allowed disruptions and the drain is stuck, options are:

  • Scale up the deployment so more replicas are available, then retry drain
  • Temporarily delete the PDB if urgency requires it (kubectl delete pdb <name>) — recreate after drain
  • Force eviction with kubectl drain --disable-eviction — bypasses PDB but risks disruption

6. Check cluster events for the broader picture

Scheduler failures and eviction events appear in cluster-wide events.

# Recent events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Filter for Warning events only
kubectl get events -A --field-selector type=Warning

7. Debug HPA not scaling as expected

Check HPA status — it shows current vs desired replicas and the reason it's not scaling.

# Current HPA state
kubectl get hpa -n <namespace>

# Detailed conditions explaining why HPA is or isn't scaling
kubectl describe hpa <hpa-name> -n <namespace>

Common conditions to look for:

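  • ScalingLimited: True / TooManyReplicas → the desired count exceeds maxReplicas, so HPA is capped at the maximum
  • ScalingLimited: True / TooFewReplicas → the desired count is below minReplicas, so HPA is held at the minimum
  • ScalingActive: False / FailedGetResourceMetric → metrics-server not available or pod has no resource requests set
  • AbleToScale: False (for example FailedGetScale) → HPA cannot read or update the scale of the target workload

HPA conditions do not report Pending pods. If the replica count went up but the new pods are not starting, go back to step 1 and describe the pods.

# Check if metrics-server is running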
kubectl get pods -n kube-system | grep metrics-server

# Check current resource usage seen by HPA
kubectl top pods -n <namespace>

8. Validate the spec with dry-run

A server-side dry run sends the manifest through API validation and admission control without persisting it. It does not run the scheduler, so it cannot tell you whether the pod would be placed.

kubectl apply -f my-pod.yaml --dry-run=server

To find out why an existing Pending pod can't be placed, read the scheduler's events for it:

# Show the FailedScheduling events recorded for the pod
kubectl describe pod <pending-pod> | grep -A 20 Events

Key Takeaways

  • Pending = filtering eliminated all nodes. The scheduler never reaches scoring. kubectl describe pod will tell you exactly which filter failed.
  • Taints are node-level gates. A pod without the right toleration will never land on a tainted node, regardless of available resources.
  • Required affinity/anti-affinity is a hard constraint. If the rules can't be satisfied, pods stay Pending. Use preferred unless you truly need a hard requirement.
  • Hard pod anti-affinity caps your replica count to the number of matching nodes. Scale your nodes first, or switch to Topology Spread Constraints.
  • PDB blocks eviction, not scheduling. A misconfigured PDB (minAvailable: 100% with one replica) will silently deadlock a node drain or cluster upgrade.
  • Topology Spread Constraints are the modern replacement for pod anti-affinity spreading. Prefer them for zone/node spreading in new workloads.
  • Always set resource requests on pods: utilisation-based HPA cannot compute a percentage without them, and the scheduler makes poor placement decisions without them.
  • HPA + hard anti-affinity creates a hidden replica ceiling equal to your node count. Use preferred anti-affinity or Topology Spread Constraints so HPA can actually scale.
  • Use maxUnavailable in PDB instead of minAvailable when HPA is in use: it keeps allowing disruptions at any replica count and avoids deadlocks during drain.
  • KEDA extends HPA to scale on external events (queues, topics, DBs) and supports scale-to-zero — something native HPA cannot do.