Understanding how Kubernetes places pods — and why it sometimes doesn't — is one of the most practical skills you can build as a platform engineer. This article walks through the scheduler pipeline, the key mechanisms that control placement, and a step-by-step debugging approach when a pod gets stuck.
How the Kubernetes Scheduler Places a Pod
When a pod is created, it starts in a Pending state. The kube-scheduler watches for unscheduled pods and runs every candidate node through two phases: filtering and scoring.
Filtering is hard — if a node fails a filter, the pod will not run there regardless of how much spare capacity it has. The most common filters are:
- Resource fit — node has enough CPU and memory for the pod's requests
- Taints — node rejects pods that don't tolerate its taints
- Node affinity (required) — pod has a hard rule about which nodes it will run on
- Pod anti-affinity (required) — pod must not land on a node where a conflicting pod already runs
- Volume availability — node is in the right zone for the pod's PersistentVolume
Scoring is soft — it ranks nodes based on preferences like spreading pods across zones, balancing resource utilisation, or honouring preferred affinity rules. If only one node passes filtering, the pod goes there regardless of score.
Pending almost always means every node failed filtering, so scoring is never reached. Your troubleshooting should focus on identifying which filter is eliminating all nodes.
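The two phases can be illustrated with a toy model. This is a minimal sketch, not the real kube-scheduler: the `Node` and `Pod` classes, the three filters, and the "most free CPU wins" score are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                                # unreserved CPU, millicores
    taints: set = field(default_factory=set)       # e.g. {"workload=gpu"}
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    cpu_request_m: int
    tolerations: set = field(default_factory=set)  # taints the pod tolerates
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Hard phase: return (feasible nodes, {node name: rejection reason})."""
    feasible, rejected = [], {}
    for node in nodes:
        if node.free_cpu_m < pod.cpu_request_m:
            rejected[node.name] = "Insufficient cpu"
        elif not node.taints <= pod.tolerations:
            rejected[node.name] = "untolerated taint"
        elif any(node.labels.get(k) != v for k, v in pod.node_selector.items()):
            rejected[node.name] = "didn't match node selector"
        else:
            feasible.append(node)
    return feasible, rejected


def schedule(pod, nodes):
    """Return (chosen node name or None, rejection reasons)."""
    feasible, rejected = filter_nodes(pod, nodes)
    if not feasible:
        return None, rejected          # pod stays Pending; scoring never runs
    # Soft phase (toy score): prefer the node with the most free CPU.
    best = max(feasible, key=lambda n: n.free_cpu_m)
    return best.name, rejected


nodes = [
    Node("node-a", 500),
    Node("node-b", 2000, taints={"workload=gpu"}),
    Node("node-c", 1000),
]
print(schedule(Pod(cpu_request_m=250), nodes))
# ('node-c', {'node-b': 'untolerated taint'})
print(schedule(Pod(cpu_request_m=4000), nodes)[0])
# None
```

The second call shows the Pending case: every node is rejected in the filter phase, so no score is ever computed.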
Taints and Tolerations
A taint is applied to a node and repels pods from being scheduled there. A toleration is applied to a pod and allows it to be scheduled on a tainted node. Think of it as a lock (taint) and key (toleration).
Taint Effects
Every taint has an effect that controls how strictly it is enforced:
NoSchedule
New pods without a matching toleration will not be scheduled on this node. Existing pods are not affected.
PreferNoSchedule
Scheduler will try to avoid placing pods here, but will if no other node is available. A soft version of NoSchedule.
NoExecute
New pods without a toleration are not scheduled, AND existing pods without a toleration are evicted. Kubernetes uses this effect for the taints it adds to not-ready or unreachable nodes.
Example: Dedicated GPU Nodes
Taint a node so only GPU workloads land on it:
# Taint the node
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule
Add a toleration to pods that should run on GPU nodes:
tolerations:
- key: "workload"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
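The matching rule between a toleration and a taint can be sketched in a few lines. This is an illustrative model of the documented semantics (an empty effect matches every effect, `Exists` ignores the value, `Equal` is the default operator); the function names and dictionary shapes are assumptions for this example, not a Kubernetes API.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if this toleration matches this taint."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # Exists with an empty key tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(tolerations: list, taints: list) -> list:
    """Taints that would block scheduling (PreferNoSchedule never blocks)."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "workload", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "workload", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}

print(untolerated_taints([gpu_toleration], [gpu_taint]))  # []
print(untolerated_taints([], [gpu_taint]))                # [gpu_taint], so the node is filtered out
```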
Common Use Cases
- Dedicated infrastructure nodes — taint nodes reserved for monitoring, ingress controllers, or system components
- Spot / preemptible nodes — taint spot nodes so only batch/fault-tolerant workloads land there
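- Node drain: kubectl drain first cordons the node, which adds the node.kubernetes.io/unschedulable:NoSchedule taint, and then evicts pods through the Eviction API
- GPU / specialised hardware: restrict GPU nodes to ML workloads only
Workloads that must run on every node, such as DaemonSets for networking or monitoring agents, need matching NoSchedule tolerations or they won't run on tainted nodes, which can break cluster networking or observability.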
Scenario: Two Node Pools — One Tainted, One Not
This is one of the most common sources of confusion when working with dedicated node pools. Consider this setup:
# Node Pool A — GPU / dedicated workload nodes
kubectl taint nodes -l nodepool=gpu workload=gpu:NoSchedule
# Node Pool B — general purpose nodes (no taint)
# (no taint applied)
Now a pod is created with a toleration for workload=gpu:NoSchedule.
Where does it land?
❌ Common misconception
"The pod has a toleration for the GPU taint, so it will go to Node Pool A."
Wrong. A toleration only removes the repulsion from a tainted node. It does not attract the pod to it. The scheduler will happily place the pod on Node Pool B (no taint, no barrier) if it scores better.
✅ What actually happens
The pod is eligible for both node pools. The scheduler scores every eligible node across both pools and picks the highest-scoring one, so the pod can easily end up on Node Pool B if a node there has more free resources.
To guarantee the pod lands on Node Pool A, you need both a toleration and a node affinity (or nodeSelector):
spec:
  # 1. Toleration: removes the taint barrier (required to enter Node Pool A)
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  # 2. Node affinity: actively pulls the pod to Node Pool A
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodepool
            operator: In
            values:
            - gpu
  containers:
  - name: my-app
    image: my-app:latest
The same logic applies in reverse — if you want a pod to only run on Node Pool B (general nodes, never on GPU nodes), you do not need a toleration. Simply add a node affinity pointing to Node Pool B. The taint on Pool A will naturally repel the pod since it has no toleration.
- Taint the dedicated nodes → repels all general workloads
- Add toleration to dedicated pods → allows entry to tainted nodes
- Add node affinity to dedicated pods → ensures they actually land there
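The three-part pattern above can be checked with a small model of the two pools. This is a sketch under simplified assumptions: pools stand in for nodes, taints are plain tuples, and the pool names `pool-a` and `pool-b` are hypothetical.

```python
# Hypothetical pools mirroring the scenario: A is tainted and labelled gpu, B is neither.
POOLS = {
    "pool-a": {"labels": {"nodepool": "gpu"}, "taints": {("workload", "gpu", "NoSchedule")}},
    "pool-b": {"labels": {"nodepool": "general"}, "taints": set()},
}

GPU_TOLERATION = {("workload", "gpu", "NoSchedule")}


def eligible_pools(tolerations=frozenset(), required_labels=None):
    """Pools that survive filtering for a pod with these tolerations and label rules."""
    required_labels = required_labels or {}
    result = []
    for name, pool in POOLS.items():
        taints_ok = pool["taints"] <= set(tolerations)
        labels_ok = all(pool["labels"].get(k) == v for k, v in required_labels.items())
        if taints_ok and labels_ok:
            result.append(name)
    return result


print(eligible_pools())                                     # ['pool-b']
print(eligible_pools(GPU_TOLERATION))                       # ['pool-a', 'pool-b']
print(eligible_pools(GPU_TOLERATION, {"nodepool": "gpu"}))  # ['pool-a']
```

The middle call is the misconception case: the toleration alone widens the set of eligible pools but does not narrow it to the GPU pool.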
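Managed Kubernetes providers follow the same pattern for specialised node pools, although how the taint gets applied differs. GKE taints GPU node pools automatically, while on AKS you typically add a taint such as sku=gpu:NoSchedule when you create the pool. In both cases a workload that must run there needs a toleration plus a node selector or affinity. The keys and pool names below are examples, so confirm the actual taints and labels with kubectl describe node: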
# AKS GPU node pool example
tolerations:
- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: agentpool
          operator: In
          values:
          - gpunodepool
# GKE GPU node pool example
# GKE taints GPU nodes with nvidia.com/gpu=present:NoSchedule;
# cloud.google.com/gke-accelerator is a node label, not a taint
tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-nodepool
          operator: In
          values:
          - gpu-node-pool
Provider-specific Node Taints
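Managed Kubernetes providers, and Kubernetes itself, taint or label nodes in certain situations. Behaviour varies by provider and version, so confirm with kubectl describe node:
- EKS Spot nodes: labelled eks.amazonaws.com/capacityType=SPOT in managed node groups; no taint is added unless you configure one
- AKS Spot nodes: kubernetes.azure.com/scalesetpriority=spot:NoSchedule (automatic)
- GKE Spot nodes: labelled cloud.google.com/gke-spot=true; add a matching NoSchedule taint yourself if you want to keep other workloads off
- K8s not-ready nodes: node.kubernetes.io/not-ready:NoExecute (automatic)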
Node Affinity
Node affinity allows pods to express rules about which nodes they prefer or require based on node labels. It is a more expressive alternative to the simpler nodeSelector field, which remains supported, and adds operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt.
Required vs Preferred
requiredDuringSchedulingIgnoredDuringExecution
Hard rule. Pod will not schedule if no node matches. Treated as a filter — eliminates non-matching nodes entirely.
preferredDuringSchedulingIgnoredDuringExecution
Soft rule. Scheduler will prefer matching nodes but will place the pod elsewhere if needed. Used in scoring phase.
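IgnoredDuringExecution means that once a pod is scheduled, later changes to node labels do not evict it. A RequiredDuringExecution variant that would evict pods when labels change has been proposed but is not available in stable Kubernetes.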
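How a required node affinity rule is evaluated against a node's labels can be sketched as follows. This is an illustrative model of the documented semantics (terms are ORed, expressions inside a term are ANDed); it ignores matchFields, and the function names are assumptions for this example.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        # Gt and Lt compare integers and take exactly one value.
        if key not in labels:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def matches_required_affinity(labels: dict, node_selector_terms: list) -> bool:
    """Terms are ORed; expressions within one term are ANDed. Empty terms match nothing."""
    return any(
        bool(term.get("matchExpressions"))
        and all(matches_expression(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )


terms = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1a", "us-east-1b"]},
]}]
print(matches_required_affinity({"topology.kubernetes.io/zone": "us-east-1a"}, terms))  # True
print(matches_required_affinity({"topology.kubernetes.io/zone": "us-west-2a"}, terms))  # False
```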
Example: Schedule in a Specific Zone
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
          - us-east-1b
Example: Prefer High-Memory Nodes
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - r5.2xlarge
          - r5.4xlarge
Common Node Labels by Provider
These labels are automatically applied to nodes and are commonly used in affinity rules:
# Standard K8s labels (all providers)
topology.kubernetes.io/zone # e.g. us-east-1a, eastus-1
topology.kubernetes.io/region # e.g. us-east-1, eastus
kubernetes.io/os # linux, windows
kubernetes.io/arch # amd64, arm64
node.kubernetes.io/instance-type # VM size/type
# AWS EKS
eks.amazonaws.com/nodegroup # node group name
eks.amazonaws.com/capacityType # ON_DEMAND or SPOT
# Azure AKS
kubernetes.azure.com/agentpool # node pool name
kubernetes.azure.com/node-image-version
# GCP GKE
cloud.google.com/gke-nodepool # node pool name
cloud.google.com/machine-family # e.g. n2, c2, t2d
Pod Anti-Affinity
While node affinity selects nodes based on node labels, pod anti-affinity makes scheduling decisions based on what pods are already running on a node or in a topology zone. It is the primary mechanism for spreading replicas to avoid a single point of failure.
Example: Spread Replicas Across Nodes
Ensure no two replicas of the same app land on the same node:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-api
      topologyKey: kubernetes.io/hostname
Example: Spread Across Availability Zones (Soft)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-api
        topologyKey: topology.kubernetes.io/zone
Topology Spread Constraints — the modern alternative
For spreading across zones or nodes, Topology Spread Constraints (stable since K8s 1.19) are now preferred over pod anti-affinity. They give more control over how evenly pods are distributed:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-api
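The maxSkew check in a constraint like this can be sketched as simple arithmetic: a domain is allowed if its matching pod count plus the new pod, minus the smallest count across domains, does not exceed maxSkew. This is a simplified illustration that assumes every zone is eligible and ignores options such as minDomains; `allowed_domains` is a hypothetical helper name.

```python
def allowed_domains(counts: dict, max_skew: int) -> list:
    """Domains where placing one more matching pod keeps skew within max_skew.

    counts maps a topology domain (for example a zone) to the number of
    matching pods already running there.
    """
    global_min = min(counts.values())
    return [d for d, n in counts.items() if n + 1 - global_min <= max_skew]


print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 1}, max_skew=1))
# ['zone-c']
print(allowed_domains({"zone-a": 1, "zone-b": 1, "zone-c": 1}, max_skew=1))
# ['zone-a', 'zone-b', 'zone-c']
```

With whenUnsatisfiable: DoNotSchedule, nodes in the disallowed domains are filtered out; with ScheduleAnyway the same skew only lowers their score.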
Using requiredDuringSchedulingIgnoredDuringExecution with pod anti-affinity and topologyKey: kubernetes.io/hostname means your deployment can never have more running replicas than eligible nodes. If you scale to 5 replicas but have only 3 nodes, 2 pods will stay Pending until more nodes are added. Use preferred anti-affinity or Topology Spread Constraints with whenUnsatisfiable: ScheduleAnyway instead.
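The replica ceiling is easy to see in a toy placement loop. This sketch assumes every node is otherwise eligible and models hard per-hostname anti-affinity as "at most one replica per node"; the pod and node names are hypothetical.

```python
def place_with_hard_anti_affinity(replicas: int, nodes: list):
    """Place replicas one per node; return (placements, pending pod names)."""
    free_nodes = list(nodes)
    placed, pending = {}, []
    for i in range(replicas):
        pod = f"my-api-{i}"
        if free_nodes:
            placed[pod] = free_nodes.pop(0)   # node now holds a conflicting pod
        else:
            pending.append(pod)               # every node fails the anti-affinity filter
    return placed, pending


placed, pending = place_with_hard_anti_affinity(5, ["node-1", "node-2", "node-3"])
print(len(placed), pending)
# 3 ['my-api-3', 'my-api-4']
```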
Pod Disruption Budgets (PDB)
A PDB is not a scheduling rule — it is an eviction budget. It tells Kubernetes how many pods of a given workload can be voluntarily disrupted at the same time. It applies during:
- kubectl drain: draining a node for maintenance or upgrade
- Cluster autoscaler scale-down: removing an underutilised node
- Rolling upgrades: managed K8s node pool upgrades (EKS, AKS, GKE)
minAvailable vs maxUnavailable
minAvailable
At least this many pods must remain running during disruption. Can be a number or percentage. Eviction is blocked if it would drop below this.
maxUnavailable
At most this many pods can be unavailable at once. Kubernetes will wait for pods to recover before evicting more.
Example: Allow one pod down at a time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-api
Example: Always keep 80% of pods running
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: my-api
If you set minAvailable: 100% (or maxUnavailable: 0), kubectl drain will hang indefinitely, whatever the replica count, because no eviction can satisfy the PDB. The same happens with minAvailable: 1 and only one replica. Always ensure your PDB allows at least one pod to be disrupted, or scale replicas above the minimum before draining.
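The number of evictions a PDB currently permits can be worked out by hand. The sketch below assumes the documented behaviour that percentages are rounded up to a whole number of pods; the function names are hypothetical and this is not the disruption controller's actual code.

```python
def scaled(value, total: int) -> int:
    """Resolve an int or a percentage string such as "80%" against total, rounding up."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total + 99) // 100   # integer ceiling
    return int(value)


def allowed_disruptions(expected: int, healthy: int, min_available=None, max_unavailable=None) -> int:
    """How many pods may be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        desired_healthy = scaled(min_available, expected)
    else:
        desired_healthy = expected - scaled(max_unavailable, expected)
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(5, 5, min_available="80%"))   # 1
print(allowed_disruptions(3, 3, min_available="80%"))   # 0 (80% of 3 rounds up to 3)
print(allowed_disruptions(1, 1, min_available="100%"))  # 0 (drain would hang)
print(allowed_disruptions(3, 3, max_unavailable=1))     # 1
```

The second line is worth noting: a percentage that looks safe at 5 replicas can silently allow zero disruptions at 3.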
How They Work Together
These mechanisms operate at different stages and can interact in non-obvious ways:
Pod created (Pending)
│
▼
┌─────────────────────────────────────────────────┐
│ FILTERING (hard rules) │
│ 1. Resource requests fit node capacity? │
│ 2. Pod tolerates the node's taints? ◄──────── │── Taints & Tolerations
│ 3. Node matches required node affinity? ◄───── │── Node Affinity (required)
│ 4. No conflicting pod on node/zone? ◄───────── │── Pod Anti-Affinity (required)
│ 5. Volume available in node's zone? │
└──────────────────────┬──────────────────────────┘
│ Nodes remaining after filtering
▼
┌─────────────────────────────────────────────────┐
│ SCORING (soft rules) │
│ - Preferred node affinity weight │
│ - Preferred pod anti-affinity weight │
│ - Resource balancing across nodes │
│ - Topology spread constraints │
└──────────────────────┬──────────────────────────┘
│ Highest scoring node
▼
Pod Bound → Running
│
(Later, during drain/upgrade)
│
▼
┌─────────────────────────────────────────────────┐
│ PDB CHECK (eviction gate) │
│ Can this pod be evicted without violating │
│ minAvailable / maxUnavailable? │
└─────────────────────────────────────────────────┘
A common real-world interaction:
- You have 3 nodes and a deployment with required pod anti-affinity per node
- You scale to 4 replicas → one pod is stuck Pending (no eligible node)
- You try to drain a node for an upgrade → the drain hangs because the PDB requires 3 available pods, only 3 are running, and the fourth is Pending
- Fix: use preferred anti-affinity or add a fourth node before scaling
Replicas and Horizontal Pod Autoscaling (HPA)
Every pod that is created, whether manually via replicas or automatically via HPA, goes through the same scheduler pipeline. Understanding how replica count interacts with scheduling rules, PDB, and node capacity is critical for running reliable workloads.
Setting Replicas Manually
The replicas field in a Deployment, StatefulSet, or ReplicaSet defines how many pod instances Kubernetes should maintain at all times. The scheduler places each replica independently, and each one must pass all filters.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3 # desired number of pods
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api # must match spec.selector.matchLabels
    spec:
      containers:
      - name: my-api
        image: my-api:latest
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
Always set resource requests. Without requests, the scheduler has no information about what the pod needs and treats it as requiring zero resources when checking whether it fits. This leads to nodes being overcommitted and pods being OOMKilled or throttled at runtime.
Horizontal Pod Autoscaler (HPA)
HPA automatically adjusts the replicas field of a workload based on observed metrics, scaling up when load increases and scaling down when load drops. It runs as a control loop, checking metrics every 15 seconds by default.
Scale Up
HPA increases replicas when current metric value exceeds the target. New pods are created and go through the scheduler. If nodes are full, pods go Pending.
Scale Down
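HPA decreases replicas when load drops, after a stabilisation window (5 minutes by default). The workload controller then deletes the surplus pods directly. PDBs are not consulted here, because they only gate evictions made through the Eviction API, such as those issued by kubectl drain.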
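The scale-up and scale-down decisions both come from one documented formula: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and then clamped to minReplicas and maxReplicas. The sketch below is a simplified illustration that ignores stabilisation windows, unready pods, and missing metrics.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current                      # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas, target 70% average CPU, minReplicas 2, maxReplicas 10
print(desired_replicas(4, 90, 70, 2, 10))    # 6  (scale up)
print(desired_replicas(4, 72, 70, 2, 10))    # 4  (within tolerance)
print(desired_replicas(4, 20, 70, 2, 10))    # 2  (scale down)
print(desired_replicas(8, 140, 70, 2, 10))   # 10 (capped at maxReplicas)
```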
Example: CPU-based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale up when avg CPU > 70%
Example: Memory + Custom Metric HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
Resource-metric HPA depends on metrics-server being installed in the cluster. AKS and GKE include it by default; on EKS you may need to install it yourself, for example as an add-on. For custom metrics you need a custom metrics API adapter or KEDA.
KEDA — Event-driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale based on external event sources — queue depth, database row count, Prometheus metrics, and more. It can also scale a deployment to zero when there are no events, which native HPA cannot do (minimum is 1 replica).
# KEDA ScaledObject — scale based on Azure Service Bus queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-worker-scaler
spec:
  scaleTargetRef:
    name: my-worker
  minReplicaCount: 0 # scale to zero when queue is empty
  maxReplicaCount: 20
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: my-queue
      messageCount: "5" # target of roughly 5 messages per replica
      # connection and authentication settings omitted
How HPA Interacts with Scheduling Rules
When HPA scales up, the new pods must pass all the same scheduling filters. This creates several common failure patterns:
HPA scales up → pods Pending
Node capacity is exhausted or anti-affinity prevents placement. HPA keeps trying. Fix: enable Cluster Autoscaler to add nodes, or relax anti-affinity from required to preferred.
HPA + hard anti-affinity = replica ceiling
With required pod anti-affinity per hostname, maxReplicas is effectively capped at node count. Set maxReplicas no higher than your node count, or use preferred.
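HPA scales down → the PDB does not stop it
HPA scale-down lowers the replica count, and the workload controller deletes the surplus pods directly without going through the Eviction API, so the PDB is not consulted. Use minReplicas and the HPA scale-down behaviour settings to protect availability during scale-down; the PDB only protects you during drains and other evictions.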
HPA fights with minReplicas vs PDB minAvailable
If HPA minReplicas: 2 and PDB minAvailable: 2, a single node drain will block because HPA won't go below 2 and PDB won't allow 1 down. Set PDB maxUnavailable: 1 instead.
Recommended: Align HPA, PDB, and Anti-Affinity
# Deployment
replicas: 2 # starting point, HPA will manage this
# HPA
minReplicas: 2 # never go below 2 (HA baseline)
maxReplicas: 10 # headroom for scale-up
# PDB — use maxUnavailable, not minAvailable, to avoid conflicts
maxUnavailable: 1 # allow 1 pod down at a time during drain
# Anti-affinity — use preferred so HPA isn't capped at node count
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  podAffinityTerm:
    topologyKey: kubernetes.io/hostname
Prefer maxUnavailable in the PDB over minAvailable when you have HPA. maxUnavailable: 1 allows one disruption at a time whatever the current replica count is, as long as all replicas are healthy. minAvailable: 2 becomes a problem if HPA ever scales down to exactly 2, because zero disruptions are then allowed.
Troubleshooting: Pod Stuck in Pending
Follow this sequence when a pod won't schedule or a drain won't complete.
Describe the pod — read the Events section
The scheduler writes the reason for failure directly into pod events. This is always your first stop.
kubectl describe pod <pod-name> -n <namespace>
Look for messages like:
- 0/3 nodes are available: 3 node(s) had untolerated taint → taint/toleration mismatch
- 0/3 nodes are available: 3 Insufficient cpu → resource requests too high
- 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector → affinity misconfiguration
- 0/3 nodes are available: 3 node(s) didn't satisfy existing pods anti-affinity rules → too many replicas for available nodes
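When many pods are Pending it can help to tally these reasons programmatically. The following is a rough sketch that assumes the "N/M nodes are available: count reason, count reason." shape shown above; the exact wording varies between Kubernetes versions, and `parse_failed_scheduling` is a hypothetical helper, not part of kubectl.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into per-reason node counts."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if not head:
        raise ValueError("not a FailedScheduling message")
    # Newer versions append a "preemption: ..." suffix; drop it.
    body = head.group(3).split(". preemption:")[0]
    reasons = {}
    for part in body.split(", "):
        m = re.match(r"(\d+) (.+?)\.?$", part.strip())
        if m:
            reasons[m.group(2)] = int(m.group(1))
    return {
        "available": int(head.group(1)),
        "total": int(head.group(2)),
        "reasons": reasons,
    }


msg = "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint."
print(parse_failed_scheduling(msg)["reasons"])
# {'Insufficient cpu': 1, 'node(s) had untolerated taint': 2}
```

The per-reason counts add up to the total node count, which confirms that every node was eliminated by a filter.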
Check node capacity and conditions
Verify nodes are Ready and have available resources.
# Node status and conditions
kubectl get nodes
kubectl describe node <node-name>
# Resource usage across all nodes
kubectl top nodes
# See allocatable vs requested resources
kubectl describe nodes | grep -A 5 "Allocated resources"
Check taints on nodes
List all taints across your nodes to spot mismatches with your pod's tolerations.
# All taints on all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check a specific node
kubectl describe node <node-name> | grep Taint
Check node labels vs affinity rules
Verify the labels your affinity rules reference actually exist on nodes.
# All labels on all nodes
kubectl get nodes --show-labels
# Filter nodes matching a specific label
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a
Check PDB status when drain hangs
If kubectl drain is stuck, check whether a PDB is blocking eviction.
# List all PDBs and their current status
kubectl get pdb -A
# Detailed view — check ALLOWED DISRUPTIONS column
kubectl get pdb -A -o wide
# If ALLOWED DISRUPTIONS is 0, the drain is blocked
# Check how many pods are currently available
kubectl get pods -l app=<app-name> -o wide
If a PDB shows 0 allowed disruptions and the drain is stuck, options are:
- Scale up the deployment so more replicas are available, then retry drain
- Temporarily delete the PDB if urgency requires it (kubectl delete pdb <name>), then recreate it after the drain
- Force deletion with kubectl drain --disable-eviction, which bypasses the PDB but risks disruption
Check cluster events for broader picture
Scheduler failures and eviction events appear in cluster-wide events.
# Recent events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Filter for Warning events only
kubectl get events -A --field-selector type=Warning
Debug HPA not scaling as expected
Check HPA status — it shows current vs desired replicas and the reason it's not scaling.
# Current HPA state
kubectl get hpa -n <namespace>
# Detailed conditions explaining why HPA is or isn't scaling
kubectl describe hpa <hpa-name> -n <namespace>
Common conditions to look for:
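- ScalingLimited: True / TooManyReplicas: the computed replica count exceeds maxReplicas, so HPA is capped at the maximum
- ScalingLimited: False / DesiredWithinRange: the desired count is within minReplicas and maxReplicas, so nothing is capped
- ScalingActive: False / FailedGetResourceMetric: metrics-server is not available or the pod has no resource requests set
If HPA has already raised the replica count but the new pods are Pending, the cause is scheduling (capacity or anti-affinity) and shows up in the pod events rather than in the HPA conditions.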
# Check if metrics-server is running
kubectl get pods -n kube-system | grep metrics-server
# Check current resource usage seen by HPA
kubectl top pods -n <namespace>
Validate the spec with dry-run
A server-side dry-run sends the pod spec through API validation and admission control without persisting it. It does not run the scheduler, so it cannot tell you whether the pod would actually be placed.
kubectl apply -f my-pod.yaml --dry-run=server
To see why the scheduler could not place a pod that is already Pending, read its events:
# Show the scheduler events for the pending pod
kubectl describe pod <pending-pod> | grep -A 20 Events
Key Takeaways
- Pending usually means filtering eliminated all nodes, so the scheduler never reaches scoring. kubectl describe pod will tell you which filter failed.
- Taints are node-level gates. A pod without the right toleration will never land on a tainted node, regardless of available resources.
- Required affinity/anti-affinity is a hard constraint. If the rules can't be satisfied, pods stay Pending. Use preferred unless you truly need a hard requirement.
- Hard pod anti-affinity caps your replica count to the number of matching nodes. Scale your nodes first, or switch to Topology Spread Constraints.
- PDB blocks eviction, not scheduling. A misconfigured PDB (minAvailable: 100%, or minAvailable: 1 with one replica) will silently deadlock a node drain or cluster upgrade.
- Topology Spread Constraints are the modern replacement for pod anti-affinity spreading. Prefer them for zone/node spreading in new workloads.
- Always set resource requests on pods. Utilisation-based HPA cannot function without them, and the scheduler makes poor placement decisions without them.
- HPA + hard anti-affinity creates a hidden replica ceiling equal to your node count. Use preferred anti-affinity or Topology Spread Constraints so HPA can actually scale.
- Use maxUnavailable in PDB instead of minAvailable when HPA is in use. It keeps allowing disruptions at any replica count and avoids deadlocks during drain.
- KEDA extends HPA to scale on external events (queues, topics, DBs) and supports scale-to-zero, something native HPA cannot do by default.