How I Troubleshot a KVM Memory Issue That Led to Swap & High CPU (Runbook + Real Scenario)
Source: Dev.to
Recently, I noticed something strange on one of my KVM hypervisors.
The server wasn’t heavily loaded, but earlier I saw:
- qemu-system-x86 consuming 800 %+ CPU
- kswapd running hot
- Swap usage near 100 %
Later:
- CPU was low
- RAM had plenty free
- Swap was still full
Below is the exact troubleshooting flow I followed — and how you can do the same.
🧠 Environment Context
- Hypervisor: KVM + libvirt
- Host RAM: 314 GB
- Swap: 976 MB
- Multiple VMs running
- Problem VM: testnet-node3
🔍 Step 1 – Identify High‑CPU Process
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10
Output (excerpt)
qemu-system-x86 818%
Note: In Linux, 100 % = 1 CPU core, so 800 % ≈ 8 cores fully used. One VM was heavily consuming CPU.
🔎 Step 2 – Map the PID to a VM
Each VM runs as a qemu-system-x86 process.
ps -fp <PID> # Show the command line for the PID
virsh list --all # List all VMs
virsh dominfo <VM> # Show details for a specific VM
Using the commands above I identified the offending VM as testnet-node3.
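If you prefer a one-liner, the guest name can be pulled straight out of the qemu process's command line, since libvirt passes it via the `-name guest=...` option. The command line below is a hypothetical example of what `ps -fp <PID>` might show; the `sed` extraction is a sketch, not part of the original workflow.

```shell
# Hypothetical qemu command line, as "ps -fp <PID>" might show it
cmdline='qemu-system-x86_64 -name guest=testnet-node3,debug-threads=on -m 98304000'

# Extract the guest name from the -name option (everything up to the first comma)
echo "$cmdline" | sed -n 's/.*guest=\([^,]*\).*/\1/p'
```

This prints `testnet-node3`, which you can then feed to `virsh dominfo`.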
📊 Step 3 – Check Host Memory & Swap
free -h
Mem: 314Gi total
217Gi used
94Gi free
Swap: 976Mi total
963Mi used
Swap was 98 % used while RAM still had 94 GB free – a situation that often causes panic.
🧪 Step 4 – Verify Active Memory Pressure
vmstat 1 5
Focus on the columns:
| Column | Meaning |
|---|---|
| si | pages swapped in |
| so | pages swapped out |
If both are 0, the system is not under active memory pressure.
My output
si = 0
so = 0
Interpretation:
The swap usage was historical; the kernel had swapped out pages earlier, but no current pressure exists.
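This check can be scripted so you don't have to eyeball the columns. The sketch below sums `si` and `so` over five samples; the field positions (7 and 8) assume the default `vmstat` layout from procps, so verify against your own header line first.

```shell
# Sum the si (field 7) and so (field 8) columns over 5 one-second samples;
# NR > 2 skips vmstat's two header lines
vmstat 1 5 | awk 'NR > 2 { si += $7; so += $8 }
  END { if (si + so == 0) print "no active swapping"; else print "active swapping" }'
```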
🔥 Why Swap Can Stay Full Even With Free RAM
Linux does not automatically move swapped pages back into RAM unless they are needed. A previous memory‑pressure event can leave swap full long after RAM becomes available – this is normal behavior.
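If the lingering swap bothers you, one common way to reclaim it is to cycle swap off and on, which forces the swapped pages back into RAM. Only do this when free RAM comfortably exceeds swap in use; the `awk` pre-check below is a sketch of that sanity test, not part of the original run.

```shell
# Before draining swap, confirm free RAM (MiB) exceeds swap in use
free -m | awk '/^Mem:/  { freemem = $4 }
               /^Swap:/ { used = $3 }
               END { if (freemem > used) print "safe to drain"; else print "not enough free RAM" }'

# If safe, cycle swap (pages are pulled back into RAM):
# sudo swapoff -a && sudo swapon -a
```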
🧠 Step 5 – Inspect VM Memory Allocation
virsh dominfo testnet-node3
Max memory: 98304000 KiB
98304000 KiB ≈ 94 GB → the VM had ~94 GB allocated.
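The conversion is easy to reproduce with shell arithmetic. Note that 98304000 KiB is exactly 93.75 GiB, which the rounded "≈ 94 GB" above reflects; integer division truncates the fraction.

```shell
# Convert a virsh KiB figure to GiB with shell arithmetic
kib=98304000
echo "$(( kib / 1024 / 1024 )) GiB"   # integer division truncates 93.75 GiB to 93
```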
❓ Was the VM Actually Memory‑Starved?
Check inside the guest:
free -h
vmstat 1 5
If the guest shows:
- Swap used
- OOM‑killer messages
- Memory > 90 % used
then increasing RAM makes sense. If not, the high CPU may be workload‑related rather than a memory shortage.
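A quick way to check the OOM-killer point is to grep the guest's kernel log. This is a generic check, not a command from the original session; on systemd guests `journalctl -k` works in place of `dmesg`.

```shell
# Inside the guest: scan the kernel log for OOM-killer activity
dmesg -T 2>/dev/null | grep -iE 'out of memory|oom-killer' \
  || echo "no OOM events found"
```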
🚀 Step 6 – Increase VM RAM Safely
The VM was stopped first (virsh shutdown testnet-node3); target RAM = 128 GB.
# 128 GB in KiB (for reference — the virsh commands below accept the G suffix directly)
128 * 1024 * 1024 = 134217728 KiB
virsh setmaxmem testnet-node3 128G --config
virsh setmem testnet-node3 128G --config
Verify:
virsh dominfo testnet-node3
Start the VM:
virsh start testnet-node3
📊 Step 7 – Verify Host Stability After Resize
free -h
Mem: 314Gi total
221Gi used
89Gi free
Swap: 0B used
Swap cleared.
vmstat 1 5
si = 0, so = 0, CPU idle high → system healthy.
🧩 Root‑Cause Pattern
- VM workload spikes → guest consumes heavy memory
- Host experiences memory pressure → swap fills, kswapd CPU spikes
- qemu process shows high CPU (driven by the guest)
- After workload stabilises, swap remains full (no active pressure)
Without checking vmstat, it's easy to misdiagnose the lingering swap as an active memory problem.
🛑 Common Mistakes
- ❌ Increasing host RAM without checking guest usage
- ❌ Assuming 100 % swap = a dying system
- ❌ Ignoring vmstat output
- ❌ Allocating 100 % of host RAM to VMs
📐 Capacity‑Planning Rule for KVM Hosts
- On large‑memory hosts (e.g., 314 GB), leave 16–32 GB for the host OS.
- Never allocate 100 % of RAM to guests.
- Monitor swap regularly; a small swap (1–4 GB) is sufficient for large‑RAM systems.
🧠 Pro Tips
# List memory allocation for all running VMs (with the VM name shown)
virsh list --name | while read -r vm; do
  [ -n "$vm" ] && echo "== $vm ==" && virsh dominfo "$vm" | grep -i memory
done
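To apply the capacity-planning rule above, it helps to total the allocations and compare against host RAM. The sketch below sums "Max memory" across all defined VMs; the `awk` field split assumes `virsh dominfo`'s usual `Max memory: N KiB` output format.

```shell
# Sum "Max memory" (KiB) across all defined VMs
virsh list --all --name | while read -r vm; do
  [ -n "$vm" ] && virsh dominfo "$vm" | awk '/Max memory/ { print $3 }'
done | awk '{ kib += $1 } END { printf "Total allocated: %.1f GiB\n", kib / 1024 / 1024 }'
```

Compare the total against host RAM minus the 16–32 GB you reserve for the host OS.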
# See if swapping is active
vmstat 1
# Find the process that consumes most memory
ps -eo pid,comm,%mem,%cpu --sort=-%mem | head
🎯 Final Takeaway
Swap usage alone does NOT equal a memory problem.
Key indicators of a real issue are:
- Active swap in/out (vmstat)
- OOM events in the host or guest logs
- Sustained high CPU from kswapd
- Guest-level memory pressure
In my case, the VM’s memory allocation was already sufficient; the high CPU was workload‑related, and the lingering swap was simply a remnant of a past pressure event.
Memory Upgrade Summary
Scaled from 94 GB → 128 GB
- Host remained healthy
- No swap pressure
- System stable
If you’re running KVM in production, understanding this memory + swap + CPU interaction is critical.
Blindly adding RAM is easy.
Diagnosing correctly is what makes you a good systems engineer.