New Relic - CPU usage (%) and Load Average

Published: (December 3, 2025 at 06:39 PM EST)
4 min read
Source: Dev.to

Background

At AWS re:Invent 2025 I gave a brief demonstration at the New Relic booth. In Linux environments administrators often rely on CPU usage and load‑average metrics to decide whether an instance is appropriately sized. An oversized instance that sits idle wastes resources and drives up the cloud bill, while an undersized instance pushed to its limits can degrade application performance and impact revenue.

Demo Setup

  1. CPU‑bound task – ran the yes command, which continuously writes the character “y” to its output.

  2. I/O‑bound workload – ran fio with 32 threads performing synchronous, direct I/O disk writes:

    fio --rw=write --ioengine=psync --direct=1 --bs=1M \
        --numjobs=32 --name=test --filename=/tmp/x \
        --size=10G --thread

Both tasks displayed 100 % CPU usage on the New Relic Summary tab, even though only the yes command is expected to saturate the CPU; the fio workload is I/O‑bound and spends most of its time waiting on the disk.

Observations

  • The load‑average graph showed a load of 32, as if 32 threads were on the CPU, but provided no further detail.
  • The Process tab listed yes with low CPU usage and did not show fio.
  • The missing detail was the process state:
    • R (running) – real CPU usage (e.g., the 12.5 % shown for yes).
    • D (uninterruptible) – processes waiting on I/O; they do not consume CPU even though they contribute to load average.
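
To make the R versus D distinction concrete, here is a minimal sketch that counts Linux tasks per state by reading the /proc filesystem. It is not part of the demo and does not use any New Relic API; the helper names parse_state and count_states are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count tasks (threads) per state via /proc/<pid>/task/<tid>/stat."""
    counts = Counter()
    for stat_file in Path(proc_root).glob("[0-9]*/task/[0-9]*/stat"):
        try:
            counts[parse_state(stat_file.read_text())] += 1
        except (OSError, IndexError):
            continue  # the task exited while we were scanning
    return counts


if __name__ == "__main__":
    states = count_states()
    print("R (running or runnable):", states.get("R", 0))
    print("D (uninterruptible):   ", states.get("D", 0))
```

Run during a workload like the fio job above, a scan of this kind would be expected to show most of the threads in D rather than R, which is the detail the load‑average graph alone does not reveal.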

Understanding CPU Metrics

Opening the query editor and replacing cpuPercent with the individual metrics gives a clearer picture:

SELECT average(cpuUserPercent),     -- running (user space)
       average(cpuSystemPercent),   -- running (kernel space)
       average(cpuStealPercent),    -- hypervisor
       average(cpuIdlePercent),     -- idle
       average(cpuIOWaitPercent)    -- idle (incl. wait I/O)
FROM SystemSample
WHERE entityGuid = 'NzM2NzA3MXxJTkZSQXxOQXw0MDAyNTY2MTYyODgwNDkyMzM0'
TIMESERIES AUTO SINCE 5 minutes ago UNTIL now

  • The 100 % CPU usage for the yes command was mostly reported as Steal because the VM was over‑provisioned and the hypervisor allocated only a quarter of the CPU cycles.
  • The 100 % CPU usage for fio appeared as IO Wait: the threads were waiting for I/O completion rather than actually running on the CPU.

When yes was started again while fio was still running, the IO Wait metric disappeared, confirming that IO Wait is accounted only while the CPU is idle and a process is blocked on an I/O call. If another process runs, the IO Wait time is no longer counted.
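
The same breakdown can be approximated on the host itself from the tick counters on the first line of /proc/stat. The sketch below is an illustration only: the function names are hypothetical, and the grouping of counters into user, system, steal, iowait and idle is an assumption about how such percentages are typically derived, not a description of how the New Relic agent computes its metrics.

```python
import os
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line into named tick counters."""
    values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def breakdown(before: dict, after: dict) -> dict:
    """Return each category's share of elapsed CPU time, in percent."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values()) or 1  # avoid dividing by zero
    return {
        "user": 100 * (delta["user"] + delta["nice"]) / total,
        "system": 100 * (delta["system"] + delta["irq"]
                         + delta["softirq"]) / total,
        "steal": 100 * delta["steal"] / total,
        "iowait": 100 * delta["iowait"] / total,
        "idle": 100 * delta["idle"] / total,
    }


def read_cpu_counters(path: str = "/proc/stat") -> dict:
    """Read the current counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_counters()
    time.sleep(1)
    print(breakdown(first, read_cpu_counters()))
```

Because iowait is counted only for ticks in which the CPU would otherwise be idle, starting a CPU‑bound process alongside an I/O‑bound one moves those ticks from iowait to user, which matches what the demo showed.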

Processor State vs. Process State

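| Processor state     | Metric                            | Description                                                    |
|---------------------|-----------------------------------|----------------------------------------------------------------|
| R (running)         | cpuUserPercent / cpuSystemPercent | Process executing in user space or kernel space.               |
| Steal               | cpuStealPercent                   | CPU cycles taken by the hypervisor (virtualized environments). |
| D (uninterruptible) | cpuIOWaitPercent                  | Process blocked on I/O; CPU is idle.                           |
| Idle                | cpuIdlePercent                    | No runnable processes; CPU idle.                               |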

Load‑Average Details

The Linux kernel computes the global load average as an exponentially decaying average of:

nr_running + nr_uninterruptible

Excerpt from kernel/sched/loadavg.c:

/* The global load average is an exponentially decaying average of
 * nr_running + nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *       nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n);
 */
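
The update rule in that comment can be illustrated with a small floating‑point simulation. This is a sketch under stated assumptions, not the kernel implementation: the kernel uses fixed‑point arithmetic, the function names here are hypothetical, and the 5‑second interval stands in for LOAD_FREQ. Each window is updated from its own previous value.

```python
import math

LOAD_FREQ_SECONDS = 5  # assumed sampling interval, standing in for LOAD_FREQ


def decay_factor(window_seconds: float) -> float:
    """exp_n for an averaging window of 60, 300 or 900 seconds."""
    return math.exp(-LOAD_FREQ_SECONDS / window_seconds)


def update(load: float, nr_active: int, exp_n: float) -> float:
    """One step of the rule: load * exp_n + nr_active * (1 - exp_n)."""
    return load * exp_n + nr_active * (1 - exp_n)


def simulate(nr_active: int, seconds: int,
             window_seconds: float = 60.0) -> float:
    """Load average after `seconds` of a constant nr_active, from zero."""
    load = 0.0
    exp_n = decay_factor(window_seconds)
    for _ in range(seconds // LOAD_FREQ_SECONDS):
        load = update(load, nr_active, exp_n)
    return load


if __name__ == "__main__":
    # 32 tasks that are running or in uninterruptible sleep, as in the demo
    for elapsed in (60, 300, 900):
        print(elapsed, "s:", round(simulate(32, elapsed), 2))
```

With a constant 32 active tasks, the 1‑minute figure in this model reaches about 63 % of 32 after one minute and approaches 32 only after several minutes, regardless of whether those tasks are in R or D.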

A comment in the same file humorously notes:

/*
 * kernel/sched/loadavg.c
 *
 * This file contains the magic bits required to compute the global loadavg
 * figure. It's a silly number but people think it's important. We go through
 * great pains to make it work on big machines and tickless kernels.
 */

Historically, load average (originating in the 1970s) reflected how many processes were in the run queue. On a single‑CPU system, a load average of 1 meant one process running or waiting, which people equated with 100 % CPU usage. Modern multi‑core, tickless kernels make this metric far less reliable as an indicator of real‑time CPU utilization.

Uninterruptible State (D) Is Not Always Disk‑I/O

  • A thread that submits asynchronous disk I/O and later collects the completed operations does not enter the D state while the I/O is in flight.
  • Changing the fio job from the psync engine to an asynchronous I/O engine produced identical throughput but reduced IO Wait and lowered the load average.
  • Some system calls appear as IO Wait and increase the load average even when they are harmless (e.g., launching many processes that immediately sleep).

Recommendations for New Relic Dashboard

  1. Replace cpuPercent with the detailed metrics:

    • cpuUserPercent and cpuSystemPercent – actual work done by applications and the kernel.
    • cpuStealPercent – time stolen by the hypervisor (virtualized environments).
    • cpuIOWaitPercent – CPU idle while processes wait on I/O.
    • cpuIdlePercent – true idle time.
  2. Interpret load average with caution – it includes both running and uninterruptible tasks and does not directly map to CPU usage on modern systems.

  3. Focus on user and system CPU percentages when sizing instances, especially in cloud environments where minimizing idle CPU time reduces cost.

    • A 100 % CPU usage shown for an I/O‑bound workload (as in the fio demo) does not indicate that the instance is undersized; it often reflects high IO wait rather than actual CPU saturation.

By understanding the distinction between processor states and process states, and by using the granular New Relic metrics, you can make more accurate capacity‑planning decisions and avoid misinterpreting load‑average or CPU‑usage graphs.
