Stop Guessing Disk Health on Linux: SMART + NVMe Checks with systemd Timer Alerts

Published: March 7, 2026 at 12:01 AM EST
Source: Dev.to

Overview

Your backups can be perfect and your services hardened, but if storage health drifts silently you can still lose weekends (and sometimes data).
This guide provides a practical, auditable disk‑health workflow on Linux:

  • Scan ATA/SATA/SAS/NVMe devices
  • Run health checks with smartctl
  • Pull NVMe telemetry with nvme smart-log
  • Fail loudly in systemd/journald when something is wrong
  • Schedule checks with a persistent timer

No dashboards required—just reliable signals.

Installation

# Debian/Ubuntu
sudo apt update
sudo apt install -y smartmontools nvme-cli jq

# Fedora/CentOS
sudo dnf install -y smartmontools nvme-cli jq

  • smartmontools provides smartctl and smartd.
  • nvme-cli provides the nvme command.
  • jq is used to parse JSON output from nvme smart-log.

Enumerate Devices

sudo smartctl --scan-open

Typical output:

/dev/sda -d sat # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

Keep the -d type from the scan output and pass it to later smartctl calls; stating the type explicitly avoids ambiguous probing on some controllers.
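
As a minimal sketch of carrying the -d type forward, the snippet below turns scan lines into "device type" pairs. The helper name parse_scan is hypothetical, and the sample input is simply the typical output shown above; real controllers may print additional fields.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_scan is a hypothetical helper: it reads `smartctl --scan-open`
# style lines on stdin and prints "device type" pairs, skipping lines
# that are empty or comment-only.
parse_scan() {
  local line dev flag dtype
  while IFS= read -r line; do
    line="${line%%#*}"                      # drop the trailing "# ..." description
    read -r dev flag dtype _ <<<"$line" || true
    [[ "$flag" == "-d" && -n "$dtype" ]] || continue
    printf '%s %s\n' "$dev" "$dtype"
  done
}

# Sample input copied from the typical output above. On a real host:
#   sudo smartctl --scan-open | parse_scan
sample='/dev/sda -d sat # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device'

pairs=$(printf '%s\n' "$sample" | parse_scan)
printf '%s\n' "$pairs"
```

Each pair can then be passed straight to smartctl as `-d "$dtype" "$dev"`, which is how the health-check script uses it.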

Health‑Check Script

Save the following as /usr/local/sbin/check-disk-health.sh and make it executable.

#!/usr/bin/env bash
set -euo pipefail

LOG_TAG="disk-health-check"
RC=0

log() {
  systemd-cat -t "$LOG_TAG" echo "$*"
}

# Sets RC=1 when smartctl reports any warning or failure bit for the device.
check_smart() {
  local dev="$1"
  local dtype="$2"
  local report c

  # -H overall health, -A attributes, -l error/selftest logs.
  # smartctl's exit status is a bitmask, so any non-zero value is treated as a warning.
  # The report is captured in a variable rather than a predictable file under /tmp.
  if report=$(smartctl -H -A -l error -l selftest -d "$dtype" "$dev" 2>&1); then
    log "OK SMART: $dev ($dtype)"
  else
    c=$?
    log "WARN SMART: $dev ($dtype) exit=$c"
    log "DETAIL SMART: $(printf '%s\n' "$report" | tail -n 5 | tr '\n' ' ' | sed 's/  */ /g')"
    RC=1
  fi
}

check_nvme() {
  local dev="$1"
  local out cw temp_k used

  if out=$(nvme smart-log "$dev" -o json 2>/dev/null); then
    cw=$(printf '%s' "$out" | jq -r '.critical_warning // 0')
    temp_k=$(printf '%s' "$out" | jq -r '.temperature // empty')
    used=$(printf '%s' "$out" | jq -r '.percentage_used // empty')

    if [[ "$cw" != "0" ]]; then
      log "WARN NVMe: $dev critical_warning=$cw percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
      RC=1
    else
      log "OK NVMe: $dev percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
    fi
  else
    log "WARN NVMe: failed to read smart-log for $dev"
    RC=1
  fi
}

main() {
  # Report missing tools on stderr so the reason shows up in the journal as an error.
  command -v smartctl >/dev/null || { echo "smartctl missing" >&2; exit 2; }
  command -v nvme >/dev/null || { echo "nvme-cli missing" >&2; exit 2; }
  command -v jq >/dev/null || { echo "jq missing (install jq)" >&2; exit 2; }

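  # Assumed scan step: keep "device -d type" and drop the trailing "# ..." comment.
  mapfile -t scanned < <(smartctl --scan-open | sed 's/[[:space:]]*#.*$//')

  local entry dev dtype
  for entry in "${scanned[@]}"; do
    [[ -n "$entry" ]] || continue
    read -r dev _ dtype _ <<<"$entry"
    check_smart "$dev" "$dtype"
    if [[ "$dtype" == "nvme" ]]; then
      check_nvme "$dev"
    fi
  done

  exit "$RC"
}

main "$@"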

  • The script logs results to systemd-journald with the tag disk-health-check.
  • It exits with a non‑zero status if any device reports warnings.
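
The exit value that check_smart logs is a bitmask, so a single number can carry several findings at once. The sketch below decodes it using the bit meanings listed in smartctl(8); the helper name describe_smartctl_status is hypothetical and the labels are abbreviated paraphrases of the man page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# describe_smartctl_status is a hypothetical helper that decodes the
# smartctl exit status bitmask (bits 0-7) into short labels.
describe_smartctl_status() {
  local status="$1" bit
  local -a labels=(
    "command line did not parse"
    "device open failed"
    "SMART command failed or checksum error"
    "SMART status: DISK FAILING"
    "prefail attribute at or below threshold"
    "attribute was at or below threshold in the past"
    "error log contains errors"
    "self-test log contains errors"
  )
  if (( status == 0 )); then
    echo "ok"
    return 0
  fi
  for bit in "${!labels[@]}"; do
    if (( status & (1 << bit) )); then
      printf 'bit %d: %s\n' "$bit" "${labels[bit]}"
    fi
  done
}

describe_smartctl_status 0
describe_smartctl_status 72   # 64 + 8: bits 6 and 3 are set
```

One way to use it would be to append the decoded labels to the WARN SMART log line, so the journal entry explains itself without a lookup.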

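To schedule the check with a persistent timer, one possible setup is a oneshot service plus a daily timer. This is a sketch under assumptions: the unit names disk-health-check.service and disk-health-check.timer, the daily schedule, and the staging directory systemd-units are illustrative choices, not taken from the original script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: units are staged in a local directory first so they can be
# reviewed before being installed; override UNIT_DIR to change the location.
UNIT_DIR="${UNIT_DIR:-./systemd-units}"
mkdir -p "$UNIT_DIR"

# Oneshot service: a non-zero exit from the script marks the unit as failed.
cat >"$UNIT_DIR/disk-health-check.service" <<'EOF'
[Unit]
Description=Disk health check (SMART + NVMe)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-disk-health.sh
EOF

# Persistent=true runs a missed check at the next boot if the machine was off.
cat >"$UNIT_DIR/disk-health-check.timer" <<'EOF'
[Unit]
Description=Run the disk health check daily

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target
EOF

# After reviewing the files, install and enable them (as root):
#   sudo install -m 0644 "$UNIT_DIR"/disk-health-check.* /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now disk-health-check.timer
# Then inspect results with:
#   systemctl --failed
#   journalctl -t disk-health-check
```

Because the script exits non-zero on any warning, a bad run leaves the service in the failed state, which is what makes the problem visible in systemctl and the journal.
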
References

  • systemd.timer(5)
  • nvme-smart-log(1)
  • Debian smartmontools package details
  • ArchWiki S.M.A.R.T. operational notes