Stop Guessing Disk Health on Linux: SMART + NVMe Checks with systemd Timer Alerts
Source: Dev.to
Overview
Your backups can be perfect and your services hardened, but if storage health drifts silently you can still lose weekends (and sometimes data).
This guide provides a practical, auditable disk‑health workflow on Linux:
- Scan ATA/SATA/SAS/NVMe devices
- Run health checks with smartctl
- Pull NVMe telemetry with nvme smart-log
- Fail loudly in systemd/journald when something is wrong
- Schedule checks with a persistent timer
No dashboards required—just reliable signals.
Installation
# Debian/Ubuntu
sudo apt update
sudo apt install -y smartmontools nvme-cli jq
# Fedora/CentOS
sudo dnf install -y smartmontools nvme-cli jq
- smartmontools provides smartctl and smartd.
- nvme-cli provides the nvme command.
- jq is used to parse JSON output from nvme smart-log.
Enumerate Devices
sudo smartctl --scan-open
Typical output:
/dev/sda -d sat # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
Keep the -d type from the scan output; it avoids ambiguous probing on some controllers.
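To carry the device type forward, each scan line can be split into its device path and its -d value. The following is a minimal sketch, assuming the scan output keeps the "DEVICE -d TYPE # comment" shape shown above; parse_scan is a hypothetical helper name, not part of smartmontools.

```shell
#!/usr/bin/env bash
# parse_scan: read `smartctl --scan-open` style lines on stdin and print
# "DEVICE TYPE" pairs. Hypothetical helper; assumes "DEV -d TYPE # comment".
parse_scan() {
    local dev flag dtype _rest
    while read -r dev flag dtype _rest; do
        # Skip blank lines, commented-out entries, and lines without "-d TYPE".
        [[ "$dev" == /dev/* && "$flag" == "-d" && -n "$dtype" ]] || continue
        printf '%s %s\n' "$dev" "$dtype"
    done
}

# Usage on a live system (needs root):
#   sudo smartctl --scan-open | parse_scan
```

Lines that begin with # are skipped, so only entries in the expected shape are passed on.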
Health‑Check Script
Save the following as /usr/local/sbin/check-disk-health.sh and make it executable.
#!/usr/bin/env bash
set -euo pipefail
LOG_TAG="disk-health-check"
RC=0
log() {
    systemd-cat -t "$LOG_TAG" echo "$*"
}
# Runs smartctl for one device and sets RC=1 when it exits non-zero
# (smartctl's exit status is a bitmask of warning/failure bits).
check_smart() {
    local dev="$1"
    local dtype="$2"
    local logfile="/tmp/smart-${dev##*/}.log"
    # -H overall health, -A attributes, -l error/selftest logs
    if smartctl -H -A -l error -l selftest -d "$dtype" "$dev" >"$logfile" 2>&1; then
        log "OK SMART: $dev ($dtype)"
    else
        local c=$?
        log "WARN SMART: $dev ($dtype) exit=$c"
        # Join the last few lines into one and squeeze repeated spaces.
        log "DETAIL SMART: $(tail -n 5 "$logfile" | tr '\n' ' ' | tr -s ' ')"
        RC=1
    fi
}
check_nvme() {
    local dev="$1"
    local out cw temp_k used
    if out=$(nvme smart-log "$dev" -o json 2>/dev/null); then
        cw=$(printf '%s' "$out" | jq -r '.critical_warning // 0')
        temp_k=$(printf '%s' "$out" | jq -r '.temperature // empty')
        used=$(printf '%s' "$out" | jq -r '.percentage_used // empty')
        if [[ "$cw" != "0" ]]; then
            log "WARN NVMe: $dev critical_warning=$cw percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
            RC=1
        else
            log "OK NVMe: $dev percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
        fi
    else
        log "WARN NVMe: failed to read smart-log for $dev"
        RC=1
    fi
}
main() {
    command -v smartctl >/dev/null || { echo "smartctl missing"; exit 2; }
    command -v nvme >/dev/null || { echo "nvme-cli missing"; exit 2; }
    command -v jq >/dev/null || { echo "jq missing (install jq)"; exit 2; }
    mapfile -t scanned
}
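The listing stops at mapfile -t scanned, so the rest of main is not shown. What follows is only a guess at how the scan results might be handed to the two check functions. The dispatch name, the stubbed checks, and the way scan lines are split are all assumptions, not the author's code.

```shell
#!/usr/bin/env bash
# Hypothetical continuation sketch. The two checks are stubbed so the loop
# can be exercised without real disks; in the article's script they would be
# the check_smart and check_nvme functions defined earlier.
RC=0
check_smart() { echo "smart $1 $2"; }
check_nvme()  { echo "nvme $1"; }

# dispatch: takes scan lines ("DEV -d TYPE # comment") as arguments and
# calls the matching checks for each device.
dispatch() {
    local line dev flag dtype _rest
    for line in "$@"; do
        read -r dev flag dtype _rest <<<"$line"
        [[ "$dev" == /dev/* && "$flag" == "-d" && -n "$dtype" ]] || continue
        check_smart "$dev" "$dtype"
        # NVMe devices additionally get the smart-log telemetry check.
        if [[ "$dtype" == nvme* ]]; then
            check_nvme "$dev"
        fi
    done
}

# Inside main, this might be wired up as (assumption):
#   mapfile -t scanned < <(smartctl --scan-open)
#   dispatch "${scanned[@]}"
#   exit "$RC"
```

Reading the scan into an array first keeps the loop in the current shell, so an RC=1 set by a check is still visible when the script exits. Piping the scan straight into a while loop would run the loop in a subshell and lose that value.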
- The script logs results to systemd-journald with the tag disk-health-check.
- It exits with a non-zero status if any device reports warnings.
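For the scheduling step, a service and timer pair along these lines is one option. This is a sketch under assumptions: the unit names disk-health-check.service and disk-health-check.timer, the daily schedule, and the write_units helper are illustrative and not taken from the article. Persistent=true is the systemd.timer(5) setting that triggers a missed run once the machine is back up.

```shell
#!/usr/bin/env bash
# write_units DIR: write a hypothetical service/timer pair into DIR.
# Unit names and schedule are assumptions chosen for illustration.
write_units() {
    local dir="$1"
    cat >"$dir/disk-health-check.service" <<'EOF'
[Unit]
Description=Disk health check (SMART + NVMe)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-disk-health.sh
EOF
    cat >"$dir/disk-health-check.timer" <<'EOF'
[Unit]
Description=Run the disk health check daily

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target
EOF
}

# Typical installation (assumption, needs root):
#   write_units /etc/systemd/system
#   systemctl daemon-reload
#   systemctl enable --now disk-health-check.timer
```

With a oneshot service, a non-zero exit from the script marks the unit as failed, so the problem shows up in systemctl --failed, while the per-device messages can be read with journalctl -t disk-health-check.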
References
- systemd.timer(5)
- nvme-smart-log(1)
- Debian smartmontools package details
- ArchWiki S.M.A.R.T. operational notes