Database Backup Fidelity: Why Crash-Consistent Is Not a Database Backup
Source: Dev.to
App‑Consistent vs. Crash‑Consistent Database Backups
An app‑consistent backup is the difference between a database that restores cleanly and a recovery that fails under pressure.
Backup policies are designed by architects and discovered by engineers during recovery. Most enterprise environments have backup schedules running, retention policies configured, and dashboards showing green. What most have never validated is the consistency level those backups are actually capturing.
That question gets answered — usually under pressure — when a DBA attempts to restore a production database and discovers the backup represents a storage snapshot taken mid‑transaction.
The two consistency models
| Aspect | Crash‑Consistent | App‑Consistent |
|---|---|---|
| When the copy is taken | No coordination with the database engine; the snapshot fires on whatever is on disk at that moment. | The database engine is quiesced first – buffer pool flushed, in‑flight transactions completed or rolled back, writes paused. |
| State of the data | Open transactions are mid‑flight; WAL may not be flushed; dirty pages may still be in memory. | The database is in a known‑good state; no dirty pages or uncommitted work remain. |
| Recovery requirement | Relies on the engine’s crash‑recovery mechanisms (WAL replay, redo/undo, log rollback). | No special recovery mechanisms needed – the database mounts cleanly. |
| Risk location | Recovery risk is shifted to restore time. | Risk is mitigated at backup time. |
| Typical tooling | VM‑level snapshots (hypervisor) that capture the whole VM without inside knowledge. | VSS (Windows) or pre/post‑freeze scripts (Linux) plus a database‑aware agent. |
Bottom line: Crash‑consistent backups look complete from a storage perspective but may be incomplete from a database perspective. App‑consistent backups are complete from both perspectives.
Why most environments end up with crash‑consistent backups
- VM snapshot tooling prioritises speed – hypervisor snapshots capture the entire VM without knowledge of what’s running inside.
- Backup vendors optimise for coverage – a single policy covering hundreds of VMs is attractive, but it is applied uniformly to both application and database VMs, producing crash‑consistent backups for the databases.
- Integration work is deferred – installing agents, configuring credentials, and validating quiesce steps are seen as “nice‑to‑have” and get postponed under time pressure.
- Operators rely on transaction logs – they assume the logs will fill the gap, but this transfers risk rather than eliminating it. The assumption fails when the log chain is broken, logs reside on a separate volume not captured by the snapshot, or the recovery environment runs a different engine version.
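The "logs on a separate volume" failure mode in the last bullet can be checked before a recovery event. The following is a minimal sketch, assuming a POSIX `df`; the helper names `mount_point_of` and `same_volume` and the example paths are illustrative and not part of any backup product. A shared mount point does not prove that the snapshot policy covers it, but two different mount points are a reason to confirm that both are in scope.

```shell
#!/bin/sh
# Print the mount point of the filesystem that holds a path (POSIX `df -P`).
# Assumes mount points without spaces, since only the last field is printed.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed (exit 0) only when both paths resolve to the same mount point.
# Fails if either path cannot be resolved, so a typo never reads as "same".
# same_volume is a hypothetical helper name.
same_volume() {
    first=$(mount_point_of "$1")
    second=$(mount_point_of "$2")
    [ -n "$first" ] && [ -n "$second" ] && [ "$first" = "$second" ]
}

# Example usage (paths are assumptions; substitute your own):
# same_volume /var/lib/mysql /var/log/mysql \
#     || echo "WARNING: data and logs are on different filesystems"
```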
What the dashboard does and doesn’t show
| Item | ✅ Shows | ❓ Does Not Show |
|---|---|---|
| Backups | Running | Consistency level |
| Schedule | Configured | Whether quiescing was triggered |
| Retention | Set | Whether transaction logs are included |
| Last Job | Successful | Whether the agent is active and connected |
| Failures | None (last 60 days) | Whether a restore has ever been tested |
The dashboard measures job completion, not recoverability.
Five questions that determine whether a database backup is actually recoverable
- Does the backup trigger database quiescing?
- Is a database agent installed and active?
- Are transaction logs included in the backup?
- Is application‑aware backup confirmed in the job log – not just configured in the policy?
- Have restores been tested at the database layer?
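The fourth question, whether the job log confirms application‑aware processing, lends itself to a simple automated check. The sketch below assumes the job log is available as a text file; the function name `job_log_confirms_app_aware` is hypothetical, and both marker strings must be supplied by the operator because every backup product words these messages differently.

```shell
#!/bin/sh
# job_log_confirms_app_aware LOG_FILE SUCCESS_MARKER FALLBACK_MARKER
# Exit 0 only if the log contains the success marker and does NOT contain
# the fallback marker. Exit 2 if the log cannot be read.
# Markers are matched as fixed strings, not regular expressions.
job_log_confirms_app_aware() {
    log_file=$1
    success_marker=$2
    fallback_marker=$3

    [ -r "$log_file" ] || return 2
    grep -q -F -- "$success_marker" "$log_file" || return 1
    if grep -q -F -- "$fallback_marker" "$log_file"; then
        return 1
    fi
    return 0
}

# Example usage (file name and marker text are assumptions):
# job_log_confirms_app_aware job.log "quiesce succeeded" "crash-consistent" \
#     || echo "Backup was not confirmed as app-consistent"
```

Checking for the absence of a fallback message matters as much as the presence of a success message, since a job can log both.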
Engine‑specific crash‑consistent behaviour
| Engine | Crash‑Consistent Behaviour | Recovery Dependency | Risk |
|---|---|---|---|
| SQL Server | Data files captured mid‑transaction. Crash recovery runs on attach – rolls back uncommitted work, replays committed work from the log. | Transaction log must be intact. If logs are on a separate volume not in the snapshot, recovery fails. | Medium (logs included) / High (logs separate) |
| PostgreSQL | Heap files and WAL may be inconsistent. WAL replay runs on startup. | WAL files must be complete from the snapshot point. Missing segments = unrecoverable. | High |
| MySQL / MariaDB | InnoDB buffer pool not flushed → dirty pages captured. InnoDB crash recovery runs on startup. | InnoDB redo log must be present. MyISAM tables will be inconsistent and require manual repair. | Medium (InnoDB‑only) / High (mixed engine) |
| Oracle | Datafiles captured without RMAN coordination. Instance recovery runs on startup using redo logs. | All redo‑log members must be present. Because RMAN was not invoked, the snapshot is not recorded in the RMAN recovery catalog. | High |
| MongoDB | WiredTiger journal not synced. Journal replay runs on startup. | Journal files must be intact. Replica resync may be required if replay fails. | Medium |
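For the MySQL / MariaDB row, the "mixed engine" risk can be measured by listing tables that do not use InnoDB. This is a sketch, assuming the `mysql` command-line client and its `-N` (no column names) and `-B` (tab-separated batch output) options; the filter function `flag_non_innodb` is a hypothetical helper, and connection options are left out.

```shell
#!/bin/sh
# Read tab-separated rows "schema<TAB>table<TAB>engine" on stdin and print
# every table whose engine is not InnoDB, as "schema.table (engine)".
flag_non_innodb() {
    awk -F '\t' '$3 != "InnoDB" { print $1 "." $2 " (" $3 ")" }'
}

# Example usage (credentials and host are assumptions, omitted here):
# mysql -N -B -e "
#     SELECT table_schema, table_name, engine
#     FROM information_schema.tables
#     WHERE table_type = 'BASE TABLE'
#       AND table_schema NOT IN
#           ('mysql', 'information_schema', 'performance_schema', 'sys')
# " | flag_non_innodb
```

An empty result suggests the instance falls in the InnoDB‑only category of the table above; any output lists tables that would not be covered by InnoDB crash recovery.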
Scenario comparison
| Scenario | Crash‑Consistent | App‑Consistent |
|---|---|---|
| Full VM restore, database attach | Crash recovery runs. May succeed or fail depending on log integrity → unpredictable. | Mounts cleanly. Predictable restore time. |
| Point‑in‑time recovery required | Requires an unbroken log chain from snapshot to target. Any gap makes PITR impossible. | Clean base + log chain. Reliable if log backups are configured. |
| Log files on separate volume, not in snapshot | ❌ Recovery fails. Database un‑attachable. | Not applicable – app‑consistent includes all required files. |
| Ransomware recovery | Recovery state uncertain. Integrity validation extends window. | Known‑good state. Deterministic recovery. |
| App‑aware processing silently failed at backup | ❌ Operator discovers crash‑consistent backup during recovery. No warning was issued. | Agent failure surfaces as a job warning – visible, not silent. |
| Recovery to a different engine version | ❌ Crash recovery behaviour varies between versions. May fail on target. | Standard restore procedures apply. |
Common tooling & commands
Windows – VSS
Run `vssadmin list writers` and verify that the database writer reports “Stable”.
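The writer check can be scripted against saved output. The sketch below assumes the output of `vssadmin list writers` has been captured to a file or pipe, and that each writer appears as a `Writer name: '...'` line followed by a `State: [n] ...` line. The function name `unstable_writers` is hypothetical, and the script runs wherever a POSIX shell is available (for example, on a machine that collects the output).

```shell
#!/bin/sh
# Read `vssadmin list writers` output on stdin and print the name of every
# writer whose State line does not say "Stable". Prints nothing if all are stable.
unstable_writers() {
    # Windows output uses CRLF line endings; strip the carriage returns first.
    tr -d '\r' | awk -v quote="'" '
        /Writer name:/ {
            name = $0
            sub(/^.*Writer name: */, "", name)   # drop the label
            gsub(quote, "", name)                # drop surrounding quotes
        }
        $1 == "State:" && !/Stable/ { print name }
    '
}

# Example usage (file name is an assumption):
# unstable_writers < writers.txt
```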
Linux – Pre/Post‑Freeze Scripts

    # Pre‑freeze (quiesce)
    systemctl stop mysql    # example for MySQL
    # Post‑freeze (resume)
    systemctl start mysql

Database Agents – Install the vendor‑provided agent, configure credentials, and enable the “application‑aware” option in the backup policy.
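One practical concern with the pre/post‑freeze pattern is making sure the resume step runs even when the snapshot itself fails, so the database is never left stopped. The wrapper below is a minimal sketch of that idea; `run_quiesced` is a hypothetical function name, and `take-snapshot` in the usage comment is a placeholder for whatever snapshot command the environment actually uses.

```shell
#!/bin/sh
# run_quiesced PRE_CMD SNAPSHOT_CMD POST_CMD
# Runs PRE_CMD, then SNAPSHOT_CMD, then always runs POST_CMD.
# Exit status: 1 if quiescing failed (no snapshot is taken),
#              3 if the resume step failed,
#              otherwise the exit status of SNAPSHOT_CMD.
run_quiesced() {
    pre_cmd=$1
    snapshot_cmd=$2
    post_cmd=$3

    # If quiescing fails, stop here: a snapshot taken now would be
    # crash-consistent while looking like an app-consistent one.
    sh -c "$pre_cmd" || return 1

    snapshot_status=0
    sh -c "$snapshot_cmd" || snapshot_status=$?

    # Resume regardless of the snapshot result.
    sh -c "$post_cmd" || return 3

    return "$snapshot_status"
}

# Example usage (take-snapshot is a placeholder, not a real command):
# run_quiesced "systemctl stop mysql" "take-snapshot" "systemctl start mysql"
```

Returning distinct exit codes lets the calling backup job report a failed quiesce as a failure rather than as a successful backup.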
Takeaway
- Crash‑consistent backups shift risk to restore time.
- App‑consistent backups deal with that risk at backup time, where a failure is visible and can be fixed, but they require deliberate integration.
If you’re still relying on “the backup job ran successfully” as proof of safety, you’re missing the most critical piece of the puzzle: recoverability. Validate quiescing, agent health, log inclusion, and test restores – every single time.
Backup policies are designed by architects. They are discovered by engineers during recovery.
- Crash‑consistent backups are not wrong — they are appropriate for stateless workloads and as a fallback when app‑consistent integration is not feasible.
- However, they are not appropriate as the default strategy for production databases where recovery time, recovery point, and data integrity are defined requirements.
The shift to app‑consistent database backup is not a technology problem. Every enterprise backup platform supports it. It is an integration and validation problem — one that requires deliberate configuration, agent deployment, and restore testing to confirm that what the dashboard shows as protected is actually recoverable.
The five questions in the checklist above exist because the dashboard cannot answer them. Ask them before a recovery event, not during one.
Originally published at rack2cloud.com.