Common Mistakes Enterprises Make with Cloud Storage and How to Avoid Them
Source: Dev.to
Over and over, I see big enterprises burn money, tank performance, or create compliance nightmares because they treat cloud storage like a magic infinite disk. It isn’t. It’s a toolbox. And if you use a hammer for everything, eventually you’re going to hit your thumb. Here are the most common mistakes I see, and how I’d avoid them if I were rebuilding from scratch.
1. Treating cloud storage like an on‑prem SAN
What I do instead
- Use object storage as the default for anything that is:
  - Shared across teams
  - Read‑heavy
  - Long‑lived
- Reserve block storage for latency‑sensitive, tightly coupled workloads (e.g., databases, certain legacy apps).
- If you catch yourself putting “everything” on block storage, that’s a red flag that you’re re‑implementing the old world in the cloud.
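The default above can be sketched as a tiny decision helper. This is an illustrative rule of thumb, not a real API — the function name and flags are made up for this post:

```python
def choose_storage(shared: bool, read_heavy: bool, long_lived: bool,
                   latency_sensitive: bool) -> str:
    """Illustrative default: object storage unless the workload is
    latency-sensitive and tightly coupled (databases, some legacy apps)."""
    if latency_sensitive:
        return "block"
    if shared or read_heavy or long_lived:
        return "object"
    # When nothing argues for block storage, object is still the safer default.
    return "object"
```

If almost every call in your estate would return `"block"`, that is the red flag: you are rebuilding the SAN.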
2. Keeping everything in the hottest (most expensive) tier
How to avoid it
- Classify data into hot / warm / cold / archive tiers.
- Apply automated lifecycle policies on every bucket by default:
```yaml
# Example lifecycle policy (pseudo‑YAML)
rules:
  - action: transition
    days: X            # after X days → cool tier
    storageClass: COOL
  - action: transition
    days: Y            # after Y days → archive tier
    storageClass: ARCHIVE
  - action: delete
    days: Z            # optional: delete after Z days
```
- Only exempt datasets where you can actively justify why they must stay hot.
- Rule of thumb: if no one can name a reason a dataset must be hot within 5 seconds, it probably shouldn’t be.
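On AWS, the pseudo-YAML above maps onto an S3 lifecycle configuration. Here is a minimal sketch that builds the equivalent rule set; the bucket name and day counts are placeholders, and applying it requires real credentials:

```python
def build_lifecycle_rules(cool_days, archive_days, delete_days=None):
    """Build an S3-style lifecycle rule set: hot -> infrequent access ->
    archive, with an optional expiration."""
    rule = {
        "ID": "default-tiering",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "Transitions": [
            {"Days": cool_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_days, "StorageClass": "GLACIER"},
        ],
    }
    if delete_days is not None:
        rule["Expiration"] = {"Days": delete_days}
    return [rule]

# Applying it with boto3 (needs AWS credentials; bucket name is a placeholder):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket",
#     LifecycleConfiguration={"Rules": build_lifecycle_rules(30, 90, 365)},
# )
```

Because the rules are plain data, you can unit-test them in CI before they ever touch a bucket.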
3. Ignoring egress and API costs
How I avoid this
- Co‑locate compute and storage in the same region by default.
- For high‑I/O workloads, shard small files into larger objects (e.g., WebDataset, TAR, Parquet).
- Use caching:
  - Local NVMe or node‑local SSDs as a read‑through cache for frequently accessed datasets.
- Set up cost dashboards that surface:
  - Top egress sources
  - Top buckets by API requests
If you don’t measure egress and API calls, you’ll be surprised. Cloud surprise is always expensive.
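The "shard small files into larger objects" advice boils down to: pay one GET per shard, not one per file. A minimal WebDataset-style sketch using only the standard library (the shard layout and naming are illustrative):

```python
import io
import tarfile

def pack_shard(samples, shard_path):
    """Pack many small (name, bytes) samples into one TAR shard so a
    training job issues one GET per shard instead of one per file."""
    with tarfile.open(shard_path, "w") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

A dataset of 10 million 50 KB files becomes a few thousand ~500 MB shards — which also cuts the per-request API charges the dashboards above would otherwise surface.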
4. No data locality strategy for performance‑critical workloads
My rule
- Data and compute must live as close as physically possible.
For big training workloads
- Keep canonical data in object storage in the same region.
- Stage active shards onto local NVMe before the job starts.
For critical real‑time inference
- Keep models and key features on local SSD / high‑performance block storage.
If you’re paying for high‑end GPUs, it’s almost always cheaper to over‑provision fast storage than to let those GPUs idle waiting for bytes.
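The staging pattern above is a read-through cache: canonical data stays in object storage, and a local NVMe copy is created on first access. A sketch, where `fetch_remote` is a stand-in for whatever object-store client you actually use (boto3, gcsfs, etc.):

```python
import os
import shutil

def read_through(key, cache_dir, fetch_remote):
    """Return a local path for `key`, downloading from object storage only
    on a cache miss. `fetch_remote(key, dest_path)` is a placeholder for
    your real object-store client."""
    local = os.path.join(cache_dir, key)
    if not os.path.exists(local):                  # cache miss
        os.makedirs(os.path.dirname(local), exist_ok=True)
        tmp = local + ".part"
        fetch_remote(key, tmp)                     # download to a temp file
        shutil.move(tmp, local)                    # publish only when complete
    return local
```

The `.part` temp file matters: a job that crashes mid-download must not leave a half-written file that later jobs mistake for a cache hit.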
5. Over‑sharing and under‑governing buckets
How I handle it
- Design for data domains, not “one bucket to rule them all”:
  `analytics-`, `ml-`, `raw-`, `archive-`, etc.
- Assign clear ownership per bucket/domain:
  - Data owner
  - Access‑policy owner
  - Lifecycle‑policy owner
- Use least‑privilege IAM:
  - Read‑only where possible
  - Narrow write permissions
  - Strong separation between production and experiment buckets
Security teams love this. So do auditors. More importantly, it reduces accidents.
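Concretely, "read-only where possible" looks like an AWS IAM-style policy scoped to one bucket and one prefix. A sketch built as plain data (bucket and prefix names are placeholders for your own domains):

```python
def read_only_policy(bucket, prefix):
    """AWS IAM-style policy granting list + read on a single prefix.
    Bucket and prefix names here are illustrative placeholders."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }
```

Note what is absent: no `s3:PutObject`, no `s3:DeleteObject`, no wildcard actions. Write access gets its own, narrower policy.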
6. No versioning, no backups, no restore tests
My practical approach
- Turn on versioning for any bucket storing production models, configs, or critical reference data.
- Define a clear replication / backup story:
- Cross‑region replication for “if this region dies, we’re in trouble” datasets.
- Separate “backup projects/accounts” to isolate from accidental deletion.
- Actually test restores:
- Pull a random dataset from backup.
- Time how long it takes and note any breakages.
If you’ve never practiced a restore, assume it doesn’t work.
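The restore drill above can be automated. A sketch where `restore_fn` and `verify_fn` stand in for your backup tooling and your checks (checksums, row counts, loadability):

```python
import time

def restore_drill(restore_fn, dataset, verify_fn):
    """Time a restore and record whether it actually verified.
    `restore_fn` and `verify_fn` are placeholders for real tooling."""
    start = time.monotonic()
    ok = False
    try:
        path = restore_fn(dataset)     # pull the dataset from backup
        ok = verify_fn(path)           # did the restored data check out?
    except Exception:
        ok = False                     # a crashing restore is a failed restore
    return {
        "dataset": dataset,
        "seconds": round(time.monotonic() - start, 2),
        "restored_ok": ok,
    }
```

Run it on a random dataset each month and keep the results: the `seconds` field becomes your real (not assumed) recovery time.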
7. Letting everyone do “whatever they want” forever
What I recommend
- Create a small set of storage patterns:
- “Analytics dataset pattern”
- “ML training dataset pattern”
- “Archive pattern”
- Provide templates and tooling:
- Terraform modules, bucket‑naming conventions, lifecycle defaults.
- Allow deviations—but make them explicit decisions, not accidents.
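Naming conventions only survive if something enforces them. A sketch of a check you could run in CI against proposed Terraform plans — the convention itself (`<domain>-<env>-<purpose>`) is a made-up example, not a standard:

```python
import re

# Hypothetical convention: <domain>-<env>-<purpose>, lowercase, hyphenated.
BUCKET_NAME = re.compile(
    r"^(analytics|ml|raw|archive)-(prod|staging|dev)-[a-z0-9-]+$"
)

def valid_bucket_name(name):
    """True if the bucket name follows the (example) house convention."""
    return bool(BUCKET_NAME.match(name))
```

Deviations then require editing the pattern in a reviewed pull request — an explicit decision, not an accident.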
The goal isn’t central control for its own sake; it’s to avoid having 20 ways to do the same thing, all slightly broken in different ways.
Bringing it together
- Where does your data live?
- Who owns which buckets?
- What are your lifecycle policies?
- How often do you move or restore data?
Most “GPU performance issues” I see are really storage‑design issues in disguise. If you treat cloud storage as a strategic system (classify data, control access, manage lifecycle, test restores, and care about locality), you’ll get better security, lower bills, and much happier GPUs.