Part 5: From One Server to Many - The Need for Orchestration
Series: From “Just Put It on a Server” to Production DevOps
Level: Intermediate
The Production Reality Check
Your SSPP platform is live! Docker Compose works beautifully on your local machine and even on your single production server.
Then Black Friday hits. Traffic spikes 50×.
What do you do?
You can’t just run
docker-compose up --scale worker=50
because:
- One server doesn’t have 50× the resources.
- The database would be overwhelmed.
- You’d need multiple servers.
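To be fair, --scale does do something on a single host; it just hits these limits immediately. A quick sketch of what you'd see (assuming the API publishes host port 3000, as in the earlier parts of this series):

```bash
# Workers with no published ports will replicate until the host runs out of CPU/RAM:
docker-compose up -d --scale worker=50

# Scaling a service that publishes a fixed host port fails immediately,
# typically with a "port is already allocated" error:
docker-compose up -d --scale api=5
```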
So you start scaling manually:
# Rent 5 more Linode servers
# SSH into each one
# Install Docker on each
# Copy docker-compose.yml to each
# Modify each to avoid port conflicts
# Start containers manually
# Configure a load balancer somehow
# Hope nothing breaks
Time to scale: 3–4 hours (if you’re fast and lucky)
By the time you’re done, Black Friday is over.
Failure Scenario 1 – Container Crashes
Simulate a production crash
# Start your stack
docker-compose up -d
# Kill the API container
docker kill sspp-api
What happens?
The API is dead. With no restart policy configured (the default), Docker Compose won’t bring it back.
docker-compose ps
NAME STATE
sspp-api Exited (137)
sspp-worker Up
sspp-postgres Up
sspp-redis Up
Users see 500 errors. Your on‑call phone explodes. 📱💥
Manual fix
docker-compose up -d api
Downtime: 2–10 minutes (detection + SSH + restart)
In a production system you need automatic recovery.
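Compose does offer one partial answer: restart policies. A minimal sketch against the existing docker-compose.yml (the service name is the one used above); note that this only restarts a container whose process exits. It can't detect a hung-but-running API, drain traffic first, or survive the whole server going down:

```yaml
# docker-compose.yml (fragment): bring the API back if its process dies
services:
  api:
    restart: unless-stopped   # "always" or "on-failure" also work
```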
Failure Scenario 2 – Server Crashes
Even worse—the entire server goes down:
# Simulate server crash (don’t actually run this!)
sudo reboot -f
What happens?
- API: dead
- Worker: dead
- PostgreSQL: dead (data persisted in volumes, but service down)
- Redis queue: empty (all queued jobs lost; see the persistence note below)
- Users: angry
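About those lost jobs: Redis can keep its queue across a reboot if you turn on append-only persistence and mount a volume. A sketch, assuming a stock redis image (the tag and volume name are illustrative):

```yaml
# docker-compose.yml (fragment): persist queued jobs to disk
redis:
  image: redis:7
  command: ["redis-server", "--appendonly", "yes"]
  volumes:
    - redis-data:/data
```

That protects the data, but nothing restarts the services or reroutes traffic while the machine is down.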
Manual recovery
# Wait for server to boot (~2 minutes)
# SSH in
docker-compose up -d
# Wait for services to start (~30 seconds)
# Hope data is intact
Downtime: 3–5 minutes minimum
Lost data: All queued jobs
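The usual band-aid: make sure the Docker daemon itself is enabled at boot, so that restart policies (Scenario 1) bring the containers back without anyone SSHing in. A sketch; the Redis key name below is purely illustrative:

```bash
# Let the daemon (and any containers with restart policies) come back on boot
sudo systemctl enable docker

# After the reboot: did everything return, and how many queued jobs survived?
docker-compose ps
docker exec sspp-redis redis-cli llen jobs   # "jobs" is a made-up key name
```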
Failure Scenario 3 – Rolling Update Gone Wrong
You need to deploy a critical bug fix:
# Build new image
docker-compose build api
# Restart with new image
docker-compose up -d api
What happens?
- Old API container stops (connections dropped)
- New API container starts
- 5–30 seconds of downtime while it boots
If the new version has a bug, you must roll back manually. (A way to watch this downtime live follows the list below.)
The deployment strategy you’re using:
- No blue/green deployment
- No canary releases
- No gradual rollout
- Just… restart and pray 🙏
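To watch the downtime live, poll the API in one terminal while redeploying in another. A sketch, assuming the API answers on localhost:3000 (any route will do):

```bash
# Terminal 1: print one HTTP status code every half second
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 2 http://localhost:3000/
  sleep 0.5
done

# Terminal 2: rebuild and redeploy
docker-compose build api && docker-compose up -d api
```

Watch the stream of 200s get interrupted (curl prints 000 when it can't connect) while the new container boots.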
Failure Scenario 4 – Manual Scaling Nightmare
Traffic is increasing. You need 5 API instances spread across 3 servers.
Server 1 (docker-compose.yml)
services:
  api:
    ports:
      - "3000:3000"   # Occupies port 3000
Server 2 (docker-compose.yml)
services:
  api:
    ports:
      - "3000:3000"   # Same port; works because it’s a different server
How do users reach them?
You need a load balancer to expose a single entry point:
┌──────────────┐
│ Load Balancer│
│ (HAProxy?) │
└───────┬──────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
Server 1 Server 2 Server 3
API:3000 API:3000 API:3000
Manual steps required
| Step | Description |
|---|---|
| 1️⃣ | Install HAProxy (or another reverse‑proxy) on a dedicated node |
| 2️⃣ | Write a HAProxy configuration that includes health‑checks for each API instance |
| 3️⃣ | Add the IP addresses of all API servers manually to the config |
| 4️⃣ | Reload/restart HAProxy whenever a server is added, removed, or its IP changes |
| 5️⃣ | Configure SSL termination (certificates, SNI, etc.) |
| 6️⃣ | Set up monitoring/alerting for HAProxy health and backend availability |
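Steps 2 and 3 boil down to a config file like the sketch below. The backend IPs and the /health path are placeholders, and every one of those server lines has to be edited by hand whenever the fleet changes:

```
# haproxy.cfg (sketch)
frontend http_in
    bind *:80
    default_backend sspp_api

backend sspp_api
    balance roundrobin
    option httpchk GET /health
    server api1 192.0.2.11:3000 check
    server api2 192.0.2.12:3000 check
    server api3 192.0.2.13:3000 check
```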
Cost of this approach
- Time to set up: 2–4 hours
- Maintenance burden: High (every change requires manual config edits and a reload)
- Error‑prone: Very (typos, forgotten servers, stale health‑check settings)
Failure Scenario 5 – Database Connection Limits
Your PostgreSQL server has a max_connections limit (default = 100).
Why it fails
| Component | Instances | Connections per instance | Total connections |
|---|---|---|---|
| API | 10 | 10 | 100 |
| Worker | 10 | 10 | 100 |
| Overall | — | — | 200 |
- Max allowed: 100
- Result: Half of the containers cannot obtain a database connection.
Manual fix
- Configure connection pooling in each service (e.g., PgBouncer, HikariCP).
- Increase PostgreSQL max_connections to a value that comfortably exceeds the expected total (e.g., 250–300).
- Restart the database and all services to apply the new settings.
- Verify the connection count with SELECT count(*) FROM pg_stat_activity; and ensure no errors appear.
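Concretely, the inspection and the ceiling check look something like this (the container, service, and user names are assumptions based on the stack above):

```bash
# What is the current ceiling, and how close are we to it?
docker exec sspp-postgres psql -U postgres -c "SHOW max_connections;"
docker exec sspp-postgres psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Raising the ceiling means editing postgresql.conf (max_connections = 300)
# and restarting the database; a pooler such as PgBouncer avoids that treadmill.
docker-compose restart postgres
```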
What You Need (But Don’t Have)
| Feature | Why It Matters |
|---|---|
| Self‑healing | Automatically restart failed containers |
| Auto‑scaling | Add/remove instances based on load |
| Load balancing | Distribute traffic across instances |
| Service discovery | Containers find each other dynamically |
| Rolling updates | Deploy without downtime |
| Rollback capability | Revert bad deployments instantly |
| Health checks | Prevent routing traffic to unhealthy containers |
| Resource limits | Stop a container from starving others |
| Secrets management | Keep passwords and keys out of plain text |
| Multi‑server orchestration | Run workloads across many machines |
Note: Docker Compose offers at best rudimentary versions of a couple of these (restart policies, simple health checks, per-container limits); the rest it doesn’t provide at all, and none of it works across more than one server.
The Orchestration Gap
Docker Compose is great for development:
- Single‑server setups
- Manual start/stop of services
- Simple networking
- Fast iteration cycles
But it falls short for production workloads:
- No multi‑server (cluster) support
- No automatic recovery or self‑healing
- No built‑in scaling logic
- No deployment strategies (e.g., blue‑green, canary)
- No cluster-level resource management (only per-container CPU and memory limits)
- No production‑grade networking features
You’ve hit the orchestration wall.
The Emotional Journey
| Stage | Emotion | Quote |
|---|---|---|
| 1 – Denial | Denial | “It works locally, so it will work in prod.” |
| 2 – Frustration | Anger | “Why does every tiny change break everything? I just want to run containers!” |
| 3 – Realization | Bargaining | “I need a proper orchestrator… or maybe I can just script this with Bash and cron jobs?” |
| 4 – Action | Depression giving way to acceptance | “I’m spending 80% of my time managing infrastructure, 20% building features. Time to adopt Kubernetes / Nomad / Swarm.” |
| 5 – Acceptance | Acceptance | “I need an orchestrator. I need Kubernetes.” |
Summary
- Denial – “Docker Compose works fine. I’ll just run it on a big server.”
- Frustration / Anger – “Why is this so hard?! I just want to run containers!”
- Bargaining / Realization – “Maybe I can script this with Bash and cron jobs?”
- Depression – “I’m spending 80 % of my time managing infrastructure, 20 % building features.”
- Acceptance – “I need an orchestrator. I need Kubernetes.”
Why Kubernetes Exists
| Problem | Docker Compose | Kubernetes |
|---|---|---|
| Auto‑restart | ❌ Manual | ✅ Automatic |
| Multi‑server | ❌ Single server | ✅ Cluster of servers |
| Load balancing | ❌ Manual HAProxy | ✅ Built‑in Service |
| Scaling | ❌ Manual --scale | ✅ Auto‑scaling (HPA) |
| Rolling updates | ❌ Restart (downtime) | ✅ Zero‑downtime |
| Rollback | ❌ Manual | ✅ One command |
| Health checks | ⚠️ Basic | ✅ Advanced (liveness, readiness) |
| Secrets | ❌ Plain‑text env files | ✅ Secret objects (RBAC‑scoped, encryptable at rest) |
| Resource limits | ⚠️ Basic | ✅ Fine‑grained |
| Service discovery | ⚠️ DNS‑based | ✅ Dynamic |
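Several of those ✅ cells are a single command once a Deployment exists. A sketch; the Deployment and container names are placeholders until Part 6:

```bash
# Scale out by hand, or hand the decision to an autoscaler
kubectl scale deployment sspp-api --replicas=10
kubectl autoscale deployment sspp-api --min=3 --max=20 --cpu-percent=70

# Rolling update, then an instant rollback if the new image misbehaves
kubectl set image deployment/sspp-api api=sspp/api:1.0.1
kubectl rollout status deployment/sspp-api
kubectl rollout undo deployment/sspp-api
```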
Kubernetes is Docker Compose for production, multiplied by 1000.
But Why Not… [Alternative]?
“Why not Docker Swarm?”
- Smaller ecosystem – fewer third‑party tools and extensions.
- Limited features – no built-in autoscaling (no equivalent of Kubernetes’ HPA), more limited RBAC.
- Lower adoption – most modern tools target Kubernetes.
- De‑prioritized by Docker Inc.
Typical use case: Small teams running simple applications.
“Why not managed services (AWS ECS, Cloud Run)?”
- Vendor lock‑in – moving workloads to another provider is hard.
- Limited customization – you can’t tweak the underlying platform.
- Higher cost at scale – pay‑as‑you‑go pricing can become expensive.
- Not portable – cannot run the same workload locally without the provider’s stack.
Typical use case: Organizations fully committed to a single cloud provider.
“Why not Nomad?”
- Smaller community – fewer community resources and examples.
- Fewer integrations – limited out‑of‑the‑box support for many services.
- Less tooling – ecosystem not as mature as Kubernetes’.
- Hiring challenges – fewer engineers with Nomad experience.
Typical use case: Teams already invested in the HashiCorp ecosystem (Terraform, Vault, Consul).
“Why Kubernetes?”
- Industry standard – most job listings require Kubernetes expertise.
- Huge ecosystem – tools for monitoring, CI/CD, security, networking, etc.
- Cloud‑agnostic – runs on AWS, GCP, Azure, Linode, on‑prem, and edge.
- Local development – lightweight options like k3s, Minikube, Kind.
- Portability – the same manifests work everywhere.
Bottom line: Kubernetes won the orchestration war.
What You’ll Learn in Part 6
In the next article we’ll deploy SSPP to Kubernetes on Linode.
But we won’t just throw kubectl commands at you.
We’ll cover:
- What Pods, Deployments, and Services actually are
- Why Kubernetes seems complicated (and how to think about it)
- How to run Kubernetes locally (k3s) before going to production
- Real deployment strategies (rolling updates, blue/green)
- How our SSPP manifests work
No magic. No copy‑paste. Just understanding.
The Mindset Shift
| Before Kubernetes | After Kubernetes |
|---|---|
| “I have a server. I’ll put containers on it.” | “I have a cluster. I’ll declare what I want running. Kubernetes makes it happen.” |
Declarative infrastructure
# You say what you want
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 5   # I want 5 API instances
Kubernetes then:
- Schedules 5 Pods
- Distributes them across nodes
- Monitors them
- Restarts them if they die
- Scales up/down dynamically
You describe the desired state. Kubernetes maintains it.
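Spelled out in full, that fragment becomes something like the manifest below. It's a sketch only: the names, labels, image, and port are placeholders until Part 6.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sspp-api
spec:
  replicas: 5                      # desired state: five API Pods
  selector:
    matchLabels:
      app: sspp-api
  template:
    metadata:
      labels:
        app: sspp-api
    spec:
      containers:
        - name: api
          image: sspp/api:1.0.0    # placeholder image
          ports:
            - containerPort: 3000
```

Apply it with kubectl apply -f, delete a Pod by hand, and watch the controller immediately create a replacement to get back to five.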
Try It Yourself (Before Part 6)
Challenge: Break Docker Compose in creative ways:
- Kill containers – see if they restart (they won’t)
- Overload the API – see if it auto‑scales (it won’t)
- Deploy a new version – see if there’s downtime (there will be)
- Simulate high CPU – see if K8s would help (it would)
Write down your frustrations. They’ll make Part 6 more satisfying.
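If you want concrete commands for the first two experiments, something like this works (the load-testing tool and port are assumptions; use whatever you have installed):

```bash
# 1. Kill the API and watch: nothing restarts it
docker kill sspp-api && docker-compose ps

# 2. Hammer the API and watch: nothing scales it
ab -n 5000 -c 100 http://localhost:3000/    # Apache Bench; hey or wrk work too
docker stats --no-stream

# 3. For the deployment-downtime test, reuse the curl polling loop from Scenario 3.
```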
Discussion
What production incident convinced you that you needed orchestration?
Share your war stories on GitHub Discussions.
- Previous: Part 4 – Running Multiple Services Locally with Docker Compose
- Next: Part 6 – Kubernetes from First Principles (No Magic)
About the Author
Documenting a real DevOps journey for the Proton.ai application. Connect with me:
- GitHub: @daviesbrown
- LinkedIn: David Nwosu Brown