Optimizing PostgreSQL Queries for Large-Scale Data Applications
Source: Dev.to
1. Introduction
Over the past two years we’ve taken our slowest queries from more than 15 seconds down to sub‑200 ms. The starting point looked like this:
❌ Query time: 18 seconds – it killed responsiveness across the app.
2. Indexing Strategies
2.1 B‑tree indexes (equality & range)
-- Single‑column indexes
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Composite index for multi‑column filter
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, created_at);
Impact: Query time ↓ to 8 seconds.
2.2 Partial indexes (index only active rows)
CREATE INDEX idx_orders_active
ON orders(created_at)
WHERE status = 'active';
Reduces index size by ~60 %.
Impact: Query time ↓ to 4.5 seconds.
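Note: the planner only considers a partial index when the query's WHERE clause implies the index predicate. A quick illustrative query (same table and columns as above):
-- Includes status = 'active', so idx_orders_active is a candidate
SELECT id, customer_id, total_amount
FROM orders
WHERE status = 'active'
  AND created_at >= CURRENT_DATE - INTERVAL '7 days';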
2.3 GIN index for JSONB columns
CREATE INDEX idx_orders_metadata
ON orders USING GIN (metadata);
-- Fast JSONB lookup
SELECT *
FROM orders
WHERE metadata @> '{"priority": "high"}';
2.4 Monitoring & Maintenance
-- Index usage statistics
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Drop unused indexes
DROP INDEX IF EXISTS unused_index_name;
-- Rebuild bloated indexes concurrently
REINDEX INDEX CONCURRENTLY idx_orders_created_at;
3. Query‑Level Optimizations
3.1 Select only needed columns
-- ❌ Bad
SELECT * FROM orders WHERE status = 'active';
-- ✅ Good
SELECT id, customer_id, total_amount, created_at
FROM orders
WHERE status = 'active';
Impact: ~40 % reduction in data transfer time.
3.2 Explain & Analyze
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days';
Key metrics to watch
| Plan node | Interpretation |
|---|---|
| Seq Scan | Usually a red flag on large tables – often means a missing index |
| Index Scan | Good for selective filters |
| Bitmap Heap Scan | Good when a filter matches a larger share of rows |
| Nested Loop | Fine for small inner sets; slow when both sides are large |
| Hash Join | Usually the right choice for joining large tables |
3.3 Join type choice
-- ❌ Bad: SELECT * across many LEFT JOINs when the joined rows are required anyway
SELECT o.*, c.*, p.*, s.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
LEFT JOIN products p ON o.product_id = p.id
LEFT JOIN shipments s ON o.id = s.order_id;
-- ✅ Good: use INNER JOIN when rows must exist
SELECT o.id, o.total_amount, c.name, p.title
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed';
4. Partitioning
4.1 Range partitioning by created_at
-- Parent table
CREATE TABLE orders (
id SERIAL,
customer_id INTEGER,
total_amount DECIMAL(10,2),
created_at TIMESTAMP NOT NULL,
status VARCHAR(50)
) PARTITION BY RANGE (created_at);
-- Monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Index on each partition
CREATE INDEX idx_orders_2024_01_customer
ON orders_2024_01(customer_id);
Impact: Date‑range scans became 10× faster.
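The speedup comes from partition pruning: the planner skips partitions whose date ranges cannot match. A quick illustrative check (not from the original article) that pruning kicks in:
EXPLAIN
SELECT COUNT(*)
FROM orders
WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01';
-- The plan should touch only orders_2024_02, not every partition.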
4.2 List partitioning by status
CREATE TABLE orders (
id SERIAL,
customer_id INTEGER,
status VARCHAR(50),
created_at TIMESTAMP
) PARTITION BY LIST (status);
CREATE TABLE orders_active PARTITION OF orders
FOR VALUES IN ('pending', 'processing', 'shipped');
CREATE TABLE orders_completed PARTITION OF orders
FOR VALUES IN ('delivered', 'completed');
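One caveat worth adding: an INSERT whose status is not covered by either list will fail unless a DEFAULT partition exists. A small, optional addition (supported since PostgreSQL 11):
-- Catch-all partition for any status value not listed above
CREATE TABLE orders_other PARTITION OF orders DEFAULT;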
5. Connection Pooling (PgBouncer)
pgbouncer.ini
[databases]
myapp_db = host=localhost port=5432 dbname=production
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
6. Node.js Implementation
// db.js
const { Pool } = require('pg');
const pool = new Pool({
host: 'localhost',
port: 6432, // PgBouncer port
database: 'myapp_db',
user: 'app_user',
password: process.env.DB_PASSWORD,
max: 20, // max connections in pool
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
});
// Efficient query execution
async function getActiveOrders() {
const client = await pool.connect();
try {
const result = await client.query(
'SELECT id, total_amount FROM orders WHERE status = $1',
['active']
);
return result.rows;
} finally {
client.release(); // always release!
}
}

module.exports = { pool, getActiveOrders };
Impact: Connection‑overhead ↓ 70 %.
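A small node-postgres note: for one-off statements you don't need to check a client out manually – pool.query() acquires and releases one for you:
// Equivalent one-off query without manual connect/release
const { rows } = await pool.query(
  'SELECT id, total_amount FROM orders WHERE status = $1',
  ['active']
);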
7. Redis Caching Layer
// cache.js
const redis = require('redis');
const { pool } = require('./db'); // the PostgreSQL pool exported from db.js
const client = redis.createClient();
client.connect().catch(console.error); // node-redis v4+ needs an explicit connect
async function getCustomerOrders(customerId) {
const cacheKey = `customer:${customerId}:orders`;
// 1️⃣ Check cache first
const cached = await client.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// 2️⃣ Fallback to DB
const result = await pool.query(
'SELECT * FROM orders WHERE customer_id = $1',
[customerId]
);
// 3️⃣ Cache for 5 minutes
await client.setEx(cacheKey, 300, JSON.stringify(result.rows));
return result.rows;
}
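Cached results go stale when a customer's orders change, so a simple companion is to drop the key on write (the helper name here is illustrative):
// Call after inserting or updating a customer's orders
async function invalidateCustomerOrders(customerId) {
  await client.del(`customer:${customerId}:orders`);
}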
8. Materialized View for Heavy Aggregations
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE(created_at) AS sale_date,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at);
Refresh as needed (e.g., nightly) to keep reporting fast.
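Reporting queries then hit the precomputed summary instead of scanning orders, e.g.:
SELECT sale_date, total_orders, total_revenue, avg_order_value
FROM daily_sales_summary
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY sale_date;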
TL;DR
- Add appropriate B‑tree, partial, and GIN indexes.
- Select only required columns and use INNER JOIN where possible.
- Partition large tables by date or status.
- Monitor index usage and reindex bloated indexes.
- Pool connections with PgBouncer and cache frequent lookups in Redis.
- Use materialized views for expensive aggregations.
These steps collectively turned a 15‑second query into sub‑200 ms responses, enabling our application to scale gracefully with millions of daily transactions.
Materialized View Indexing
-- A UNIQUE index is required before REFRESH ... CONCURRENTLY can be used
CREATE UNIQUE INDEX idx_daily_sales_date
ON daily_sales_summary(sale_date);
Refresh the view (can be scheduled)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;
Performance Impact: Dashboard queries went from 12 s to 100 ms.
Bulk Inserts
❌ Bad – Multiple individual inserts
// 10 000 separate INSERTs – one round trip each (newOrders is an illustrative array)
for (let i = 0; i < newOrders.length; i++) {
  await pool.query(
    'INSERT INTO orders (customer_id, total_amount, status) VALUES ($1, $2, $3)',
    [newOrders[i].customerId, newOrders[i].totalAmount, 'pending']
  );
}
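✅ Good – batch rows into a single multi‑row INSERT. A minimal sketch, assuming the same illustrative newOrders array (the insertOrdersBatch helper and its placeholder-building logic are illustrative, not from the article; for very large loads COPY is usually faster still):
async function insertOrdersBatch(newOrders) {
  // Build ($1, $2, $3), ($4, $5, $6), ... placeholders and a flat values array
  const values = [];
  const placeholders = newOrders.map((order, i) => {
    const base = i * 3;
    values.push(order.customerId, order.totalAmount, 'pending');
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });
  // One statement, one round trip (chunk huge arrays – PostgreSQL caps bind parameters at 65 535)
  await pool.query(
    `INSERT INTO orders (customer_id, total_amount, status) VALUES ${placeholders.join(', ')}`,
    values
  );
}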
Application‑Level Health Checks
// healthcheck.js – periodic checks run against the same pool as db.js
async function healthCheck() {
  const checks = {
    // Queries averaging over 1 second (mean_exec_time is in ms; PostgreSQL 13+ column name)
    slowQueries: await pool.query(`
      SELECT COUNT(*) FROM pg_stat_statements
      WHERE mean_exec_time > 1000
    `),
    connections: await pool.query(`
      SELECT COUNT(*) FROM pg_stat_activity
      WHERE state = 'active'
    `),
    tableSize: await pool.query(`
      SELECT pg_size_pretty(pg_database_size(current_database()))
    `)
  };
  console.log('Database Health:', checks);
  // Alert if thresholds exceeded
  if (parseInt(checks.slowQueries.rows[0].count, 10) > 10) {
    console.error('⚠️ Too many slow queries detected!');
  }
}
// Run every 5 minutes
setInterval(healthCheck, 5 * 60 * 1000);
Query Patterns
❌ Bad – N+1 queries
const orders = await pool.query('SELECT * FROM orders LIMIT 100');
for (let order of orders.rows) {
const customer = await pool.query(
'SELECT * FROM customers WHERE id = $1',
[order.customer_id]
);
order.customer = customer.rows[0];
}
✅ Good – Single JOIN
const result = await pool.query(`
SELECT o.*, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id
LIMIT 100
`);
❌ Bad – SQL‑injection risk & no plan caching
const query = `SELECT * FROM orders WHERE id = ${userId}`;
✅ Good – Prepared statement
const query = 'SELECT * FROM orders WHERE id = $1';
await pool.query(query, [userId]);
Full‑Text Search Setup
-- Add tsvector column
ALTER TABLE products ADD COLUMN search_vector tsvector;
-- Create GIN index
CREATE INDEX idx_products_search
ON products USING GIN(search_vector);
-- Update trigger to maintain search_vector
CREATE TRIGGER tsvector_update
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(search_vector, 'pg_catalog.english', title, description);
-- Efficient full‑text search
SELECT * FROM products
WHERE search_vector @@ to_tsquery('english', 'laptop & gaming');
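The trigger only fires on new writes, so rows that existed before it was created need a one-time backfill (this mirrors what tsvector_update_trigger computes for the two columns):
-- One-time backfill for pre-existing rows
UPDATE products
SET search_vector = to_tsvector('pg_catalog.english',
    coalesce(title, '') || ' ' || coalesce(description, ''));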
High‑Write Workload Tweaks
wal_compression = on
wal_writer_delay = 200ms
commit_delay = 10
commit_siblings = 5
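These are postgresql.conf parameters; they can also be set with ALTER SYSTEM, and for these particular settings a configuration reload (no restart) should be enough:
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET wal_writer_delay = '200ms';
ALTER SYSTEM SET commit_delay = 10;
ALTER SYSTEM SET commit_siblings = 5;
SELECT pg_reload_conf();  -- picks up the new values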
Best‑Practice Checklist
- Measure first: Use EXPLAIN ANALYZE before optimizing.
- Index wisely: More indexes ≠ better performance.
- Monitor continuously: pg_stat_statements is your best friend (setup snippet below).
- Test in production‑like environments: A local DB won’t reveal scaling issues.
- Apply incremental changes: Optimize one thing at a time.
- Document everything: Future you will thank present you.
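For reference, pg_stat_statements has to be preloaded once and created per database; the column names below are the PostgreSQL 13+ ones (older releases use total_time / mean_time):
-- postgresql.conf (requires a restart)
shared_preload_libraries = 'pg_stat_statements'
-- once per database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- top offenders by total execution time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;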
Optimizing PostgreSQL for large‑scale applications is an iterative process. The techniques shared here reduced query times by 100× and cut infrastructure costs by 60 %.
Key Takeaways
- ✅ Strategic indexing (not over‑indexing)
- ✅ Query restructuring & proper JOINs
- ✅ Partitioning for time‑series data
- ✅ Connection pooling with PgBouncer
- ✅ Caching at multiple levels
- ✅ Regular monitoring & maintenance
Tools:
- pgAdmin – GUI management
- pg_stat_statements – Query statistics
- PgBouncer – Connection pooling
- pgBadger – Log analyzer
- explain.depesz.com – Visual EXPLAIN analyzer
Have you faced similar scaling challenges? What optimization techniques worked for your use case? Let me know in the comments!
Tags: #performance #optimization #sql #backend #devops #scalability