Scale From Zero To Millions Of Users
This comprehensive guide explores the architectural evolution required to scale a system from zero to millions of users. We’ll examine real-world capacity planning, specific technologies, bottleneck analysis, and production considerations used at companies like Meta, Netflix, and Amazon. Each stage includes concrete numbers, technology choices, and tradeoffs that staff engineers encounter in production environments.
Scaling Fundamentals
Scaling refers to the ability of a system to handle growing workloads while maintaining performance, availability, and cost efficiency. There are two primary approaches:
Vertical Scaling (Scale Up):
- Increasing single server capacity (CPU, RAM, disk, network)
- Typical progression: 2 cores/4GB → 8 cores/32GB → 96 cores/768GB
- AWS EC2 limits: up to 448 vCPUs and 24TB RAM (u-24tb1.metal)
- Advantages: Simple, no code changes, strong consistency
- Disadvantages: Hardware limits (ceiling around $50K/month), single point of failure, downtime during upgrades
- Cost: Superlinear (doubling RAM often costs 2-3x more)
Horizontal Scaling (Scale Out):
- Adding more servers to distribute load
- Typical progression: 1 → 3 → 10 → 100 → 1000+ servers
- Advantages: No theoretical limit, fault tolerance, cost-effective at scale
- Disadvantages: Complex architecture, eventual consistency, network overhead
- Cost: Roughly linear (2x servers ≈ 2x cost), though coordination overhead makes throughput gains slightly sub-linear
Key Principle: Use vertical scaling initially (simpler), transition to horizontal scaling as you approach 10K+ concurrent users.
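The key principle above can be made concrete with a back-of-envelope cost model. All numbers below are illustrative assumptions, not benchmarks: vertical capacity doublings are assumed to cost ~2.5x each, while horizontal nodes add capacity at linear cost minus some coordination overhead.

```python
import math

def vertical_cost(req_per_sec, base_cost=100, base_capacity=50, step_cost=2.5):
    """Monthly cost if we keep doubling one server until it fits the load."""
    cost, capacity = base_cost, base_capacity
    while capacity < req_per_sec:
        capacity *= 2
        cost *= step_cost  # each doubling costs ~2.5x, i.e. superlinear pricing
    return cost

def horizontal_cost(req_per_sec, node_cost=100, node_capacity=50, efficiency=0.8):
    """Monthly cost of N identical nodes; efficiency models coordination overhead."""
    nodes = math.ceil(req_per_sec / (node_capacity * efficiency))
    return nodes * node_cost
```

At low load the single big server wins (e.g. 100 req/sec: $250 vertical vs $300 horizontal under these assumptions), but by ~10K req/sec the superlinear hardware pricing makes horizontal scaling far cheaper — which is exactly the crossover the key principle describes.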
Stage 1: Single Server (0-1K Users)
Architecture
┌─────────────────────────────────┐
│ Single Server (4 cores) │
│ ┌──────────────────────────┐ │
│ │ Web Server (Nginx) │ │
│ │ App Server (Node.js) │ │
│ │ Database (PostgreSQL) │ │
│ │ Static Files │ │
│ └──────────────────────────┘ │
└─────────────────────────────────┘
Capacity Analysis
- Hardware: 4 vCPUs, 16GB RAM, 200GB SSD
- Cost: $80-160/month (AWS t3.xlarge, GCP n2-standard-4)
- Capacity: 10-50 req/sec, 1K daily active users
- Database: Single PostgreSQL instance, 10-20GB data
- Response Time: 50-200ms average
Technology Stack
- Web Server: Nginx (handles 10K connections per core)
- Application: Node.js, Python Django, Ruby Rails
- Database: PostgreSQL (ACID compliance) or MySQL
- OS: Ubuntu 22.04 LTS, Amazon Linux 2023
Bottlenecks
- CPU contention between app and DB during traffic spikes
- Disk I/O becomes bottleneck at 1K+ writes/sec
- Memory pressure when DB + app exceed 16GB
- Network limits on single network interface (1-5 Gbps)
When to Evolve
- Response times consistently exceed 500ms
- CPU utilization sustained above 70%
- Database queries taking >100ms
- Downtime affects revenue/reputation
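The measurable signals above can be encoded directly as an alert check. Metric names and values here are illustrative; in practice they would come from your monitoring system (CloudWatch, Prometheus, etc.).

```python
# Encode the "when to evolve" signals as a single check.
EVOLVE_SIGNALS = {
    "p95_response_ms": (500, "response times consistently exceed 500ms"),
    "cpu_utilization": (70,  "CPU utilization sustained above 70%"),
    "slow_query_ms":   (100, "database queries taking >100ms"),
}

def should_evolve(metrics):
    """Return the list of tripped signals for the current metrics."""
    return [reason for name, (limit, reason) in EVOLVE_SIGNALS.items()
            if metrics.get(name, 0) > limit]

# Example: CPU and query latency both over threshold
tripped = should_evolve({"p95_response_ms": 320, "cpu_utilization": 85,
                         "slow_query_ms": 140})
```

If any signal trips for a sustained window (not a single spike), it is time to plan the next stage.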
Stage 2: Separate Database (1K-10K Users)
Architecture
┌──────────────────────┐
Internet ───→ │ Web/App Server │
│ (8 cores, 32GB) │
└──────────┬───────────┘
│
│ Private Network
│ (10 Gbps)
↓
┌──────────────────────┐
│ Database Server │
│ PostgreSQL │
│ (8 cores, 64GB) │
│ 500GB SSD │
└──────────────────────┘
Capacity Analysis
- Web/App Server: 8 vCPUs, 32GB RAM, handles 100-200 req/sec
- Database Server: 8 vCPUs, 64GB RAM (50% for DB cache), 500GB SSD
- Combined Cost: $400-600/month
- Capacity: up to ~10K daily active users (1M-2M requests/day)
- Database: 50-200GB data, 500-1K queries/sec
Key Improvements
- Independent Scaling: Scale web and DB separately based on bottlenecks
- Resource Isolation: DB gets dedicated memory for caching (shared_buffers)
- Network Optimization: 10 Gbps private network reduces latency to <1ms
- Backup Strategy: Daily full backups, hourly incrementals (500GB = 30min backup)
Database Optimization
-- PostgreSQL tuning for 64GB RAM server
shared_buffers = 16GB -- 25% of RAM
effective_cache_size = 48GB -- 75% of RAM
work_mem = 128MB -- Per query operation
maintenance_work_mem = 2GB -- For VACUUM, CREATE INDEX
max_connections = 200 -- Application pool size
-- Indexes for common queries
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX idx_posts_created_at ON posts(created_at DESC);
Bottlenecks
- Single point of failure - DB downtime = total outage
- Read-heavy workloads - 80% reads, 20% writes is typical
- Connection pooling - Max 200 connections becomes limit
- Query performance - Complex joins slow down at 100M+ rows
Vertical Scaling Runway
- Can scale to 96 vCPUs, 768GB RAM (e.g., AWS r6i.24xlarge)
- Handles up to 10K req/sec with proper indexing
- Cost: $5K-10K/month at maximum vertical scale
Stage 3: Load Balancer + Multiple Web Servers (10K-100K Users)
Architecture
┌──────────────────┐
Internet ──────────→ │ Load Balancer │
│ (HAProxy/ALB) │
└────────┬─────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Web-1 │ │ Web-2 │ │ Web-3 │
│ 8 cores │ │ 8 cores │ │ 8 cores │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
↓
┌──────────────────┐
│ Database (RW) │
│ 16 cores │
└──────────────────┘
Capacity Analysis
- Load Balancer: AWS ALB (50K connections, 3K req/sec per AZ)
- Web Servers: 3-5 instances, each handles 150-200 req/sec
- Total Capacity: 600-1000 req/sec, 5M-10M requests/day
- Cost: $1K-2K/month (LB: ~$20 + 5×$160 in instances, plus data transfer)
- Availability: 99.9% (single AZ) to 99.99% (multi-AZ)
Load Balancing Strategies
1. Round Robin
Request 1 → Web-1
Request 2 → Web-2
Request 3 → Web-3
Request 4 → Web-1 (cycles back)
- Simple, even distribution
- Doesn’t account for server load differences
2. Least Connections
Web-1: 45 active connections → Choose this
Web-2: 67 active connections
Web-3: 52 active connections
- Better for long-lived connections (WebSockets)
- Overhead: LB must track connection counts
3. IP Hash
hash(client_ip) % num_servers = target_server
- Session affinity without sticky sessions
- Problem: Uneven distribution if traffic from NAT/proxy
4. Weighted Round Robin (Production Standard)
Web-1: weight=3 (newer, 16 cores)
Web-2: weight=2 (older, 8 cores)
Web-3: weight=2 (older, 8 cores)
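A minimal weighted round-robin selector can be sketched in a few lines. The server names and weights mirror the example above; this naive version expands each server into the rotation in proportion to its weight.

```python
import itertools

# Weights mirror the example: web-1 is the newer 16-core box.
SERVERS = [("web-1", 3), ("web-2", 2), ("web-3", 2)]

def weighted_cycle(servers):
    """Yield servers forever, each appearing in proportion to its weight."""
    expanded = [name for name, weight in servers for _ in range(weight)]
    return itertools.cycle(expanded)

lb = weighted_cycle(SERVERS)
first_seven = [next(lb) for _ in range(7)]
# Over any 7 requests: web-1 gets 3, web-2 and web-3 get 2 each
```

Note the design tradeoff: this naive expansion sends consecutive bursts to the same server; production balancers (Nginx, HAProxy) use a smooth weighted round-robin that interleaves selections while preserving the same long-run ratios.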
Session Management (Critical for Stateless Design)
Option 1: Sticky Sessions
upstream backend {
ip_hash;
server web1.example.com;
server web2.example.com;
}
- Pros: Simple, no code changes
- Cons: Uneven load, loses sessions on server failure
Option 2: Centralized Session Store (Recommended)
// Node.js with Redis sessions
const session = require('express-session');
const RedisStore = require('connect-redis')(session);
app.use(session({
store: new RedisStore({
host: 'redis-cluster.internal',
port: 6379,
ttl: 86400 // 24 hours
}),
secret: process.env.SESSION_SECRET, // keep secrets out of source
resave: false
}));
- Stores sessions in Redis cluster
- Any web server can handle any request
- Session lookup: 1-2ms (vs 50ms database query)
Health Checks and Auto-Recovery
# AWS ALB Target Group Health Check
HealthCheckProtocol: HTTP
HealthCheckPath: /health
HealthCheckIntervalSeconds: 30
HealthyThresholdCount: 2 # 2 successes = healthy
UnhealthyThresholdCount: 3 # 3 failures = unhealthy
Timeout: 5
Application Health Endpoint:
app.get('/health', async (req, res) => {
const checks = {
database: await checkDatabaseConnection(),
redis: await checkRedisConnection(),
diskSpace: checkDiskSpace() > 10 // >10% free
};
const healthy = Object.values(checks).every(v => v === true);
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'unhealthy',
checks
});
});
Bottlenecks
- Database becomes bottleneck - All web servers hit single DB
- No read scaling - Master handles all reads and writes
- Database CPU at 70-80% during peak hours
- Lock contention on hot tables (users, orders)
Stage 4: Database Replication - Master-Slave (100K-500K Users)
Architecture
Load Balancer
│
┌────────┼────────┐
│ │ │
Web-1 Web-2 Web-3
│ │ │
└────────┼────────┘
│
┌───────┴───────┐
│ │
┌────▼────┐ ┌─────▼─────┐
│ Master │───→│ Replica-1 │ (Async Replication)
│ (Write) │ │ (Read) │
│ │───→│ Replica-2 │
│ │ │ (Read) │
└─────────┘ └────────────┘
16 cores 8 cores each
Capacity Analysis
- Master DB: 16 cores, 128GB RAM, 1TB SSD (handles all writes)
- Read Replicas: 2-5 instances, 8 cores, 64GB RAM each
- Write Capacity: 2K-5K writes/sec (limited by master)
- Read Capacity: 10K-50K reads/sec (distributed across replicas)
- Typical Ratio: 80% reads, 20% writes
- Cost: $2K-4K/month (Master: $800, Replicas: $400 each)
Replication Configuration
PostgreSQL Streaming Replication:
# postgresql.conf (Master)
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = off # Async for performance
# Replica: postgresql.conf + standby.signal (PostgreSQL 12+; older versions used recovery.conf)
primary_conninfo = 'host=master-db port=5432 user=replicator'
hot_standby = on
MySQL Replication:
-- Master configuration
SET GLOBAL server_id = 1;
SET GLOBAL binlog_format = 'ROW'; -- Safer than STATEMENT
SET GLOBAL sync_binlog = 1; -- Durability
-- On Replica
CHANGE MASTER TO
MASTER_HOST='master-db.internal',
MASTER_USER='replicator',
MASTER_PASSWORD='secure_password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE; -- on MySQL 8.0.22+, prefer START REPLICA
Application-Level Read/Write Splitting
Approach 1: Manual Routing
# Django example: route each query explicitly with .using()
user = User.objects.using('replica').get(id=123) # Read from replica
user.email = 'new@email.com'
user.save(using='master') # Write to master
Approach 2: Automatic Routing (Recommended)
# Database router (e.g., myapp/routers.py)
import random

class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        return random.choice(['replica1', 'replica2'])

    def db_for_write(self, model, **hints):
        return 'master'
# settings.py
DATABASE_ROUTERS = ['myapp.routers.ReadReplicaRouter']
DATABASES = {
'master': {
'ENGINE': 'django.db.backends.postgresql',
'HOST': 'master-db.internal',
'NAME': 'production',
},
'replica1': {
'ENGINE': 'django.db.backends.postgresql',
'HOST': 'replica1-db.internal',
'NAME': 'production',
},
'replica2': { ... }
}
Replication Lag Management
Monitoring Replication Lag:
-- PostgreSQL
SELECT
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds
FROM pg_stat_replication;
Handling Replication Lag in Application:
async function getUserProfile(userId) {
// Write user activity
await masterDB.query('UPDATE users SET last_seen = NOW() WHERE id = $1', [userId]);
// Critical read: use master to avoid stale data
const user = await masterDB.query('SELECT * FROM users WHERE id = $1', [userId]);
// Non-critical read: use replica (may be 100-500ms behind)
const posts = await replicaDB.query('SELECT * FROM posts WHERE user_id = $1', [userId]);
return { user, posts };
}
Production Consideration:
- Acceptable lag: 100ms-1s for most applications
- Alert threshold: >5 seconds (indicates problem)
- Failover threshold: >30 seconds (promote replica to master)
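The three thresholds above can be folded into one classification function that a monitoring job runs against the lag measured by the `pg_stat_replication` query. The tier names are illustrative.

```python
def classify_lag(lag_seconds):
    """Map measured replication lag to the alerting tiers above."""
    if lag_seconds > 30:
        return "failover"   # promote a replica to master
    if lag_seconds > 5:
        return "alert"      # page on-call, investigate
    if lag_seconds > 1:
        return "degraded"   # route more critical reads to the master
    return "ok"             # 100ms-1s is acceptable for most applications
```

A cron or sidecar would poll lag every few seconds, feed it through `classify_lag`, and trigger the corresponding action.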
Failover Strategy
Automatic Failover with AWS RDS:
Master fails → RDS detects (30-60 seconds)
→ Promotes read replica to master (60-120 seconds)
→ Updates DNS endpoint (30 seconds)
Total downtime: 2-3 minutes
Manual Failover:
# Promote replica to master
pg_ctl promote -D /var/lib/postgresql/data
# Update application config
# Point writes to new master
# Old master (if recovered) becomes replica
Bottlenecks
- Master write capacity - Single master limits write throughput
- Replication lag - Read replicas 100ms-1s behind master
- Storage growth - 1TB fills up in 6-12 months at 100GB/month
- Hot partitions - Some tables grow to 100M+ rows
Stage 5: Caching Layer (500K-2M Users)
Architecture
Load Balancer
│
┌─────┼─────┐
│ │ │
Web-1 Web-2 Web-3
│ │ │
└─────┼─────┘
│
┌─────┴─────┐
│ │
┌────▼────┐ ┌──▼────┐
│ Redis │ │ DB │
│ Cluster │ │Master │
│ (Cache) │ │ + │
└─────────┘ │Replica│
└───────┘
Redis Cache Architecture
Single Instance (Development):
- Capacity: 10K-50K ops/sec
- Memory: 8-32GB
- Use case: Small apps, development
Redis Cluster (Production):
┌─────────────────────────────────────┐
│ Redis Cluster │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Master│ │Master│ │Master│ │
│ │ 0-5K │ │5K-10K│ │10K-16│ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ ┌──▼───┐ ┌─▼────┐ ┌─▼────┐ │
│ │Slave │ │Slave │ │Slave │ │
│ └──────┘ └──────┘ └──────┘ │
└─────────────────────────────────────┘
Configuration:
# redis.conf for production
maxmemory 64gb
maxmemory-policy allkeys-lru # Evict least recently used
save "" # Disable RDB snapshots (use AOF)
appendonly yes # Enable AOF persistence
appendfsync everysec # Fsync every second (balance)
# Cluster mode
cluster-enabled yes
cluster-node-timeout 5000
cluster-replica-validity-factor 0
Capacity Analysis
- Redis Cluster: 6 nodes (3 masters + 3 replicas), 64GB each
- Cache Hit Ratio: 70-90% (target: >85%)
- Throughput: 500K-1M ops/sec
- Latency: <1ms (P50), <5ms (P99)
- Cost: $1K-2K/month (AWS ElastiCache r6g.2xlarge)
Caching Strategies
1. Cache-Aside (Lazy Loading)
def get_user(user_id):
# Try cache first
cached_user = redis.get(f'user:{user_id}')
if cached_user:
return json.loads(cached_user) # Cache hit
# Cache miss: query database
user = db.query('SELECT * FROM users WHERE id = %s', [user_id])
# Store in cache (TTL: 1 hour)
redis.setex(f'user:{user_id}', 3600, json.dumps(user))
return user
2. Write-Through Cache
def update_user(user_id, data):
# Update database
db.query('UPDATE users SET email = %s WHERE id = %s', [data['email'], user_id])
# Update cache immediately
user = db.query('SELECT * FROM users WHERE id = %s', [user_id])
redis.setex(f'user:{user_id}', 3600, json.dumps(user))
return user
3. Write-Behind Cache (Write-Back)
# Buffer writes in Redis, async flush to DB
def update_user_async(user_id, data):
# Update cache immediately
redis.hset(f'user:{user_id}', mapping=data)
redis.sadd('dirty_users', user_id) # Mark for flush
# Background worker flushes every 5 seconds
# Benefit: 10x faster writes, risk: potential data loss
Cache Invalidation Strategies
Time-Based Expiration:
# TTL based on data change frequency
redis.setex('trending_posts', 60, data) # 1 min (changes frequently)
redis.setex('user_profile:123', 3600, data) # 1 hour (moderate)
redis.setex('country_list', 86400, data) # 24 hours (rarely changes)
Event-Based Invalidation:
def update_user_email(user_id, new_email):
    # Fetch the old email first: needed to invalidate the secondary index
    old_email = db.query('SELECT email FROM users WHERE id = %s', [user_id])
    # Update database
    db.query('UPDATE users SET email = %s WHERE id = %s', [new_email, user_id])
    # Invalidate cache
    redis.delete(f'user:{user_id}')
    redis.delete(f'user:email:{old_email}')  # Secondary index
Cache Stampede Prevention:
import time
def get_expensive_data(key):
# Try cache
data = redis.get(key)
if data:
return data
# Use lock to prevent multiple simultaneous DB queries
lock_key = f'{key}:lock'
if redis.set(lock_key, '1', nx=True, ex=10): # 10 sec lock
try:
# This thread computes and caches
data = expensive_database_query()
redis.setex(key, 3600, data)
return data
finally:
redis.delete(lock_key)
else:
# Other threads wait for cache to populate
time.sleep(0.1)
return get_expensive_data(key) # Retry
What to Cache
High-Value Cache Targets:
- User sessions (100% cache hit rate)
- User profiles (90% hit rate, 1-hour TTL)
- Product catalog (85% hit rate, 5-min TTL)
- API responses (70% hit rate, 1-min TTL)
- Computed aggregations (95% hit rate, 10-min TTL)
Don’t Cache:
- Financial transactions (requires DB accuracy)
- Real-time inventory (stale data causes overselling)
- Personalized recommendations (too many unique keys)
Cache Performance Metrics
# Monitor cache effectiveness
cache_hits = redis.info('stats')['keyspace_hits']
cache_misses = redis.info('stats')['keyspace_misses']
hit_ratio = cache_hits / (cache_hits + cache_misses)
# Alert if hit ratio < 80%
# Indicates poor caching strategy or TTL tuning needed
Bottlenecks After Caching
- Static assets (images, CSS, JS) still served from origin
- Geographic latency for international users (100-300ms)
- Network bandwidth costs increasing (500GB-5TB/month)
- Cache memory pressure at 80-90% capacity
Stage 6: CDN for Static Assets (2M-5M Users)
Architecture
┌──────────────────┐
User (US) ────→│ CDN Edge (US) │ (Cache Hit: 95%)
└────────┬─────────┘
│ (5% cache miss)
↓
┌────────────────────┐
User (EU) ────→│ CDN Edge (EU) │
└────────┬───────────┘
│
↓
┌────────────────────┐
│ Origin Server │
│ (Your Web Server) │
└─────────────────────┘
CDN Capacity Analysis
- CDN Providers: CloudFront, Cloudflare, Fastly, Akamai
- Edge Locations: 200-400 global points of presence (PoPs)
- Cache Hit Ratio: 85-95% for static assets
- Latency Reduction: 200ms → 20-50ms (4-10x improvement)
- Bandwidth Offload: 70-90% of traffic served from edge
- Cost: $0.08-0.15/GB (decreases with volume commits)
CDN Configuration
CloudFront Distribution:
{
"Origins": [{
"DomainName": "origin.example.com",
"CustomHeaders": [{
"HeaderName": "X-Origin-Verify",
"HeaderValue": "secret-token"
}]
}],
"DefaultCacheBehavior": {
"ViewerProtocolPolicy": "redirect-to-https",
"AllowedMethods": ["GET", "HEAD", "OPTIONS"],
"CachedMethods": ["GET", "HEAD"],
"Compress": true,
"DefaultTTL": 86400, // 24 hours
"MinTTL": 0,
"MaxTTL": 31536000, // 1 year
"ForwardedValues": {
"QueryString": false,
"Cookies": { "Forward": "none" }
}
}
}
Cache-Control Headers:
// Express.js middleware for static assets
app.use('/static', express.static('public', {
maxAge: '1y', // Browser + CDN cache for 1 year
immutable: true
}));
// Dynamic content with versioned URLs
app.get('/api/config', (req, res) => {
res.set('Cache-Control', 'public, max-age=300'); // 5 min CDN cache
res.json({ version: '2.0', features: [] });
});
// User-specific content
app.get('/api/profile', authenticate, (req, res) => {
res.set('Cache-Control', 'private, no-store'); // No caching
res.json({ user: req.user });
});
Asset Optimization
Image Optimization:
// Next.js Image component (automatic optimization)
<Image
src="/large-photo.jpg"
width={800}
height={600}
formats={['image/avif', 'image/webp']} // Modern formats
quality={85} // 85% quality (optimal tradeoff)
/>
// Savings: 3MB JPEG → 300KB WebP (10x reduction)
Code Splitting and Bundling:
// Webpack code splitting
import(/* webpackChunkName: "dashboard" */ './Dashboard')
.then(module => {
const Dashboard = module.default;
// Load dashboard code only when needed
});
// Result:
// - Initial bundle: 500KB → 100KB
// - Dashboard chunk: 400KB (loaded on demand)
CDN Cache Invalidation
Versioned URLs (Recommended):
<!-- Cache forever, change URL for new version -->
<script src="/static/js/app.v2.7.3.min.js"></script>
<link rel="stylesheet" href="/static/css/main.abc123.css">
Explicit Invalidation:
# CloudFront invalidation (costs $0.005 per path)
aws cloudfront create-invalidation \
--distribution-id E1234567890 \
--paths "/css/*" "/js/*"
# Cloudflare purge (free, instant)
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
-H "Authorization: Bearer {token}" \
-d '{"files":["https://example.com/css/main.css"]}'
Multi-Tier Caching Strategy
Browser Cache (1 hour)
↓ miss
CDN Edge Cache (24 hours)
↓ miss
CDN Shield/Origin (no cache)
↓ miss
Application Server (Redis: 1 hour)
↓ miss
Database
Effective TTL Stacking:
- HTML: 5 min (frequently updated)
- CSS/JS: 1 year (versioned URLs)
- Images: 1 week (semi-static)
- API responses: 1-5 min (data freshness)
Cost Optimization
Without CDN:
- Origin bandwidth: 10TB/month × $0.09/GB = $900
With CDN (90% cache hit rate):
- CDN bandwidth: 10TB × $0.085/GB = $850
- Origin bandwidth: 1TB × $0.09/GB = $90
- Total: $940
Net effect: Similar cost, but 10x better performance
At scale (100TB/month), volume discounts tip the balance: roughly $5K with a CDN vs $9K without
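The arithmetic above generalizes into a small cost model. The per-GB rates are the ones used in this section; real CDN pricing is tiered and drops with volume commitments.

```python
def monthly_cost(total_gb, hit_rate, cdn_per_gb=0.085, origin_per_gb=0.09):
    """Total egress cost with a CDN in front of the origin."""
    cdn = total_gb * cdn_per_gb                          # all traffic exits the CDN
    origin = total_gb * (1 - hit_rate) * origin_per_gb   # only misses hit origin
    return cdn + origin

without_cdn = 10_000 * 0.09        # 10TB/month served straight from origin
with_cdn = monthly_cost(10_000, 0.90)
```

Plugging in the section's numbers reproduces the comparison: ~$900 without a CDN vs ~$940 with one at 10TB/month — similar cost, far better latency. Raising `hit_rate` or negotiating a lower `cdn_per_gb` is what makes the CDN cheaper at scale.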
Stage 7: Database Sharding (5M-20M Users)
When to Shard
- Single database exceeds 1TB (query performance degrades)
- Write throughput exceeds 10K writes/sec (master bottleneck)
- Replication lag consistently >5 seconds
- Tables with >500M rows causing slow queries even with indexes
Sharding Strategies
1. Horizontal Partitioning by User ID (Hash-Based)
Users Table (100M rows):
┌──────────┬───────────────┐
│ Shard 1 │ user_id % 4=0│
│ (25M) │ 0,4,8,12... │
├──────────┼───────────────┤
│ Shard 2 │ user_id % 4=1│
│ (25M) │ 1,5,9,13... │
├──────────┼───────────────┤
│ Shard 3 │ user_id % 4=2│
│ (25M) │ 2,6,10,14... │
├──────────┼───────────────┤
│ Shard 4 │ user_id % 4=3│
│ (25M) │ 3,7,11,15... │
└──────────┴───────────────┘
Implementation:
def get_shard_for_user(user_id, num_shards=4):
return user_id % num_shards
def get_user_posts(user_id):
shard_id = get_shard_for_user(user_id)
db = get_database_connection(f'shard_{shard_id}')
return db.query('SELECT * FROM posts WHERE user_id = %s', [user_id])
2. Range-Based Sharding (Geographic)
┌────────────┬──────────────────┐
│ Shard US │ country='US' │
│ (40M) │ lat: 25-50°N │
├────────────┼──────────────────┤
│ Shard EU │ country IN(...) │
│ (30M) │ lat: 35-60°N │
├────────────┼──────────────────┤
│ Shard ASIA│ country IN(...) │
│ (20M) │ lat: 0-50°N │
└────────────┴──────────────────┘
3. Directory-Based Sharding (Flexible)
┌──────────────────────────────┐
│ Shard Lookup Service │
│ user_id → shard_id mapping │
└──────────────────────────────┘
│
├─→ Shard 1: users 0-10M
├─→ Shard 2: users 10M-20M
├─→ Shard 3: users 20M-30M
└─→ Shard 4: VIP users (premium tier)
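A sketch of the directory approach: an explicit user_id → shard mapping, typically backed by a small, heavily cached lookup table. Shard names, the 10M-per-shard ranges, and the VIP rule are illustrative.

```python
DIRECTORY = {}  # user_id -> shard name (the "lookup service" state)

def assign_shard(user_id, is_vip=False):
    """Range-assign normal users; pin premium-tier users to a dedicated shard."""
    if is_vip:
        shard = "shard4-vip"
    else:
        shard = f"shard{user_id // 10_000_000 + 1}"  # 10M users per shard
    DIRECTORY[user_id] = shard
    return shard

def get_shard(user_id):
    """Every query does one directory lookup before hitting the data shard."""
    return DIRECTORY[user_id]
```

The flexibility is the point: moving one user (or one tier) is just a directory update, with no rehashing. The price is an extra lookup hop on every query, so the directory must be cached aggressively and is itself a component to keep highly available.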
Cross-Shard Queries Challenge
Problem:
-- This query requires scanning ALL shards
SELECT * FROM posts
WHERE created_at > '2024-01-01'
ORDER BY created_at DESC
LIMIT 100;
Solution 1: Scatter-Gather
from concurrent.futures import ThreadPoolExecutor

def get_recent_posts_global(limit=100):
shards = get_all_shards()
futures = []
# Query all shards in parallel
with ThreadPoolExecutor(max_workers=8) as executor:
for shard in shards:
future = executor.submit(
shard.query,
'SELECT * FROM posts WHERE created_at > %s ORDER BY created_at DESC LIMIT %s',
['2024-01-01', limit]
)
futures.append(future)
# Merge results
all_posts = []
for future in futures:
all_posts.extend(future.result())
# Sort and limit
return sorted(all_posts, key=lambda x: x['created_at'], reverse=True)[:limit]
Solution 2: Denormalization (Recommended)
- Maintain separate "global feed" table in dedicated database
- Use Change Data Capture (CDC) to populate from shards
- Trade consistency for query performance
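The denormalization approach can be sketched as a worker that consumes change events from every shard (in production via a CDC tool such as Debezium) and maintains one global feed table. The event shape and the in-memory store below are illustrative stand-ins for that pipeline.

```python
GLOBAL_FEED = []  # stand-in for the dedicated "global feed" database

def apply_cdc_event(event):
    """Apply a post-insert event from any shard to the global feed."""
    if event["op"] == "insert" and event["table"] == "posts":
        GLOBAL_FEED.append(event["row"])

def recent_posts(limit=100):
    """Newest posts first - one store, no scatter-gather across shards."""
    return sorted(GLOBAL_FEED, key=lambda r: r["created_at"], reverse=True)[:limit]

# Events arriving from two different shards
apply_cdc_event({"op": "insert", "table": "posts",
                 "row": {"id": 1, "shard": 2, "created_at": "2024-01-02"}})
apply_cdc_event({"op": "insert", "table": "posts",
                 "row": {"id": 2, "shard": 0, "created_at": "2024-01-05"}})
```

The feed lags the shards by the CDC pipeline's latency (typically sub-second to seconds) — exactly the consistency-for-performance trade the bullet points describe.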
Shard Rebalancing
Adding New Shard (Consistent Hashing):
# Before: 4 shards (0-3)
shard = user_id % 4
# After: 5 shards (0-4)
shard = user_id % 5
# Problem: 80% of keys need to move!
# Solution: Consistent hashing
Consistent Hashing Implementation:
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node gets many virtual points for even distribution
        for i in range(self.virtual_nodes):
            key = self._hash(f'{node}:{i}')
            self.ring[key] = node
            self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None
        hash_key = self._hash(key)
        # Binary search for the first virtual node clockwise from the hash
        idx = bisect.bisect(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # Wrap around the ring
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
# Usage
shards = ConsistentHash(['shard1', 'shard2', 'shard3', 'shard4'])
shard = shards.get_node(f'user:{user_id}')
# Add new shard: only ~20% of keys need to move
shards.add_node('shard5')
Shard Management at Scale (10+ Shards)
Vitess (MySQL Sharding):
# Vitess topology
keyspaces:
- name: users
sharded: true
shards:
- name: "-80" # Hash range 0x00-0x80
master: shard1-master
replicas: [shard1-replica1, shard1-replica2]
- name: "80-" # Hash range 0x80-0xFF
master: shard2-master
replicas: [shard2-replica1, shard2-replica2]
Citus (PostgreSQL Sharding):
-- Designate distributed tables
SELECT create_distributed_table('posts', 'user_id');
SELECT create_distributed_table('comments', 'user_id');
-- Citus automatically routes queries to correct shard
SELECT * FROM posts WHERE user_id = 12345;
Sharding Tradeoffs
Benefits:
- Near-linear write scalability (4 shards = 4x write capacity)
- Reduced blast radius (one shard failure affects 25% of users)
- Cost optimization (smaller databases are cheaper to maintain)
Challenges:
- Application complexity (shard routing logic)
- Cross-shard transactions require 2PC (slow, avoid if possible)
- Resharding is expensive (hours to days of migration)
- Monitoring complexity (4x databases to monitor)
Production Capacity
- Per Shard: 16 cores, 128GB RAM, 1TB SSD
- Total Capacity: 4-16 shards = 40K-150K writes/sec
- Data Size: 4-16TB total (1TB per shard)
- Cost: $10K-40K/month depending on shard count
Stage 8: Message Queues and Async Processing (20M+ Users)
Architecture
Web Servers
│
├──→ [Sync Request] ──→ Database
│
└──→ [Async Task] ──→ Message Queue
│
┌───────────────────┼───────────────┐
↓ ↓ ↓
Worker-1 Worker-2 Worker-3
(Email) (Image Process) (Analytics)
When to Use Async Processing
- Long-running tasks (>1 second response time)
- Non-critical operations (can tolerate delay)
- Resource-intensive (CPU/memory heavy)
- Third-party API calls (unreliable, slow)
- Batch operations (bulk updates, reports)
Message Queue Technologies
RabbitMQ (Traditional):
- Throughput: 10K-50K messages/sec per node
- Latency: 5-20ms
- Persistence: Durable queues with disk backing
- Features: Complex routing, priority queues
- Cost: $200-500/month (3-node cluster)
Amazon SQS (Serverless):
- Throughput: Unlimited (managed service)
- Latency: 10-50ms
- Persistence: 14-day retention
- Features: Dead-letter queues, FIFO queues
- Cost: $0.40 per million requests ($400 for 1B messages)
Apache Kafka (High-Throughput):
- Throughput: 1M+ messages/sec per cluster
- Latency: <10ms (P99)
- Persistence: Days to weeks (configurable)
- Features: Pub/sub, stream processing, replay
- Cost: $1K-5K/month (3-node cluster)
Implementation Examples
Email Processing with SQS:
# Producer (web server)
import boto3
sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456/email-queue'
@app.post('/register')
def register_user(email, password):
# Synchronous: create user in database
user = User.create(email=email, password=hash_password(password))
# Asynchronous: send welcome email (non-blocking)
sqs.send_message(
QueueUrl=queue_url,
MessageBody=json.dumps({
'user_id': user.id,
'email': email,
'template': 'welcome'
})
)
return {'status': 'success', 'user_id': user.id}
# Consumer (worker process)
while True:
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
for message in messages.get('Messages', []):
try:
data = json.loads(message['Body'])
send_email(data['email'], template=data['template'])
# Delete message after processing
sqs.delete_message(
QueueUrl=queue_url,
ReceiptHandle=message['ReceiptHandle']
)
except Exception as e:
# Message will return to queue after visibility timeout
logger.error(f'Failed to process message: {e}')
Image Processing with Kafka:
# Producer
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
@app.post('/upload')
def upload_image(file):
# Save original image to S3
s3_key = upload_to_s3(file)
# Queue thumbnail generation
producer.send('image-processing', {
'image_id': generate_id(),
's3_key': s3_key,
'operations': ['thumbnail', 'compress', 'watermark']
})
return {'status': 'processing', 's3_key': s3_key}
# Consumer
from kafka import KafkaConsumer
from PIL import Image
consumer = KafkaConsumer(
'image-processing',
bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
group_id='image-workers',
auto_offset_reset='earliest'
)
for message in consumer:
data = json.loads(message.value)
# Download image
image = download_from_s3(data['s3_key'])
# Process
if 'thumbnail' in data['operations']:
thumbnail = Image.open(image).resize((200, 200))
upload_to_s3(thumbnail, f"{data['image_id']}_thumb.jpg")
# Takes 5-10 seconds, doesn't block web server
Queue Patterns
1. Task Queue (Work Distribution)
Producer → Queue → [Worker-1, Worker-2, Worker-3]
(Compete for tasks)
2. Pub/Sub (Fanout)
Publisher → Topic → [Subscriber-A: Email Service]
→ [Subscriber-B: Analytics]
→ [Subscriber-C: Audit Log]
3. Priority Queue
# RabbitMQ with priority
channel.basic_publish(
exchange='',
routing_key='tasks',
body=message,
properties=pika.BasicProperties(priority=5) # 0-10, higher = more urgent
)
Retry and Error Handling
Exponential Backoff:
def process_with_retry(task, max_retries=3):
for attempt in range(max_retries):
try:
return execute_task(task)
except Exception as e:
if attempt == max_retries - 1:
# Send to dead-letter queue
dlq.send(task)
logger.error(f'Task failed after {max_retries} attempts: {e}')
return
# Exponential backoff: 1s, 2s, 4s, 8s...
sleep_time = 2 ** attempt
time.sleep(sleep_time)
Dead-Letter Queue (DLQ):
# SQS DLQ configuration
{
"RedrivePolicy": {
"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456:failed-tasks",
"maxReceiveCount": 3 # After 3 failures, move to DLQ
}
}
# Monitor DLQ for issues
def monitor_dlq():
    attrs = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=['ApproximateNumberOfMessages']
    )
    dlq_size = int(attrs['Attributes']['ApproximateNumberOfMessages'])
    if dlq_size > 100:
        alert_ops_team('High DLQ size, investigate failures')
Capacity Planning
- Queue Size: Monitor queue depth (alert if >10K messages)
- Worker Scaling: Auto-scale based on queue length
- Throughput: 10K-100K tasks/sec with proper parallelization
- Cost: $500-2K/month (queue + workers)
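Auto-scaling workers on queue length, as the bullets describe, reduces to deriving a desired worker count from queue depth. The per-worker rate, drain target, and bounds below are illustrative parameters an autoscaler policy would expose.

```python
import math

def desired_workers(queue_depth, per_worker_rate=50, drain_target_sec=60,
                    min_workers=2, max_workers=100):
    """Enough workers to drain the current backlog within drain_target_sec.

    per_worker_rate: tasks/sec one worker sustains (measure, don't guess).
    Bounds keep a warm floor and cap runaway scale-out.
    """
    needed = math.ceil(queue_depth / (per_worker_rate * drain_target_sec))
    return max(min_workers, min(max_workers, needed))
```

For example, a 30K-message backlog at 50 tasks/sec per worker and a 60-second drain target calls for 10 workers; an empty queue falls back to the warm floor of 2.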
Stage 9: Microservices Architecture (50M+ Users)
Monolith to Microservices Evolution
Monolith (Simple, Single Deployment):
┌───────────────────────────────────┐
│ Monolithic Application │
│ ┌─────────────────────────────┐ │
│ │ Auth │ Users │ Posts │ Pay │ │
│ └─────────────────────────────┘ │
│ Single Database │
└───────────────────────────────────┘
Microservices (Distributed, Independent):
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Auth │ │ Users │ │ Posts │ │ Payment │
│ Service │ │ Service │ │ Service │ │ Service │
│ ↓ │ │ ↓ │ │ ↓ │ │ ↓ │
│ Auth DB │ │ Users DB │ │ Posts DB │ │ Pay DB │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
When to Adopt Microservices
- Team size >50 engineers (coordination overhead)
- Monolith >500K lines of code (deploy takes >30 min)
- Need independent scaling (payment spikes during Black Friday)
- Different tech stacks (ML service in Python, core in Java)
- Regulatory requirements (isolate PCI-compliant services)
Service Decomposition Strategy
By Business Domain (DDD):
┌──────────────────┐
│ User Service │ - Authentication, profiles, preferences
│ (10 endpoints) │ - Team: 8 engineers
└──────────────────┘
┌──────────────────┐
│ Content Service │ - Posts, comments, likes
│ (15 endpoints) │ - Team: 12 engineers
└──────────────────┘
┌──────────────────┐
│ Payment Service │ - Subscriptions, transactions, billing
│ (8 endpoints) │ - Team: 6 engineers (PCI compliance)
└──────────────────┘
Service Communication
1. Synchronous (REST API):
# User service calls Payment service
import requests
def get_user_with_subscription(user_id):
user = db.query('SELECT * FROM users WHERE id = %s', [user_id])
# HTTP call to payment service
response = requests.get(
f'http://payment-service.internal/subscriptions/{user_id}',
timeout=2 # Critical: set timeouts
)
if response.status_code == 200:
user['subscription'] = response.json()
return user
Problem: Cascading Failures
User Service (99.9% uptime)
↓ calls
Payment Service (99.9% uptime)
↓ calls
Email Service (99.9% uptime)
Combined uptime: 99.9% × 99.9% × 99.9% = 99.7%
More dependencies = worse reliability
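Beyond going asynchronous, the standard containment for synchronous call chains is a circuit breaker: after a run of errors, calls to the sick dependency fail fast instead of tying up threads waiting on timeouts. A minimal sketch (thresholds are illustrative; production code would use a library such as pybreaker or the service-mesh support covered later in this stage):

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one probe call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0              # any success closes the circuit
        return result
```

Wrapping the `requests.get` call to the payment service in `breaker.call(...)` turns a hung downstream into an immediate, handleable error in the user service, breaking the multiplication of failure probabilities shown above.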
2. Asynchronous (Event-Driven):
# User service publishes event
import json
from datetime import datetime
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def create_user(email, password):
    user = User.create(email=email, password=password)
    # Publish event (fire and forget)
    producer.send('user.created', {
        'user_id': user.id,
        'email': email,
        'timestamp': datetime.utcnow().isoformat()
    })
    return user

# Payment service subscribes to events
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'user.created',
    bootstrap_servers=['kafka:9092'],
    group_id='payment-service'
)
for message in consumer:
    event = json.loads(message.value)
    # Create free trial subscription
    create_trial_subscription(event['user_id'])
Benefits:
- Loose coupling (services don’t directly depend on each other)
- Better fault tolerance (payment service down doesn’t block user creation)
- Audit trail (event log is source of truth)
API Gateway Pattern
┌──────────────────┐
Client ─────────→ │ API Gateway │
│ (Kong/AWS ALB) │
└────────┬─────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Auth │ │ Users │ │ Posts │
│ Service │ │ Service │ │ Service │
└─────────┘ └─────────┘ └─────────┘
API Gateway Responsibilities:
- Authentication/Authorization
- Rate limiting (1000 req/min per user)
- Request routing (/api/users/* → User Service)
- Response aggregation (combine multiple service calls)
- Protocol translation (HTTP → gRPC)
Kong Configuration:
services:
  - name: user-service
    url: http://users.internal:8080
    routes:
      - name: user-routes
        paths:
          - /api/users
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
      - name: jwt
        config:
          claims_to_verify:
            - exp
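The rate-limiting responsibility above is conceptually a token bucket: each user accrues request tokens at a fixed rate, up to a burst cap. A minimal Python sketch of the idea (names and limits are illustrative, not Kong's implementation):

```python
import time

class TokenBucket:
    """Allows `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 1000 requests/minute ≈ 16.7 req/sec, allowing short bursts of 50
limiter = TokenBucket(rate=1000 / 60, capacity=50)
```

In production the gateway keeps one bucket per user (typically in Redis) so the limit holds across all gateway instances.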
Service Mesh (Advanced)
Istio/Linkerd for Service-to-Service Communication:
┌──────────┐ ┌──────────┐
│ Service │ │ Service │
│ A │ ←────→ │ B │
│ ↕️ │ │ ↕️ │
│ Sidecar │ ← TLS → │ Sidecar │ (Automatic encryption)
└──────────┘ └──────────┘
Features:
- Automatic mTLS (mutual TLS encryption)
- Traffic management (canary deployments, A/B testing)
- Observability (distributed tracing with Jaeger)
- Circuit breaking (prevent cascading failures)
Circuit Breaker Configuration:
# Istio DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Data Consistency in Microservices
Problem: Distributed Transactions
User creates order:
1. Order Service: Create order (SUCCESS)
2. Payment Service: Charge card (FAIL)
3. Inventory Service: Decrease stock (PENDING)
Result: Inconsistent state
Solution: Saga Pattern
# Orchestration-based saga
class OrderSaga:
    def execute(self, order_data):
        try:
            # Step 1: Create order
            order = order_service.create_order(order_data)
            # Step 2: Reserve inventory
            inventory_service.reserve(order.items)
            # Step 3: Process payment
            payment = payment_service.charge(order.user_id, order.total)
            # Step 4: Confirm order
            order_service.confirm_order(order.id)
        except PaymentFailed:
            # Compensating transaction: unreserve inventory
            inventory_service.unreserve(order.items)
            order_service.cancel_order(order.id)
            raise
Microservices Tradeoffs
Benefits:
- Independent deployability (deploy 10x per day)
- Technology diversity (use best tool for job)
- Team autonomy (reduce coordination)
- Fault isolation (one service failure doesn’t crash entire system)
Challenges:
- Operational complexity (10 services = 10 deployments, 10 databases)
- Distributed debugging (trace request across services)
- Data consistency (eventual consistency)
- Network overhead (inter-service calls add 10-50ms latency)
- Cost: 20-30% higher infrastructure costs vs monolith
Production Scale
- Services: 20-100 microservices
- Engineers: 100-500 (5-10 per service)
- Deployment Frequency: 100-1000 deploys/day
- Infrastructure Cost: $50K-200K/month
Stage 10: Monitoring, Logging, and Observability (Production-Critical)
Three Pillars of Observability
1. Metrics (What is happening?)
System Health:
- CPU: 45% average, 78% P95
- Memory: 62% used, 38% free
- Disk I/O: 1200 IOPS, 80MB/s
- Network: 500 Mbps in, 800 Mbps out
Application:
- Request rate: 5000 req/sec
- Error rate: 0.5% (25 errors/sec)
- Latency: P50=50ms, P95=200ms, P99=500ms
- Database connections: 150/200 in use
2. Logs (Why did it happen?)
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": 456789,
  "message": "Payment declined",
  "error": {
    "code": "insufficient_funds",
    "card_last4": "4242"
  }
}
3. Traces (How did request flow?)
Request ID: abc123
Total Duration: 450ms
1. API Gateway: 2ms
2. Auth Service: 15ms
3. User Service: 50ms
└─ Database query: 45ms
4. Payment Service: 350ms ← SLOW
├─ Stripe API call: 320ms ← ROOT CAUSE
└─ Database write: 25ms
5. Response: 8ms
Monitoring Stack
Prometheus + Grafana (Metrics)
# Prometheus scrape config
scrape_configs:
  - job_name: 'web-servers'
    scrape_interval: 15s
    static_configs:
      - targets: ['web1:9090', 'web2:9090', 'web3:9090']
// Application metrics (Node.js example)
const client = require('prom-client');

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000]
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || 'unknown', res.statusCode)
      .observe(duration);
  });
  next();
});
ELK Stack (Elasticsearch, Logstash, Kibana) for Logs
// Structured logging (best practice)
logger.info('User login', {
  user_id: 123,
  email: 'user@example.com',
  ip: req.ip,
  user_agent: req.headers['user-agent'],
  duration_ms: 45
});

// Avoid unstructured logs
// logger.info(`User ${email} logged in from ${ip}`); // Hard to search
Jaeger/Zipkin (Distributed Tracing)
# OpenTelemetry instrumentation
import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()
tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span('process_order') as span:
        span.set_attribute('order.id', order_id)
        with tracer.start_as_current_span('database_query'):
            order = db.query('SELECT * FROM orders WHERE id = %s', [order_id])
        # This HTTP call is automatically traced
        user = requests.get(f"http://user-service/users/{order['user_id']}").json()
        return order
Key Metrics to Monitor
System Metrics:
# Collect every 15 seconds
import psutil

metrics = {
    'cpu_percent': psutil.cpu_percent(interval=1),
    'memory_percent': psutil.virtual_memory().percent,
    'disk_io_read_bytes': psutil.disk_io_counters().read_bytes,
    'disk_io_write_bytes': psutil.disk_io_counters().write_bytes,
    'network_sent_bytes': psutil.net_io_counters().bytes_sent,
    'network_recv_bytes': psutil.net_io_counters().bytes_recv
}
Application Metrics (RED Method):
- Rate: Requests per second
- Errors: Error rate (percentage or count)
- Duration: Response time (P50, P95, P99)
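As a sketch of how RED numbers are derived from raw request data (the class is illustrative; in production a metrics library such as the prom-client example above does this work):

```python
import bisect

class RedMetrics:
    """Tracks Errors and Duration percentiles for a stream of completed requests.

    Rate is simply `total` divided by the observation window, tracked by the caller.
    """

    def __init__(self):
        self.durations_ms = []  # kept sorted for percentile lookup
        self.errors = 0
        self.total = 0

    def record(self, duration_ms, is_error=False):
        bisect.insort(self.durations_ms, duration_ms)
        self.total += 1
        if is_error:
            self.errors += 1

    def percentile(self, p):
        """Nearest-rank-style percentile; p is an integer 0-100."""
        if not self.durations_ms:
            return 0.0
        idx = min(len(self.durations_ms) - 1, (p * len(self.durations_ms)) // 100)
        return self.durations_ms[idx]

    def error_rate(self):
        return self.errors / self.total if self.total else 0.0
```

Real systems avoid storing every sample by using histogram buckets (as in the Prometheus example), trading a little precision for constant memory.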
Database Metrics:
-- PostgreSQL monitoring queries
SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; -- Active connections
SELECT pg_database_size('production') / 1024 / 1024 / 1024 AS size_gb; -- DB size
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;
Alerting Strategy
Alert Severity Levels:
P0 (Page immediately, 24/7):
- Service completely down (>50% error rate)
- Database unavailable
- Payment processing failures
P1 (Page during business hours):
- High error rate (>5%)
- High latency (P95 > 2 seconds)
- Disk space >90% full
P2 (Create ticket, no page):
- Elevated error rate (>1%)
- Slow queries (>500ms)
- Memory usage >80%
Prometheus Alerting Rules (evaluated by Prometheus, routed through Alertmanager):
groups:
  - name: api-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Error *rate* = 5xx throughput divided by total throughput
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }} (threshold: 5%)"
      - alert: HighLatency
        # histogram_quantile needs the per-bucket rate, not the raw histogram
        expr: histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_ms_bucket[5m]))) > 2000
        for: 10m
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}ms"
Log Aggregation at Scale
Challenges:
- 10K req/sec = 864M requests/day
- Each request logs 3-5 lines = 2.5B log lines/day
- Average log size: 500 bytes = 1.25TB/day
- Retention: 30 days = 37.5TB storage
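These volume figures follow from straightforward arithmetic; a quick sketch that also prices the retained logs at the tiered-storage rates given in this section ($50/TB hot for 7 days, $23/TB warm thereafter):

```python
# Back-of-the-envelope log volume (figures from the text)
req_per_sec = 10_000
requests_per_day = req_per_sec * 86_400   # 864M requests/day
lines_per_day = 2.5e9                     # 3-5 lines/request ≈ 2.5B lines/day
bytes_per_line = 500

tb_per_day = lines_per_day * bytes_per_line / 1e12   # 1.25 TB/day
retained_tb = tb_per_day * 30                        # 37.5 TB at 30-day retention

# Monthly storage cost with tiering: 7 days hot ($50/TB), 23 days warm ($23/TB)
hot_tb, warm_tb = tb_per_day * 7, tb_per_day * 23
monthly_storage = hot_tb * 50 + warm_tb * 23

print(f'{tb_per_day} TB/day, {retained_tb} TB retained, ~${monthly_storage:.0f}/month')
```

Storage is only part of the bill; ingestion and indexing usually dominate, which is what the sampling strategy below attacks.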
Cost Optimization:
# Sample logs: keep 10% of successes but every error
import random

if random.random() < 0.1 or response.status_code >= 400:
    logger.info('Request processed', {
        'path': request.path,
        'method': request.method,
        'status': response.status_code,
        'duration_ms': duration
    })
# Result: ~10x cost reduction, still capture all errors
Tiered Storage:
Hot Storage (1-7 days): Elasticsearch - Fast search, $50/TB/month
Warm Storage (8-30 days): S3 + Athena - Queryable, $23/TB/month
Cold Storage (>30 days): S3 Glacier - Archive, $4/TB/month
Production Monitoring Costs
- Metrics: $500-2K/month (Datadog/New Relic)
- Logs: $1K-5K/month (10TB/month ingestion)
- Traces: $500-1K/month (sample 1-10% of requests)
- Total: $2K-8K/month for comprehensive observability
Stage 11: Auto-Scaling Strategies
Horizontal Auto-Scaling
Metric-Based Scaling (AWS Auto Scaling Group):
AutoScalingGroup:
  MinSize: 3
  MaxSize: 50
  DesiredCapacity: 5
  ScalingPolicies:
    - PolicyName: ScaleUpOnCPU
      MetricName: CPUUtilization
      TargetValue: 70
      ScaleUpCooldown: 300s    # Wait 5 min before scaling up again
    - PolicyName: ScaleUpOnRequests
      MetricName: RequestCountPerTarget
      TargetValue: 1000        # 1000 requests per instance
      ScaleDownCooldown: 600s  # Wait 10 min before scaling down
Predictive Scaling:
# AWS Predictive Scaling analyzes historical data
# Scales BEFORE traffic spikes
Typical pattern:
08:00 - 10:00: Morning spike (+200%)
12:00 - 13:00: Lunch spike (+150%)
18:00 - 20:00: Evening spike (+250%)
Predictive scaling at 07:45: Scale from 5 → 15 instances
(vs reactive scaling at 08:10: Slow, users experience lag)
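The pre-spike scheduling logic can be sketched as a lookup over historical load multipliers (the multipliers below encode the pattern above: +200% → 3x baseline, and so on; all names and values are illustrative):

```python
from datetime import datetime

# Historical load multipliers by hour, derived from the traffic pattern above
HOURLY_MULTIPLIER = {8: 3.0, 9: 3.0, 12: 2.5, 18: 3.5, 19: 3.5}
BASELINE_INSTANCES = 5
LEAD_MINUTES = 15  # start scaling this long before the spike

def desired_capacity(now: datetime) -> int:
    # Look one hour ahead when we're within LEAD_MINUTES of the hour boundary,
    # so capacity is already running when the spike lands
    look_ahead = 1 if now.minute >= 60 - LEAD_MINUTES else 0
    hour = (now.hour + look_ahead) % 24
    return int(BASELINE_INSTANCES * HOURLY_MULTIPLIER.get(hour, 1.0))

# At 07:45 the scheduler already requests capacity for the 08:00 spike
print(desired_capacity(datetime(2024, 1, 15, 7, 45)))
```

AWS Predictive Scaling does the forecasting with ML over weeks of CloudWatch data, but the effect is the same: capacity decisions keyed to predicted, not current, load.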
Kubernetes Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Database Auto-Scaling
Aurora Serverless (AWS):
Engine: aurora-postgresql
MinCapacity: 2 ACU # 2 Aurora Capacity Units = 4GB RAM
MaxCapacity: 64 ACU # 64 ACU = 128GB RAM
AutoPause: true
AutoPauseDelay: 300 # Pause after 5 min of inactivity
Cost:
- Active: $0.06/ACU/hour
- Storage: $0.10/GB/month
- Paused: Only pay for storage
Use case: Dev/test environments, sporadic traffic
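Given the pricing above, a rough monthly cost estimate is easy to sketch (the usage pattern here is a made-up dev-environment example):

```python
# Aurora Serverless pricing from the text
ACU_HOUR_COST = 0.06      # $/ACU/hour while active
STORAGE_GB_MONTH = 0.10   # $/GB/month, charged even while paused

def monthly_cost(avg_acu, active_hours, storage_gb):
    # Paused time incurs storage cost only
    return avg_acu * ACU_HOUR_COST * active_hours + storage_gb * STORAGE_GB_MONTH

# A dev database: 2 ACU, active 8h/day for 22 workdays, 50GB of storage
print(round(monthly_cost(2, 8 * 22, 50), 2))
```

The same database provisioned 24/7 at 2 ACU would cost about $0.06 × 2 × 730 ≈ $88/month plus storage, which is why auto-pause pays off for sporadic workloads.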
Connection Pooling (PgBouncer):
# PgBouncer config
[databases]
production = host=db.internal port=5432

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000   # Accept 1000 client connections
default_pool_size = 25   # But open only 25 server connections to Postgres
reserve_pool_size = 5    # Emergency pool
# Result: 1000 client connections multiplexed over 25 DB connections
# DB connection overhead: ~10MB/conn → 250MB instead of 10GB
Auto-Scaling Best Practices
1. Scale on Multiple Metrics:
Scale UP if:
- CPU > 70% for 5 minutes, OR
- Request rate > 1000/instance, OR
- Error rate > 1%
Scale DOWN if:
- CPU < 30% for 15 minutes, AND
- Request rate < 200/instance, AND
- Error rate < 0.1%
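The OR/AND asymmetry above is deliberate: any one pressure signal should add capacity, but capacity should only be removed when everything is calm, which prevents flapping. A sketch of the decision logic (the "for N minutes" windows would be enforced by the metrics evaluator and are omitted here):

```python
def should_scale_up(cpu_pct, req_per_instance, error_rate):
    # Scale UP if ANY signal shows pressure (OR)
    return cpu_pct > 70 or req_per_instance > 1000 or error_rate > 0.01

def should_scale_down(cpu_pct, req_per_instance, error_rate):
    # Scale DOWN only if ALL signals are calm (AND), to avoid flapping
    return cpu_pct < 30 and req_per_instance < 200 and error_rate < 0.001
```

Note the hysteresis band between the thresholds (70% up vs 30% down): a system hovering at 50% CPU triggers neither rule and stays put.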
2. Graceful Shutdown:
import signal
import sys
import time

def graceful_shutdown(signum, frame):
    logger.info('Received SIGTERM, starting graceful shutdown')
    # Stop accepting new requests
    server.stop_accepting()
    # Wait for in-flight requests to complete (max 30 sec)
    timeout = 30
    while server.active_requests() > 0 and timeout > 0:
        time.sleep(1)
        timeout -= 1
    # Close database connections
    db.close_all_connections()
    logger.info('Graceful shutdown complete')
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)
3. Pre-Warming:
# Warm up instance before adding to load balancer
is_warmed_up = False

def health_check():
    global is_warmed_up
    if not is_warmed_up:
        # Pre-load caches, establish connections
        redis.ping()
        db.execute('SELECT 1')
        load_config_cache()
        is_warmed_up = True
        return {'status': 'warming'}
    return {'status': 'healthy'}
Stage 12: Multi-Region Deployment (Global Scale)
Global Architecture
┌──────────────────┐
│ Route 53 (DNS) │
│ Geo-routing │
└────────┬─────────┘
│
┌──────────────┼──────────────┐
│ │ │
↓ ↓ ↓
┌─────────────┐ ┌──────────┐ ┌──────────┐
│ US-EAST-1 │ │ EU-WEST-1│ │AP-SOUTH-1│
│ (Primary) │ │ (Backup) │ │ (Active) │
│ │ │ │ │ │
│ LB + 20 Web │ │LB + 15 W │ │LB + 10 W │
│ Master DB │ │Replica DB│ │Replica DB│
│ Redis │ │Redis │ │Redis │
└─────────────┘ └──────────┘ └──────────┘
Multi-Region Database Strategy
1. Active-Passive (Disaster Recovery):
US-EAST-1 (Primary):
- Handles 100% writes
- Handles 100% reads
EU-WEST-1 (Passive):
- Receives async replication (lag: 1-5 seconds)
- Handles 0% traffic (warm standby)
Failover:
- Manual failover: 5-10 minutes
- Automatic failover: 1-2 minutes (AWS RDS)
- Data loss: Up to 5 seconds (RPO=5s)
2. Active-Active (Read Replicas):
US-EAST-1 (Primary):
- Handles 100% writes
- Handles 60% reads (US users)
EU-WEST-1 (Read Replica):
- Handles 40% reads (EU users)
- Promotes to master if US fails
Latency improvement:
- US user: 150ms → 20ms
- EU user: 200ms → 15ms
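The routing decision behind those latency gains is simple: writes always target the single primary, while reads go to the replica nearest the user. A sketch (endpoint names are hypothetical):

```python
PRIMARY_REGION = 'us-east-1'

# Hypothetical per-region database endpoints
ENDPOINTS = {
    'us-east-1': 'db-us.internal',          # primary (writes + US reads)
    'eu-west-1': 'db-eu-replica.internal',  # read replica for EU users
}

def pick_endpoint(operation, user_region):
    if operation == 'write':
        # Single write master: all writes cross to the primary region
        return ENDPOINTS[PRIMARY_REGION]
    # Reads go to the nearest replica, falling back to the primary
    return ENDPOINTS.get(user_region, ENDPOINTS[PRIMARY_REGION])
```

The caveat is replication lag: a user who writes and immediately reads from a replica may not see their own write, so read-your-writes paths often pin to the primary briefly after a write.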
3. Multi-Master (CockroachDB, Cassandra):
US-EAST-1: EU-WEST-1: AP-SOUTH-1:
Write + Read Write + Read Write + Read
↓ ↓ ↓
┌───────────────┴────────────────┐
│ Distributed Consensus (Raft) │
│ Eventual Consistency │
└─────────────────────────────────┘
Benefits:
- No single point of failure
- Write to nearest region (low latency)
Tradeoffs:
- Eventual consistency (not ACID)
- Conflict resolution needed
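The most common conflict-resolution policy is last-write-wins (the approach Cassandra uses, via per-cell timestamps): when two regions have written different values for the same key, the version with the newest timestamp survives. A sketch:

```python
def resolve_conflict(versions):
    """Last-write-wins: pick the version with the newest timestamp.

    `versions` is a list of (timestamp, value) pairs replicated from
    different regions. LWW is simple but can silently drop one of two
    concurrent writes - part of the eventual-consistency tradeoff.
    """
    return max(versions, key=lambda v: v[0])[1]

# Two regions wrote different values for the same key
winner = resolve_conflict([
    (1_700_000_000.0, 'EU write'),
    (1_700_000_005.0, 'US write'),
])
```

Systems that cannot tolerate dropped writes use richer schemes instead (vector clocks, CRDTs, or application-level merge functions).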
CDN + Multi-Region
CloudFront with Regional Failover:
Origins:
  - Id: us-origin              # Primary
    DomainName: us-app.example.com
  - Id: eu-origin              # Failover
    DomainName: eu-app.example.com
OriginGroups:
  - FailoverCriteria:
      StatusCodes: [500, 502, 503, 504]
    Members:                   # Listed in priority order
      - OriginId: us-origin
      - OriginId: eu-origin
# If US origin fails, automatically route to EU
Data Sovereignty and Compliance
GDPR Requirements:
EU user data MUST stay in EU:
User from Germany:
├─ Route53 → EU-WEST-1 region
├─ Data stored in Frankfurt data center
├─ Backups stored in EU
└─ Analytics processed in EU
US user data:
├─ Route53 → US-EAST-1 region
├─ Data stored in Virginia data center
└─ Not subject to GDPR
Implementation:
def get_user_region(ip_address):
    # GeoIP lookup
    country = geoip.country(ip_address)
    if country in ['DE', 'FR', 'GB', 'IT', 'ES']:
        return 'eu-west-1'
    elif country in ['IN', 'SG', 'JP', 'AU']:
        return 'ap-south-1'
    else:
        return 'us-east-1'

def create_user(email, password, ip_address):
    region = get_user_region(ip_address)
    db = get_database(region)
    user = db.create_user(email=email, password=password, region=region)
    # User data never leaves designated region
    return user
Disaster Recovery (DR)
RTO and RPO:
- RTO (Recovery Time Objective): How long until service restored?
- RPO (Recovery Point Objective): How much data loss acceptable?
DR Tiers:
Tier 1 - Cold Standby:
- RTO: 24-72 hours
- RPO: 24 hours
- Cost: 10% of primary
- Method: Backup tapes, offsite storage
Tier 2 - Warm Standby:
- RTO: 1-4 hours
- RPO: 1 hour
- Cost: 30% of primary
- Method: Replicated DB, no active servers
Tier 3 - Hot Standby:
- RTO: 5-15 minutes
- RPO: 1 minute
- Cost: 50% of primary
- Method: Active-passive with auto-failover
Tier 4 - Active-Active:
- RTO: 0 seconds (no downtime)
- RPO: 0 seconds (no data loss)
- Cost: 100% of primary (2x infrastructure)
- Method: Multi-region active-active
AWS Multi-Region DR:
# Automated failover with Route53 health checks
aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "us-east-1",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "Z123",
        "DNSName": "us-load-balancer.aws.com",
        "EvaluateTargetHealth": true
      }
    }
  }, {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "eu-west-1",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "Z456",
        "DNSName": "eu-load-balancer.aws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'
# If US health check fails → DNS automatically points to EU
Multi-Region Cost Analysis
Infrastructure Costs:
Single Region (US-EAST-1):
- Compute: 50 instances × $200/month = $10,000
- Database: $5,000/month
- CDN: $2,000/month
- Total: $17,000/month
Multi-Region (US + EU + ASIA):
- Compute: 100 instances × $200/month = $20,000 (2x)
- Database: $12,000/month (replication + storage)
- CDN: $3,000/month (multi-region)
- Cross-region data transfer: $2,000/month
- Total: $37,000/month (2.2x cost)
Benefits:
- 50-70% latency reduction for international users
- 99.99% availability (vs 99.9% single region)
- Regulatory compliance (GDPR, data residency)
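The cost comparison above can be checked with quick arithmetic:

```python
# Monthly infrastructure costs from the text (USD)
single_region = {'compute': 50 * 200, 'database': 5_000, 'cdn': 2_000}
multi_region = {
    'compute': 100 * 200,   # 2x instance count across three regions
    'database': 12_000,     # replication + storage
    'cdn': 3_000,
    'transfer': 2_000,      # cross-region data transfer
}

single_total = sum(single_region.values())
multi_total = sum(multi_region.values())
print(f'${single_total:,} vs ${multi_total:,} ({multi_total / single_total:.1f}x)')
```

The takeaway is that going multi-region roughly doubles spend rather than tripling it, because databases and CDN costs don't scale linearly with region count.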
Conclusion: Scaling Journey Summary
Capacity Roadmap
Stage 1-2: Single Server → Separate DB (0-10K Users)
- Cost: $100-600/month
- Complexity: Low
- Team: 1-3 engineers
Stage 3-4: Load Balancer → DB Replication (10K-100K Users)
- Cost: $1K-4K/month
- Complexity: Medium
- Team: 3-10 engineers
Stage 5-6: Caching → CDN (100K-2M Users)
- Cost: $3K-10K/month
- Complexity: Medium-High
- Team: 10-30 engineers
Stage 7-8: Sharding → Message Queues (2M-20M Users)
- Cost: $10K-50K/month
- Complexity: High
- Team: 30-100 engineers
Stage 9-12: Microservices → Multi-Region (20M+ Users)
- Cost: $50K-500K/month
- Complexity: Very High
- Team: 100-1000+ engineers
Key Takeaways
1. Scale gradually - Don’t over-engineer early. Instagram scaled to 30M users with 3 engineers and PostgreSQL.
2. Measure everything - You can’t optimize what you don’t measure. Start with basic monitoring from day one.
3. Eliminate single points of failure - Load balancers, multiple databases, multi-AZ deployments.
4. Cache aggressively - 80%+ of requests can be served from cache. Redis is your best friend.
5. Design for failure - Services will fail. Plan for graceful degradation and quick recovery.
6. Automate operations - Auto-scaling, automated backups, infrastructure as code. Manual processes don’t scale.
7. Cost optimization - Scaling isn’t just about performance. A 10% efficiency gain at scale saves $50K/year.
8. Team structure mirrors architecture - Monolith works for small teams. Microservices need organizational boundaries.
Remember: Every large-scale system started as a simple application. Focus on solving user problems first, then scale when necessary. Premature optimization is the root of all evil.