Scale From Zero To Millions Of Users

This comprehensive guide explores the architectural evolution required to scale a system from zero to millions of users. We’ll examine real-world capacity planning, specific technologies, bottleneck analysis, and production considerations used at companies like Meta, Netflix, and Amazon. Each stage includes concrete numbers, technology choices, and tradeoffs that staff engineers encounter in production environments.

Scaling Fundamentals

Scaling refers to the ability of a system to handle growing workloads while maintaining performance, availability, and cost efficiency. There are two primary approaches:

Vertical Scaling (Scale Up):

  • Increasing single server capacity (CPU, RAM, disk, network)
  • Typical progression: 2 cores/4GB → 8 cores/32GB → 96 cores/768GB
  • AWS EC2 limits: up to 448 vCPUs and 24TB RAM (u-24tb1.metal)
  • Advantages: Simple, no code changes, strong consistency
  • Disadvantages: Hardware limits (ceiling around $50K/month), single point of failure, downtime during upgrades
  • Cost: Superlinear (doubling RAM often costs 2-3x more)

Horizontal Scaling (Scale Out):

  • Adding more servers to distribute load
  • Typical progression: 1 → 3 → 10 → 100 → 1000+ servers
  • Advantages: No theoretical limit, fault tolerance, cost-effective at scale
  • Disadvantages: Complex architecture, eventual consistency, network overhead
  • Cost: Linear (2x servers = 2x cost, but handles 2x+ traffic with efficiency)

Key Principle: Use vertical scaling initially (simpler), transition to horizontal scaling as you approach 10K+ concurrent users.
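The transition point can be sanity-checked with a back-of-envelope estimate. The sketch below converts daily active users into peak requests per second; the requests-per-user and peak factor are illustrative assumptions that vary widely by product:

```python
def peak_rps(daily_active_users, requests_per_user_per_day=50, peak_factor=3):
    # Average rate over a day, then scale by a peak-to-average factor
    avg_rps = daily_active_users * requests_per_user_per_day / 86_400
    return avg_rps * peak_factor

# 10K DAU at ~50 requests/user/day with a 3x peak:
print(round(peak_rps(10_000)))  # ~17 req/sec at peak
```

Numbers like these justify why a single modest server (10-50 req/sec) comfortably serves the first stage below.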

Stage 1: Single Server (0-1K Users)

Architecture

┌─────────────────────────────────┐
│     Single Server (4 cores)     │
│  ┌──────────────────────────┐   │
│  │   Web Server (Nginx)     │   │
│  │   App Server (Node.js)   │   │
│  │   Database (PostgreSQL)  │   │
│  │   Static Files           │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘

Capacity Analysis

  • Hardware: 4 vCPUs, 16GB RAM, 200GB SSD
  • Cost: $80-160/month (AWS t3.xlarge, GCP n2-standard-4)
  • Capacity: 10-50 req/sec, 1K daily active users
  • Database: Single PostgreSQL instance, 10-20GB data
  • Response Time: 50-200ms average

Technology Stack

  • Web Server: Nginx (handles 10K connections per core)
  • Application: Node.js, Python (Django), Ruby (Rails)
  • Database: PostgreSQL (ACID compliance) or MySQL
  • OS: Ubuntu 22.04 LTS, Amazon Linux 2023

Bottlenecks

  1. CPU contention between app and DB during traffic spikes
  2. Disk I/O becomes bottleneck at 1K+ writes/sec
  3. Memory pressure when DB + app exceed 16GB
  4. Network limits on single network interface (1-5 Gbps)

When to Evolve

  • Response times consistently exceed 500ms
  • CPU utilization sustained above 70%
  • Database queries taking >100ms
  • Downtime affects revenue/reputation

Stage 2: Separate Database (1K-10K Users)

Architecture

                  ┌──────────────────────┐
    Internet ───→ │   Web/App Server     │
                  │   (8 cores, 32GB)    │
                  └──────────┬───────────┘

                             │ Private Network
                             │ (10 Gbps)

                  ┌──────────────────────┐
                  │   Database Server    │
                  │   PostgreSQL         │
                  │   (8 cores, 64GB)    │
                  │   500GB SSD          │
                  └──────────────────────┘

Capacity Analysis

  • Web/App Server: 8 vCPUs, 32GB RAM, handles 100-200 req/sec
  • Database Server: 8 vCPUs, 64GB RAM (50% for DB cache), 500GB SSD
  • Combined Cost: $400-600/month
  • Capacity: 1K-10K daily active users (100-200 req/sec sustained)
  • Database: 50-200GB data, 500-1K queries/sec

Key Improvements

  1. Independent Scaling: Scale web and DB separately based on bottlenecks
  2. Resource Isolation: DB gets dedicated memory for caching (shared_buffers)
  3. Network Optimization: 10 Gbps private network reduces latency to <1ms
  4. Backup Strategy: Daily full backups, hourly incrementals (500GB = 30min backup)

Database Optimization

# postgresql.conf tuning for a 64GB RAM server
shared_buffers = 16GB          -- 25% of RAM
effective_cache_size = 48GB    -- 75% of RAM
work_mem = 128MB               -- Per query operation
maintenance_work_mem = 2GB     -- For VACUUM, CREATE INDEX
max_connections = 200          -- Application pool size

-- Indexes for common queries
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX idx_posts_created_at ON posts(created_at DESC);

Bottlenecks

  1. Single point of failure - DB downtime = total outage
  2. Read-heavy workloads - 80% reads, 20% writes is typical
  3. Connection pooling - Max 200 connections becomes limit
  4. Query performance - Complex joins slow down at 100M+ rows
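The connection limit (bottleneck 3) is commonly addressed with an external pooler such as PgBouncer in transaction-pooling mode, which multiplexes thousands of client connections onto a small pool of real PostgreSQL connections. A minimal sketch; hostnames and sizes are illustrative:

```ini
; pgbouncer.ini — illustrative settings
[databases]
production = host=db-server.internal port=5432 dbname=production

[pgbouncer]
listen_port = 6432
pool_mode = transaction     ; release server connection at transaction end
max_client_conn = 5000      ; clients PgBouncer will accept
default_pool_size = 50      ; actual PostgreSQL connections per database
```

Applications then connect to port 6432 instead of PostgreSQL directly; the 200-connection server limit stops being the ceiling.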

Vertical Scaling Runway

  • Can scale to 96 cores, 768GB RAM (e.g., AWS r5.24xlarge)
  • Handles up to 10K req/sec with proper indexing
  • Cost: $5K-10K/month at maximum vertical scale

Stage 3: Load Balancer + Multiple Web Servers (10K-100K Users)

Architecture

                         ┌──────────────────┐
    Internet ──────────→ │  Load Balancer   │
                         │  (HAProxy/ALB)   │
                         └────────┬─────────┘

           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
           ↓                      ↓                      ↓
    ┌──────────┐          ┌──────────┐          ┌──────────┐
    │  Web-1   │          │  Web-2   │          │  Web-3   │
    │  8 cores │          │  8 cores │          │  8 cores │
    └────┬─────┘          └────┬─────┘          └────┬─────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘


                    ┌──────────────────┐
                    │   Database (RW)  │
                    │   16 cores       │
                    └──────────────────┘

Capacity Analysis

  • Load Balancer: AWS ALB (50K connections, 3K req/sec per AZ)
  • Web Servers: 3-5 instances, each handles 150-200 req/sec
  • Total Capacity: 600-1000 req/sec, 5M-10M requests/day
  • Cost: $1K-2K/month (LB: $20 + 5x$160)
  • Availability: 99.9% (single AZ) to 99.99% (multi-AZ)

Load Balancing Strategies

1. Round Robin

Request 1 → Web-1
Request 2 → Web-2
Request 3 → Web-3
Request 4 → Web-1 (cycles back)
  • Simple, even distribution
  • Doesn’t account for server load differences

2. Least Connections

Web-1: 45 active connections → Choose this
Web-2: 67 active connections
Web-3: 52 active connections
  • Better for long-lived connections (WebSockets)
  • Overhead: LB must track connection counts

3. IP Hash

hash(client_ip) % num_servers = target_server
  • Session affinity without sticky sessions
  • Problem: Uneven distribution if traffic from NAT/proxy
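The formula above can be sketched in a few lines; the IP addresses are placeholders. Note that all clients behind one corporate NAT share an IP and therefore land on the same server, which is the uneven-distribution risk:

```python
import hashlib

def target_server(client_ip, num_servers=3):
    # Stable hash: the same client IP always maps to the same server index
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return h % num_servers

# Every request from this IP lands on the same backend:
assert target_server('203.0.113.7') == target_server('203.0.113.7')
```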

4. Weighted Round Robin (Production Standard)

Web-1: weight=3 (newer, 16 cores)
Web-2: weight=2 (older, 8 cores)
Web-3: weight=2 (older, 8 cores)
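In nginx, the weighted scheme above maps directly onto `weight` parameters in the upstream block (hostnames are illustrative):

```nginx
upstream backend {
    server web1.internal weight=3;  # newer hardware, 16 cores
    server web2.internal weight=2;  # older, 8 cores
    server web3.internal weight=2;  # older, 8 cores
}
```

With these weights, web1 receives 3 of every 7 requests, matching its larger capacity.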

Session Management (Critical for Stateless Design)

Option 1: Sticky Sessions

upstream backend {
    ip_hash;
    server web1.example.com;
    server web2.example.com;
}
  • Pros: Simple, no code changes
  • Cons: Uneven load, loses sessions on server failure

Option 2: Centralized Session Store (Recommended)

// Node.js with Redis-backed sessions (connect-redis v3 API)
const session = require('express-session');
const RedisStore = require('connect-redis')(session);

app.use(session({
    store: new RedisStore({
        host: 'redis-cluster.internal',
        port: 6379,
        ttl: 86400  // 24 hours
    }),
    secret: process.env.SESSION_SECRET,  // never hardcode in production
    resave: false,
    saveUninitialized: false
}));
  • Stores sessions in Redis cluster
  • Any web server can handle any request
  • Session lookup: 1-2ms (vs 50ms database query)

Health Checks and Auto-Recovery

# AWS ALB Target Group Health Check
HealthCheckProtocol: HTTP
HealthCheckPath: /health
HealthCheckIntervalSeconds: 30
HealthyThresholdCount: 2        # 2 successes = healthy
UnhealthyThresholdCount: 3      # 3 failures = unhealthy
Timeout: 5

Application Health Endpoint:

app.get('/health', async (req, res) => {
    const checks = {
        database: await checkDatabaseConnection(),
        redis: await checkRedisConnection(),
        diskSpace: checkDiskSpace() > 10 // >10% free
    };

    const healthy = Object.values(checks).every(v => v === true);
    res.status(healthy ? 200 : 503).json({
        status: healthy ? 'healthy' : 'unhealthy',
        checks
    });
});

Bottlenecks

  1. Database becomes bottleneck - All web servers hit single DB
  2. No read scaling - Master handles all reads and writes
  3. Database CPU at 70-80% during peak hours
  4. Lock contention on hot tables (users, orders)

Stage 4: Database Replication - Master-Slave (100K-500K Users)

Architecture

           Load Balancer

        ┌────────┼────────┐
        │        │        │
      Web-1    Web-2    Web-3
        │        │        │
        └────────┼────────┘

         ┌───────┴───────┐
         │               │
    ┌────▼────┐    ┌─────▼──────┐
    │ Master  │───→│  Replica-1 │ (Async Replication)
    │ (Write) │    │   (Read)   │
    │         │───→│  Replica-2 │
    │         │    │   (Read)   │
    └─────────┘    └────────────┘
       16 cores     8 cores each

Capacity Analysis

  • Master DB: 16 cores, 128GB RAM, 1TB SSD (handles all writes)
  • Read Replicas: 2-5 instances, 8 cores, 64GB RAM each
  • Write Capacity: 2K-5K writes/sec (limited by master)
  • Read Capacity: 10K-50K reads/sec (distributed across replicas)
  • Typical Ratio: 80% reads, 20% writes
  • Cost: $2K-4K/month (Master: $800, Replicas: $400 each)

Replication Configuration

PostgreSQL Streaming Replication:

# postgresql.conf (Master)
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = off  # Async for performance

# Replica configuration (postgresql.conf + standby.signal on PostgreSQL 12+; recovery.conf on older versions)
primary_conninfo = 'host=master-db port=5432 user=replicator'
hot_standby = on

MySQL Replication:

-- Master configuration
SET GLOBAL server_id = 1;
SET GLOBAL binlog_format = 'ROW';  -- Safer than STATEMENT
SET GLOBAL sync_binlog = 1;        -- Durability

-- On Replica
CHANGE MASTER TO
    MASTER_HOST='master-db.internal',
    MASTER_USER='replicator',
    MASTER_PASSWORD='secure_password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=107;
START SLAVE;

Application-Level Read/Write Splitting

Approach 1: Manual Routing

# Django example: a manager that defaults reads to a replica
class ReplicaReadManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset().using('replica')

# Usage: route each query explicitly
user = User.objects.using('replica').get(id=123)  # Read from replica
user.email = 'new@email.com'
user.save(using='master')  # Write to master

Approach 2: Automatic Routing (Recommended)

# Database router
import random

class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        return random.choice(['replica1', 'replica2', 'replica3'])

    def db_for_write(self, model, **hints):
        return 'master'

# settings.py
DATABASE_ROUTERS = ['myapp.routers.ReadReplicaRouter']
DATABASES = {
    'master': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'master-db.internal',
        'NAME': 'production',
    },
    'replica1': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'replica1-db.internal',
        'NAME': 'production',
    },
    'replica2': { ... }
}

Replication Lag Management

Monitoring Replication Lag:

-- PostgreSQL
SELECT
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
    EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds
FROM pg_stat_replication;

Handling Replication Lag in Application:

async function getUserProfile(userId) {
    // Write user activity
    await masterDB.query('UPDATE users SET last_seen = NOW() WHERE id = $1', [userId]);

    // Critical read: use master to avoid stale data
    const user = await masterDB.query('SELECT * FROM users WHERE id = $1', [userId]);

    // Non-critical read: use replica (may be 100-500ms behind)
    const posts = await replicaDB.query('SELECT * FROM posts WHERE user_id = $1', [userId]);

    return { user, posts };
}

Production Consideration:

  • Acceptable lag: 100ms-1s for most applications
  • Alert threshold: >5 seconds (indicates problem)
  • Failover threshold: >30 seconds (promote replica to master)
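The thresholds above can be encoded as a small policy function that a monitoring loop evaluates against the lag query shown earlier (function and response names are illustrative):

```python
def lag_action(lag_seconds):
    """Map measured replication lag to an operational response."""
    if lag_seconds > 30:
        return 'failover'   # promote a replica to master
    if lag_seconds > 5:
        return 'alert'      # page on-call; something is wrong
    return 'ok'             # normal async replication lag

print(lag_action(0.3))  # ok
print(lag_action(12))   # alert
print(lag_action(45))   # failover
```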

Failover Strategy

Automatic Failover with AWS RDS:

Master fails → RDS detects (30-60 seconds)
              → Promotes read replica to master (60-120 seconds)
              → Updates DNS endpoint (30 seconds)
Total downtime: 2-3 minutes

Manual Failover:

# Promote replica to master
pg_ctl promote -D /var/lib/postgresql/data

# Update application config
# Point writes to new master
# Old master (if recovered) becomes replica

Bottlenecks

  1. Master write capacity - Single master limits write throughput
  2. Replication lag - Read replicas 100ms-1s behind master
  3. Storage growth - 1TB fills up in 6-12 months at 100GB/month
  4. Hot partitions - Some tables grow to 100M+ rows

Stage 5: Caching Layer (500K-2M Users)

Architecture

         Load Balancer

         ┌─────┼─────┐
         │     │     │
       Web-1 Web-2 Web-3
         │     │     │
         └─────┼─────┘

         ┌─────┴─────┐
         │           │
    ┌────▼────┐  ┌──▼────┐
    │  Redis  │  │  DB   │
    │ Cluster │  │Master │
    │ (Cache) │  │  +    │
    └─────────┘  │Replica│
                 └───────┘

Redis Cache Architecture

Single Instance (Development):

  • Capacity: 10K-50K ops/sec
  • Memory: 8-32GB
  • Use case: Small apps, development

Redis Cluster (Production):

┌──────────────────────────────────────┐
│            Redis Cluster             │
│  ┌────────┐  ┌────────┐  ┌────────┐  │
│  │ Master │  │ Master │  │ Master │  │
│  │ slots  │  │ slots  │  │ slots  │  │
│  │ 0-5460 │  │ 5461-  │  │ 10923- │  │
│  │        │  │ 10922  │  │ 16383  │  │
│  └───┬────┘  └───┬────┘  └───┬────┘  │
│      │           │           │       │
│  ┌───▼────┐  ┌───▼────┐  ┌───▼────┐  │
│  │ Slave  │  │ Slave  │  │ Slave  │  │
│  └────────┘  └────────┘  └────────┘  │
└──────────────────────────────────────┘
 (16384 hash slots split across masters)

Configuration:

# redis.conf for production
maxmemory 64gb
maxmemory-policy allkeys-lru  # Evict least recently used
save ""                        # Disable RDB snapshots (use AOF)
appendonly yes                 # Enable AOF persistence
appendfsync everysec           # Fsync every second (balance)

# Cluster mode
cluster-enabled yes
cluster-node-timeout 5000
cluster-replica-validity-factor 0

Capacity Analysis

  • Redis Cluster: 6 nodes (3 masters + 3 replicas), 64GB each
  • Cache Hit Ratio: 70-90% (target: >85%)
  • Throughput: 500K-1M ops/sec
  • Latency: <1ms (P50), <5ms (P99)
  • Cost: $1K-2K/month (AWS ElastiCache r6g.2xlarge)

Caching Strategies

1. Cache-Aside (Lazy Loading)

def get_user(user_id):
    # Try cache first
    cached_user = redis.get(f'user:{user_id}')
    if cached_user:
        return json.loads(cached_user)  # Cache hit

    # Cache miss: query database
    user = db.query('SELECT * FROM users WHERE id = %s', [user_id])

    # Store in cache (TTL: 1 hour)
    redis.setex(f'user:{user_id}', 3600, json.dumps(user))

    return user

2. Write-Through Cache

def update_user(user_id, data):
    # Update database
    db.query('UPDATE users SET email = %s WHERE id = %s', [data['email'], user_id])

    # Update cache immediately
    user = db.query('SELECT * FROM users WHERE id = %s', [user_id])
    redis.setex(f'user:{user_id}', 3600, json.dumps(user))

    return user

3. Write-Behind Cache (Write-Back)

# Buffer writes in Redis, async flush to DB
def update_user_async(user_id, data):
    # Update cache immediately
    redis.hset(f'user:{user_id}', mapping=data)
    redis.sadd('dirty_users', user_id)  # Mark for flush

    # Background worker flushes every 5 seconds
    # Benefit: 10x faster writes, risk: potential data loss

Cache Invalidation Strategies

Time-Based Expiration:

# TTL based on data change frequency
redis.setex('trending_posts', 60, data)      # 1 min (changes frequently)
redis.setex('user_profile:123', 3600, data)  # 1 hour (moderate)
redis.setex('country_list', 86400, data)     # 24 hours (rarely changes)

Event-Based Invalidation:

def update_user_email(user_id, new_email):
    # Fetch the current email first: needed to purge the secondary index
    old_email = db.query('SELECT email FROM users WHERE id = %s', [user_id])

    # Update database
    db.query('UPDATE users SET email = %s WHERE id = %s', [new_email, user_id])

    # Invalidate cache
    redis.delete(f'user:{user_id}')
    redis.delete(f'user:email:{old_email}')  # Secondary index

Cache Stampede Prevention:

import time

def get_expensive_data(key):
    # Try cache
    data = redis.get(key)
    if data:
        return data

    # Use lock to prevent multiple simultaneous DB queries
    lock_key = f'{key}:lock'
    if redis.set(lock_key, '1', nx=True, ex=10):  # 10 sec lock
        try:
            # This thread computes and caches
            data = expensive_database_query()
            redis.setex(key, 3600, data)
            return data
        finally:
            redis.delete(lock_key)
    else:
        # Other threads wait for cache to populate
        time.sleep(0.1)
        return get_expensive_data(key)  # Retry

What to Cache

High-Value Cache Targets:

  1. User sessions (100% cache hit rate)
  2. User profiles (90% hit rate, 1-hour TTL)
  3. Product catalog (85% hit rate, 5-min TTL)
  4. API responses (70% hit rate, 1-min TTL)
  5. Computed aggregations (95% hit rate, 10-min TTL)

Don’t Cache:

  • Financial transactions (requires DB accuracy)
  • Real-time inventory (stale data causes overselling)
  • Personalized recommendations (too many unique keys)

Cache Performance Metrics

# Monitor cache effectiveness
stats = redis.info('stats')
cache_hits = stats['keyspace_hits']
cache_misses = stats['keyspace_misses']
hit_ratio = cache_hits / ((cache_hits + cache_misses) or 1)  # guard against divide-by-zero

# Alert if hit ratio < 80%
# Indicates poor caching strategy or TTL tuning needed

Bottlenecks After Caching

  1. Static assets (images, CSS, JS) still served from origin
  2. Geographic latency for international users (100-300ms)
  3. Network bandwidth costs increasing (500GB-5TB/month)
  4. Cache memory pressure at 80-90% capacity

Stage 6: CDN for Static Assets (2M-5M Users)

Architecture

                    ┌──────────────────┐
    User (US)  ────→│  CDN Edge (US)   │ (Cache Hit: 95%)
                    └────────┬─────────┘
                             │ (5% cache miss)

                    ┌────────────────────┐
    User (EU)  ────→│  CDN Edge (EU)     │
                    └────────┬───────────┘


                    ┌─────────────────────┐
                    │    Origin Server    │
                    │  (Your Web Server)  │
                    └─────────────────────┘

CDN Capacity Analysis

  • CDN Providers: CloudFront, Cloudflare, Fastly, Akamai
  • Edge Locations: 200-400 global points of presence (PoPs)
  • Cache Hit Ratio: 85-95% for static assets
  • Latency Reduction: 200ms → 20-50ms (4-10x improvement)
  • Bandwidth Offload: 70-90% of traffic served from edge
  • Cost: $0.08-0.15/GB (decreases with volume commits)

CDN Configuration

CloudFront Distribution:

{
  "Origins": [{
    "DomainName": "origin.example.com",
    "CustomHeaders": [{
      "HeaderName": "X-Origin-Verify",
      "HeaderValue": "secret-token"
    }]
  }],
  "DefaultCacheBehavior": {
    "ViewerProtocolPolicy": "redirect-to-https",
    "AllowedMethods": ["GET", "HEAD", "OPTIONS"],
    "CachedMethods": ["GET", "HEAD"],
    "Compress": true,
    "DefaultTTL": 86400,      // 24 hours
    "MinTTL": 0,
    "MaxTTL": 31536000,       // 1 year
    "ForwardedValues": {
      "QueryString": false,
      "Cookies": { "Forward": "none" }
    }
  }
}

Cache-Control Headers:

// Express.js middleware for static assets
app.use('/static', express.static('public', {
    maxAge: '1y',  // Browser + CDN cache for 1 year
    immutable: true
}));

// Dynamic content with versioned URLs
app.get('/api/config', (req, res) => {
    res.set('Cache-Control', 'public, max-age=300');  // 5 min CDN cache
    res.json({ version: '2.0', features: [] });
});

// User-specific content
app.get('/api/profile', authenticate, (req, res) => {
    res.set('Cache-Control', 'private, no-store');  // No caching
    res.json({ user: req.user });
});

Asset Optimization

Image Optimization:

// Next.js Image component (automatic optimization)
// Modern formats are enabled globally in next.config.js:
//   images: { formats: ['image/avif', 'image/webp'] }
<Image
  src="/large-photo.jpg"
  width={800}
  height={600}
  quality={85}  // 85% quality (optimal tradeoff)
/>

// Savings: 3MB JPEG → 300KB WebP (10x reduction)

Code Splitting and Bundling:

// Webpack code splitting
import(/* webpackChunkName: "dashboard" */ './Dashboard')
  .then(module => {
    const Dashboard = module.default;
    // Load dashboard code only when needed
  });

// Result:
// - Initial bundle: 500KB → 100KB
// - Dashboard chunk: 400KB (loaded on demand)

CDN Cache Invalidation

Versioned URLs (Recommended):

<!-- Cache forever, change URL for new version -->
<script src="/static/js/app.v2.7.3.min.js"></script>
<link rel="stylesheet" href="/static/css/main.abc123.css">

Explicit Invalidation:

# CloudFront invalidation (costs $0.005 per path)
aws cloudfront create-invalidation \
  --distribution-id E1234567890 \
  --paths "/css/*" "/js/*"

# Cloudflare purge (free, instant)
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
  -H "Authorization: Bearer {token}" \
  -d '{"files":["https://example.com/css/main.css"]}'

Multi-Tier Caching Strategy

Browser Cache (1 hour)
    ↓ miss
CDN Edge Cache (24 hours)
    ↓ miss
CDN Shield/Origin (no cache)
    ↓ miss
Application Server (Redis: 1 hour)
    ↓ miss
Database

Effective TTL Stacking:

  • HTML: 5 min (frequently updated)
  • CSS/JS: 1 year (versioned URLs)
  • Images: 1 week (semi-static)
  • API responses: 1-5 min (data freshness)

Cost Optimization

Without CDN:
- Origin bandwidth: 10TB/month × $0.09/GB = $900

With CDN (90% cache hit rate):
- CDN bandwidth: 10TB × $0.085/GB = $850
- Origin bandwidth: 1TB × $0.09/GB = $90
- Total: $940

Net effect: Similar cost, but 10x better performance
At scale (100TB/month): $5K with CDN vs $9K without
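The arithmetic above, as a reusable sketch. The rates mirror the example figures; real CDN pricing is tiered by region and volume:

```python
def monthly_bandwidth_cost(tb_served, hit_ratio=0.90,
                           cdn_per_gb=0.085, origin_per_gb=0.09):
    gb = tb_served * 1000
    cdn_cost = gb * cdn_per_gb                          # all user traffic exits via CDN
    origin_cost = gb * (1 - hit_ratio) * origin_per_gb  # only cache misses hit origin
    return cdn_cost + origin_cost

print(round(monthly_bandwidth_cost(10), 2))  # 940.0 — matches the example above
```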

Stage 7: Database Sharding (5M-20M Users)

When to Shard

  • Single database exceeds 1TB (query performance degrades)
  • Write throughput exceeds 10K writes/sec (master bottleneck)
  • Replication lag consistently >5 seconds
  • Tables with >500M rows causing slow queries even with indexes

Sharding Strategies

1. Horizontal Partitioning by User ID (Hash-Based)

Users Table (100M rows):
┌──────────┬───────────────┐
│  Shard 1 │ user_id % 4=0 │
│  (25M)   │  0,4,8,12...  │
├──────────┼───────────────┤
│  Shard 2 │ user_id % 4=1 │
│  (25M)   │  1,5,9,13...  │
├──────────┼───────────────┤
│  Shard 3 │ user_id % 4=2 │
│  (25M)   │  2,6,10,14... │
├──────────┼───────────────┤
│  Shard 4 │ user_id % 4=3 │
│  (25M)   │  3,7,11,15... │
└──────────┴───────────────┘

Implementation:

def get_shard_for_user(user_id, num_shards=4):
    return user_id % num_shards

def get_user_posts(user_id):
    shard_id = get_shard_for_user(user_id)
    db = get_database_connection(f'shard_{shard_id}')
    return db.query('SELECT * FROM posts WHERE user_id = %s', [user_id])

2. Range-Based Sharding (Geographic)

┌────────────┬──────────────────┐
│  Shard US  │  country='US'    │
│  (40M)     │  lat: 25-50°N    │
├────────────┼──────────────────┤
│  Shard EU  │  country IN(...) │
│  (30M)     │  lat: 35-60°N    │
├────────────┼──────────────────┤
│  Shard ASIA│  country IN(...) │
│  (20M)     │  lat: 0-50°N     │
└────────────┴──────────────────┘

3. Directory-Based Sharding (Flexible)

┌──────────────────────────────┐
│    Shard Lookup Service      │
│  user_id → shard_id mapping  │
└──────────────────────────────┘

         ├─→ Shard 1: users 0-10M
         ├─→ Shard 2: users 10M-20M
         ├─→ Shard 3: users 20M-30M
         └─→ Shard 4: VIP users (premium tier)
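A minimal sketch of the directory approach, combining range rules with explicit per-user overrides (class and shard names are illustrative; a real lookup service would back the mapping with a replicated store and cache it in each application server):

```python
class ShardDirectory:
    def __init__(self):
        self.overrides = {}        # explicit user_id -> shard (e.g., VIP tier)
        self.ranges = [            # (upper_bound_exclusive, shard)
            (10_000_000, 'shard1'),
            (20_000_000, 'shard2'),
            (30_000_000, 'shard3'),
        ]

    def shard_for(self, user_id):
        if user_id in self.overrides:
            return self.overrides[user_id]
        for upper, shard in self.ranges:
            if user_id < upper:
                return shard
        raise KeyError(f'no shard for user {user_id}')

d = ShardDirectory()
d.overrides[42] = 'shard4'       # premium user pinned to the VIP shard
print(d.shard_for(42))           # shard4
print(d.shard_for(15_000_000))   # shard2
```

The flexibility (move any user by editing one mapping) is what the hash and range schemes give up.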

Cross-Shard Queries Challenge

Problem:

-- This query requires scanning ALL shards
SELECT * FROM posts
WHERE created_at > '2024-01-01'
ORDER BY created_at DESC
LIMIT 100;

Solution 1: Scatter-Gather

from concurrent.futures import ThreadPoolExecutor

def get_recent_posts_global(limit=100):
    shards = get_all_shards()
    futures = []

    # Query all shards in parallel
    with ThreadPoolExecutor(max_workers=8) as executor:
        for shard in shards:
            future = executor.submit(
                shard.query,
                'SELECT * FROM posts WHERE created_at > %s ORDER BY created_at DESC LIMIT %s',
                ['2024-01-01', limit]
            )
            futures.append(future)

    # Merge results
    all_posts = []
    for future in futures:
        all_posts.extend(future.result())

    # Sort and limit
    return sorted(all_posts, key=lambda x: x['created_at'], reverse=True)[:limit]

Solution 2: Denormalization (Recommended)

- Maintain separate "global feed" table in dedicated database
- Use Change Data Capture (CDC) to populate from shards
- Trade consistency for query performance
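The CDC idea above, reduced to its essence: replay insert events from each shard's change stream into one denormalized feed store (here just an in-memory list for illustration; production systems use tools like Debezium feeding a dedicated database):

```python
class GlobalFeed:
    def __init__(self):
        self.rows = []

    def apply(self, event):
        # Only inserts into the sharded posts table affect the feed
        if event['table'] == 'posts' and event['op'] == 'insert':
            self.rows.append(event['row'])
            self.rows.sort(key=lambda r: r['created_at'], reverse=True)

feed = GlobalFeed()
feed.apply({'table': 'posts', 'op': 'insert',
            'row': {'id': 1, 'created_at': '2024-01-02'}})
feed.apply({'table': 'posts', 'op': 'insert',
            'row': {'id': 2, 'created_at': '2024-01-05'}})
print(feed.rows[0]['id'])  # 2 (newest first)
```

The feed lags the shards slightly (the consistency trade) but global queries become a single-table read.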

Shard Rebalancing

Adding New Shard (Consistent Hashing):

# Before: 4 shards (0-3)
shard = user_id % 4

# After: 5 shards (0-4)
shard = user_id % 5

# Problem: 80% of keys need to move!
# Solution: Consistent hashing

Consistent Hashing Implementation:

import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes

        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.virtual_nodes):
            key = self._hash(f'{node}:{i}')
            self.ring[key] = node
            self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None
        hash_key = self._hash(key)
        # Binary search for the first ring position >= hash_key
        idx = bisect.bisect_left(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # Wrap around to the start of the ring
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage
shards = ConsistentHash(['shard1', 'shard2', 'shard3', 'shard4'])
shard = shards.get_node(f'user:{user_id}')

# Add new shard: only ~20% of keys need to move
shards.add_node('shard5')

Shard Management at Scale (10+ Shards)

Vitess (MySQL Sharding):

# Vitess topology
keyspaces:
  - name: users
    sharded: true
    shards:
      - name: "-80"    # Hash range 0x00-0x80
        master: shard1-master
        replicas: [shard1-replica1, shard1-replica2]
      - name: "80-"    # Hash range 0x80-0xFF
        master: shard2-master
        replicas: [shard2-replica1, shard2-replica2]

Citus (PostgreSQL Sharding):

-- Designate distributed tables
SELECT create_distributed_table('posts', 'user_id');
SELECT create_distributed_table('comments', 'user_id');

-- Citus automatically routes queries to correct shard
SELECT * FROM posts WHERE user_id = 12345;

Sharding Tradeoffs

Benefits:

  • Near-linear write scalability (4 shards = 4x write capacity)
  • Reduced blast radius (one shard failure affects 25% of users)
  • Cost optimization (smaller databases are cheaper to maintain)

Challenges:

  • Application complexity (shard routing logic)
  • Cross-shard transactions require 2PC (slow, avoid if possible)
  • Resharding is expensive (hours to days of migration)
  • Monitoring complexity (4x databases to monitor)

Production Capacity

  • Per Shard: 16 cores, 128GB RAM, 1TB SSD
  • Total Capacity: 4-16 shards = 40K-150K writes/sec
  • Data Size: 4-16TB total (1TB per shard)
  • Cost: $10K-40K/month depending on shard count

Stage 8: Message Queues and Async Processing (20M+ Users)

Architecture

     Web Servers

          ├──→ [Sync Request] ──→ Database

          └──→ [Async Task] ──→ Message Queue

                  ┌───────────────────┼───────────────┐
                  ↓                   ↓               ↓
              Worker-1           Worker-2        Worker-3
              (Email)         (Image Process)   (Analytics)

When to Use Async Processing

  1. Long-running tasks (>1 second response time)
  2. Non-critical operations (can tolerate delay)
  3. Resource-intensive (CPU/memory heavy)
  4. Third-party API calls (unreliable, slow)
  5. Batch operations (bulk updates, reports)

Message Queue Technologies

RabbitMQ (Traditional):

  • Throughput: 10K-50K messages/sec per node
  • Latency: 5-20ms
  • Persistence: Durable queues with disk backing
  • Features: Complex routing, priority queues
  • Cost: $200-500/month (3-node cluster)

Amazon SQS (Serverless):

  • Throughput: Unlimited (managed service)
  • Latency: 10-50ms
  • Persistence: 14-day retention
  • Features: Dead-letter queues, FIFO queues
  • Cost: $0.40 per million requests ($400 for 1B messages)

Apache Kafka (High-Throughput):

  • Throughput: 1M+ messages/sec per cluster
  • Latency: <10ms (P99)
  • Persistence: Days to weeks (configurable)
  • Features: Pub/sub, stream processing, replay
  • Cost: $1K-5K/month (3-node cluster)

Implementation Examples

Email Processing with SQS:

# Producer (web server)
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456/email-queue'

@app.post('/register')
def register_user(email, password):
    # Synchronous: create user in database
    user = User.create(email=email, password=hash_password(password))

    # Asynchronous: send welcome email (non-blocking)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({
            'user_id': user.id,
            'email': email,
            'template': 'welcome'
        })
    )

    return {'status': 'success', 'user_id': user.id}

# Consumer (worker process)
while True:
    # Long polling: wait up to 20s for messages instead of busy-looping
    messages = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20
    )

    for message in messages.get('Messages', []):
        try:
            data = json.loads(message['Body'])
            send_email(data['email'], template=data['template'])

            # Delete message after processing
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
        except Exception as e:
            # Message will return to queue after visibility timeout
            logger.error(f'Failed to process message: {e}')

Image Processing with Kafka:

# Producer
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

@app.post('/upload')
def upload_image(file):
    # Save original image to S3
    s3_key = upload_to_s3(file)

    # Queue thumbnail generation
    producer.send('image-processing', {
        'image_id': generate_id(),
        's3_key': s3_key,
        'operations': ['thumbnail', 'compress', 'watermark']
    })

    return {'status': 'processing', 's3_key': s3_key}

# Consumer
from kafka import KafkaConsumer
from PIL import Image

consumer = KafkaConsumer(
    'image-processing',
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    group_id='image-workers',
    auto_offset_reset='earliest'
)

for message in consumer:
    data = json.loads(message.value)

    # Download image
    image = download_from_s3(data['s3_key'])

    # Process
    if 'thumbnail' in data['operations']:
        thumbnail = Image.open(image).resize((200, 200))
        upload_to_s3(thumbnail, f"{data['image_id']}_thumb.jpg")

    # Takes 5-10 seconds, doesn't block web server

Queue Patterns

1. Task Queue (Work Distribution)

Producer → Queue → [Worker-1, Worker-2, Worker-3]
                    (Compete for tasks)

2. Pub/Sub (Fanout)

Publisher → Topic → [Subscriber-A: Email Service]
                  → [Subscriber-B: Analytics]
                  → [Subscriber-C: Audit Log]

3. Priority Queue

# RabbitMQ with priority (the queue must be declared with x-max-priority)
channel.queue_declare(queue='tasks', arguments={'x-max-priority': 10})

channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body=message,
    properties=pika.BasicProperties(priority=5)  # 0-10, higher = more urgent
)

Retry and Error Handling

Exponential Backoff:

def process_with_retry(task, max_retries=3):
    for attempt in range(max_retries):
        try:
            return execute_task(task)
        except Exception as e:
            if attempt == max_retries - 1:
                # Send to dead-letter queue
                dlq.send(task)
                logger.error(f'Task failed after {max_retries} attempts: {e}')
                return

            # Exponential backoff: 1s, 2s, 4s, ... (doubles each attempt)
            sleep_time = 2 ** attempt
            time.sleep(sleep_time)
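
One refinement worth noting: plain exponential backoff lets many failed workers retry in lockstep, re-hammering the struggling dependency. Adding random jitter spreads the retries out. A minimal sketch of the 'full jitter' strategy (the `base` and `cap` defaults are illustrative):

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=30.0):
    """Full jitter: sleep a random duration in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The cap matters: without it, attempt 10 would wait up to 17 minutes.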

Dead-Letter Queue (DLQ):

# SQS DLQ configuration
{
    "RedrivePolicy": {
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456:failed-tasks",
        "maxReceiveCount": 3  # After 3 failures, move to DLQ
    }
}

# Monitor DLQ for issues
def monitor_dlq():
    attrs = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=['ApproximateNumberOfMessages']
    )
    dlq_size = int(attrs['Attributes']['ApproximateNumberOfMessages'])

    if dlq_size > 100:
        alert_ops_team('High DLQ size, investigate failures')

Capacity Planning

  • Queue Size: Monitor queue depth (alert if >10K messages)
  • Worker Scaling: Auto-scale based on queue length
  • Throughput: 10K-100K tasks/sec with proper parallelization
  • Cost: $500-2K/month (queue + workers)
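
The "auto-scale based on queue length" bullet can be made concrete with a target-tracking calculation. A sketch under assumed numbers (each worker drains ~100 messages/minute and the backlog should clear within 5 minutes; both figures are hypothetical):

```python
import math

def desired_workers(queue_depth, drain_rate_per_worker=100,
                    target_minutes=5, min_workers=1, max_workers=50):
    """Workers needed to clear the current backlog within the target window."""
    needed = math.ceil(queue_depth / (drain_rate_per_worker * target_minutes))
    # Clamp to the fleet's configured bounds
    return max(min_workers, min(max_workers, needed))
```

Run on a schedule (or from a CloudWatch alarm), this gives the same behavior as the ASG target-tracking policies shown later in Stage 11.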

Stage 9: Microservices Architecture (50M+ Users)

Monolith to Microservices Evolution

Monolith (Simple, Single Deployment):

┌───────────────────────────────────┐
│      Monolithic Application       │
│  ┌─────────────────────────────┐  │
│  │ Auth │ Users │ Posts │ Pay  │  │
│  └─────────────────────────────┘  │
│        Single Database            │
└───────────────────────────────────┘

Microservices (Distributed, Independent):

┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│   Auth     │  │   Users    │  │   Posts    │  │  Payment   │
│  Service   │  │  Service   │  │  Service   │  │  Service   │
│    ↓       │  │    ↓       │  │    ↓       │  │    ↓       │
│  Auth DB   │  │  Users DB  │  │  Posts DB  │  │  Pay DB    │
└────────────┘  └────────────┘  └────────────┘  └────────────┘

When to Adopt Microservices

  • Team size >50 engineers (coordination overhead)
  • Monolith >500K lines of code (deploy takes >30 min)
  • Need independent scaling (payment spikes during Black Friday)
  • Different tech stacks (ML service in Python, core in Java)
  • Regulatory requirements (isolate PCI-compliant services)

Service Decomposition Strategy

By Business Domain (DDD):

┌──────────────────┐
│  User Service    │ - Authentication, profiles, preferences
│  (10 endpoints)  │ - Team: 8 engineers
└──────────────────┘

┌──────────────────┐
│  Content Service │ - Posts, comments, likes
│  (15 endpoints)  │ - Team: 12 engineers
└──────────────────┘

┌──────────────────┐
│ Payment Service  │ - Subscriptions, transactions, billing
│  (8 endpoints)   │ - Team: 6 engineers (PCI compliance)
└──────────────────┘

Service Communication

1. Synchronous (REST API):

# User service calls Payment service
import requests

def get_user_with_subscription(user_id):
    user = db.query('SELECT * FROM users WHERE id = %s', [user_id])

    # HTTP call to payment service
    response = requests.get(
        f'http://payment-service.internal/subscriptions/{user_id}',
        timeout=2  # Critical: set timeouts
    )

    if response.status_code == 200:
        user['subscription'] = response.json()

    return user

Problem: Cascading Failures

User Service (99.9% uptime)
    ↓ calls
Payment Service (99.9% uptime)
    ↓ calls
Email Service (99.9% uptime)

Combined uptime: 99.9% × 99.9% × 99.9% ≈ 99.7%
More dependencies = worse reliability
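
The arithmetic above generalizes: availabilities of services called in series multiply, so every added synchronous dependency lowers combined uptime. A quick sketch:

```python
def chain_availability(availabilities):
    """Combined uptime of services called in series: failures compound."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Three 99.9% services in series: 0.999**3 ≈ 0.997 (99.7%)
```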

2. Asynchronous (Event-Driven):

# User service publishes event
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def create_user(email, password):
    user = User.create(email=email, password=password)

    # Publish event (fire and forget)
    producer.send('user.created', {
        'user_id': user.id,
        'email': email,
        'timestamp': datetime.utcnow().isoformat()
    })

    return user

# Payment service subscribes to events
consumer = KafkaConsumer('user.created', group_id='payment-service',
                         bootstrap_servers=['kafka:9092'])

for message in consumer:
    event = json.loads(message.value)

    # Create free trial subscription
    create_trial_subscription(event['user_id'])

Benefits:

  • Loose coupling (services don’t directly depend on each other)
  • Better fault tolerance (payment service down doesn’t block user creation)
  • Audit trail (event log is source of truth)

API Gateway Pattern

                      ┌──────────────────┐
    Client ─────────→ │   API Gateway    │
                      │   (Kong/AWS ALB) │
                      └────────┬─────────┘

         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
         ↓                     ↓                     ↓
    ┌─────────┐          ┌─────────┐          ┌─────────┐
    │  Auth   │          │  Users  │          │  Posts  │
    │ Service │          │ Service │          │ Service │
    └─────────┘          └─────────┘          └─────────┘

API Gateway Responsibilities:

  1. Authentication/Authorization
  2. Rate limiting (1000 req/min per user)
  3. Request routing (/api/users/* → User Service)
  4. Response aggregation (combine multiple service calls)
  5. Protocol translation (HTTP → gRPC)
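
Responsibility 2 (rate limiting) is typically implemented as a token bucket per user or API key. Kong's rate-limiting plugin handles this for you; the sketch below just illustrates the mechanism (parameters are illustrative):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a multi-gateway deployment the bucket state lives in a shared store such as Redis, not in process memory, so all gateway instances enforce the same limit.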

Kong Configuration:

services:
  - name: user-service
    url: http://users.internal:8080
    routes:
      - name: user-routes
        paths:
          - /api/users
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
      - name: jwt
        config:
          claims_to_verify:
            - exp

Service Mesh (Advanced)

Istio/Linkerd for Service-to-Service Communication:

┌──────────┐         ┌──────────┐
│ Service  │         │ Service  │
│    A     │ ←────→  │    B     │
│    ↕     │         │    ↕     │
│ Sidecar  │ ← TLS → │ Sidecar  │ (Automatic encryption)
└──────────┘         └──────────┘

Features:

  • Automatic mTLS (mutual TLS encryption)
  • Traffic management (canary deployments, A/B testing)
  • Observability (distributed tracing with Jaeger)
  • Circuit breaking (prevent cascading failures)

Circuit Breaker Configuration:

# Istio DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Data Consistency in Microservices

Problem: Distributed Transactions

User creates order:
1. Order Service: Create order (SUCCESS)
2. Payment Service: Charge card (FAIL)
3. Inventory Service: Decrease stock (PENDING)

Result: Inconsistent state

Solution: Saga Pattern

# Orchestration-based saga
class OrderSaga:
    def execute(self, order_data):
        try:
            # Step 1: Create order
            order = order_service.create_order(order_data)

            # Step 2: Reserve inventory
            inventory_service.reserve(order.items)

            # Step 3: Process payment
            payment = payment_service.charge(order.user_id, order.total)

            # Step 4: Confirm order
            order_service.confirm_order(order.id)

        except PaymentFailed:
            # Compensating transaction: unreserve inventory
            inventory_service.unreserve(order.items)
            order_service.cancel_order(order.id)
            raise
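
The saga above hard-codes one compensation path. A more general orchestrator records a compensation for each completed step and unwinds them in reverse order when any later step fails, whichever step that is. A hedged sketch (`SagaFailed` and the step shape are illustrative):

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensation) pairs.
    On failure, run the compensations of completed steps in reverse order."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception as e:
            for comp in reversed(done):
                comp()  # best-effort rollback of everything that succeeded
            raise SagaFailed(str(e))
```

Note the compensations themselves can fail; production orchestrators (e.g. Temporal, AWS Step Functions) retry them durably rather than best-effort.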

Microservices Tradeoffs

Benefits:

  • Independent deployability (deploy 10x per day)
  • Technology diversity (use best tool for job)
  • Team autonomy (reduce coordination)
  • Fault isolation (one service failure doesn’t crash entire system)

Challenges:

  • Operational complexity (10 services = 10 deployments, 10 databases)
  • Distributed debugging (trace request across services)
  • Data consistency (eventual consistency)
  • Network overhead (inter-service calls add 10-50ms latency)
  • Cost: 20-30% higher infrastructure costs vs monolith

Production Scale

  • Services: 20-100 microservices
  • Engineers: 100-500 (5-10 per service)
  • Deployment Frequency: 100-1000 deploys/day
  • Infrastructure Cost: $50K-200K/month

Stage 10: Monitoring, Logging, and Observability (Production-Critical)

Three Pillars of Observability

1. Metrics (What is happening?)

System Health:
- CPU: 45% average, 78% P95
- Memory: 62% used, 38% free
- Disk I/O: 1200 IOPS, 80MB/s
- Network: 500 Mbps in, 800 Mbps out

Application:
- Request rate: 5000 req/sec
- Error rate: 0.5% (25 errors/sec)
- Latency: P50=50ms, P95=200ms, P99=500ms
- Database connections: 150/200 in use

2. Logs (Why did it happen?)

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": 456789,
  "message": "Payment declined",
  "error": {
    "code": "insufficient_funds",
    "card_last4": "4242"
  }
}

3. Traces (How did request flow?)

Request ID: abc123
Total Duration: 450ms

1. API Gateway: 2ms
2. Auth Service: 15ms
3. User Service: 50ms
   └─ Database query: 45ms
4. Payment Service: 350ms ← SLOW
   ├─ Stripe API call: 320ms ← ROOT CAUSE
   └─ Database write: 25ms
5. Response: 8ms

Monitoring Stack

Prometheus + Grafana (Metrics)

# Prometheus scrape config
scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web1:9090', 'web2:9090', 'web3:9090']
    scrape_interval: 15s

# Application metrics (Node.js example)
const client = require('prom-client');

const httpRequestDuration = new client.Histogram({
    name: 'http_request_duration_ms',
    help: 'Duration of HTTP requests in ms',
    labelNames: ['method', 'route', 'status_code'],
    buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000]
});

app.use((req, res, next) => {
    const start = Date.now();
    res.on('finish', () => {
        const duration = Date.now() - start;
        httpRequestDuration
            .labels(req.method, req.route?.path || 'unknown', res.statusCode)
            .observe(duration);
    });
    next();
});

ELK Stack (Elasticsearch, Logstash, Kibana) for Logs

// Structured logging (best practice)
logger.info('User login', {
    user_id: 123,
    email: 'user@example.com',
    ip: req.ip,
    user_agent: req.headers['user-agent'],
    duration_ms: 45
});

// Avoid unstructured logs
// logger.info(`User ${email} logged in from ${ip}`);  // Hard to search

Jaeger/Zipkin (Distributed Tracing)

# OpenTelemetry instrumentation
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span('process_order') as span:
        span.set_attribute('order.id', order_id)

        with tracer.start_as_current_span('database_query'):
            order = db.query('SELECT * FROM orders WHERE id = %s', [order_id])

        # This HTTP call is automatically traced
        user = requests.get(f"http://user-service/users/{order['user_id']}")

        return order

Key Metrics to Monitor

System Metrics:

import psutil

# Collect every 15 seconds (note: cpu_percent(interval=1) blocks for 1 second)
metrics = {
    'cpu_percent': psutil.cpu_percent(interval=1),
    'memory_percent': psutil.virtual_memory().percent,
    'disk_io_read_bytes': psutil.disk_io_counters().read_bytes,
    'disk_io_write_bytes': psutil.disk_io_counters().write_bytes,
    'network_sent_bytes': psutil.net_io_counters().bytes_sent,
    'network_recv_bytes': psutil.net_io_counters().bytes_recv
}

Application Metrics (RED Method):

  • Rate: Requests per second
  • Errors: Error rate (percentage or count)
  • Duration: Response time (P50, P95, P99)
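
Given raw request samples, the RED numbers fall out directly. A sketch over a window of `(status_code, duration_ms)` pairs, using nearest-rank percentiles (a production system derives these from histogram buckets, as in the Prometheus example above):

```python
def red_metrics(requests, window_seconds):
    """requests: list of (status_code, duration_ms) seen within the window."""
    durations = sorted(d for _, d in requests)
    errors = sum(1 for s, _ in requests if s >= 500)

    def percentile(p):
        # Nearest-rank percentile over the sorted durations
        idx = min(len(durations) - 1, int(p / 100 * len(durations)))
        return durations[idx]

    return {
        'rate': len(requests) / window_seconds,       # R: requests per second
        'error_rate': errors / len(requests),         # E: fraction of 5xx
        'p50': percentile(50),                        # D: duration percentiles
        'p95': percentile(95),
        'p99': percentile(99),
    }
```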

Database Metrics:

-- PostgreSQL monitoring queries
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';  -- Active connections
SELECT pg_database_size('production') / 1024 / 1024 / 1024 AS size_gb;  -- DB size
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;

Alerting Strategy

Alert Severity Levels:

P0 (Page immediately, 24/7):
- Service completely down (>50% error rate)
- Database unavailable
- Payment processing failures

P1 (Page during business hours):
- High error rate (>5%)
- High latency (P95 > 2 seconds)
- Disk space >90% full

P2 (Create ticket, no page):
- Elevated error rate (>1%)
- Slow queries (>500ms)
- Memory usage >80%

Prometheus Alerting Rules (evaluated by Prometheus, routed via Alertmanager):

groups:
  - name: api-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }} (threshold: 5%)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_ms_bucket[5m]))) > 2000
        for: 10m
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}ms"

Log Aggregation at Scale

Challenges:

  • 10K req/sec = 864M requests/day
  • Each request logs ~3 lines ≈ 2.5B log lines/day
  • Average log size: 500 bytes = 1.25TB/day
  • Retention: 30 days = 37.5TB storage

Cost Optimization:

# Sample logs (log 10% of requests)
if random.random() < 0.1 or response.status_code >= 400:
    logger.info('Request processed', {
        'path': request.path,
        'method': request.method,
        'status': response.status_code,
        'duration_ms': duration
    })

# Result: 10x cost reduction, still capture all errors

Tiered Storage:

Hot Storage (1-7 days):    Elasticsearch - Fast search, $50/TB/month
Warm Storage (8-30 days):  S3 + Athena - Queryable, $23/TB/month
Cold Storage (>30 days):   S3 Glacier - Archive, $4/TB/month
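
Combining the 1.25TB/day estimate with the tier prices gives a quick monthly cost model. A sketch assuming the 30-day retention above, with logs deleted after the warm tier rather than archived to Glacier:

```python
def tiered_log_cost(tb_per_day, hot_days=7, warm_days=23,
                    hot_rate=50, warm_rate=23):
    """Monthly storage cost (USD) for a rolling hot/warm retention window."""
    hot_tb = tb_per_day * hot_days      # days 1-7 in Elasticsearch
    warm_tb = tb_per_day * warm_days    # days 8-30 in S3 + Athena
    return hot_tb * hot_rate + warm_tb * warm_rate

print(tiered_log_cost(1.25))  # 1098.75 — vs 37.5 TB * $50 = $1,875 all-hot
```

Tiering alone cuts storage cost roughly 40%; combined with the 10% sampling above, the two techniques compound.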

Production Monitoring Costs

  • Metrics: $500-2K/month (Datadog/New Relic)
  • Logs: $1K-5K/month (10TB/month ingestion)
  • Traces: $500-1K/month (sample 1-10% of requests)
  • Total: $2K-8K/month for comprehensive observability

Stage 11: Auto-Scaling Strategies

Horizontal Auto-Scaling

Metric-Based Scaling (AWS Auto Scaling Group):

AutoScalingGroup:
  MinSize: 3
  MaxSize: 50
  DesiredCapacity: 5

ScalingPolicies:
  - PolicyName: ScaleUpOnCPU
    MetricName: CPUUtilization
    TargetValue: 70
    ScaleUpCooldown: 300s    # Wait 5 min before scaling up again

  - PolicyName: ScaleUpOnRequests
    MetricName: RequestCountPerTarget
    TargetValue: 1000        # 1000 requests per instance
    ScaleDownCooldown: 600s  # Wait 10 min before scaling down

Predictive Scaling:

# AWS Predictive Scaling analyzes historical data
# Scales BEFORE traffic spikes

Typical pattern:
08:00 - 10:00: Morning spike (+200%)
12:00 - 13:00: Lunch spike (+150%)
18:00 - 20:00: Evening spike (+250%)

Predictive scaling at 07:45: scale from 5 to 15 instances, before the spike
(vs reactive scaling at 08:10: slow, users experience lag)

Kubernetes Horizontal Pod Autoscaler (HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Database Auto-Scaling

Aurora Serverless (AWS):

Engine: aurora-postgresql
MinCapacity: 2 ACU  # 2 Aurora Capacity Units = 4GB RAM
MaxCapacity: 64 ACU # 64 ACU = 128GB RAM
AutoPause: true
AutoPauseDelay: 300 # Pause after 5 min of inactivity

Cost:
- Active: $0.06/ACU/hour
- Storage: $0.10/GB/month
- Paused: Only pay for storage

Use case: Dev/test environments, sporadic traffic

Connection Pooling (PgBouncer):

# PgBouncer config
[databases]
production = host=db.internal port=5432

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000    # Accept 1000 app connections
default_pool_size = 25    # But only 25 DB connections
reserve_pool_size = 5     # Emergency pool

# Result: 1000 client connections share just 25 server-side DB connections
# DB connection overhead: ~10MB/conn → 250MB instead of 10GB

Auto-Scaling Best Practices

1. Scale on Multiple Metrics:

Scale UP if:
- CPU > 70% for 5 minutes, OR
- Request rate > 1000/instance, OR
- Error rate > 1%

Scale DOWN if:
- CPU < 30% for 15 minutes, AND
- Request rate < 200/instance, AND
- Error rate < 0.1%
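
These rules translate directly into code. Note the deliberate asymmetry (OR to scale up, AND to scale down): scale up eagerly, scale down cautiously. A sketch, with the duration conditions ("for 5 minutes") left to the caller:

```python
def scaling_decision(cpu, req_per_instance, error_rate):
    """Returns 'up', 'down', or 'hold' per the thresholds above.
    Assumes the caller has already verified the conditions held for the
    required duration (5 min for up, 15 min for down)."""
    if cpu > 70 or req_per_instance > 1000 or error_rate > 0.01:
        return 'up'
    if cpu < 30 and req_per_instance < 200 and error_rate < 0.001:
        return 'down'
    return 'hold'
```

Anything between the two bands holds steady, which prevents flapping (scaling up and down repeatedly around a single threshold).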

2. Graceful Shutdown:

import signal
import sys
import time

def graceful_shutdown(signum, frame):
    logger.info('Received SIGTERM, starting graceful shutdown')

    # Stop accepting new requests
    server.stop_accepting()

    # Wait for in-flight requests to complete (max 30 sec)
    timeout = 30
    while server.active_requests() > 0 and timeout > 0:
        time.sleep(1)
        timeout -= 1

    # Close database connections
    db.close_all_connections()

    logger.info('Graceful shutdown complete')
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

3. Pre-Warming:

# Warm up instance before adding to load balancer
is_warmed_up = False

def health_check():
    global is_warmed_up
    if not is_warmed_up:
        # Pre-load caches, establish connections
        redis.ping()
        db.execute('SELECT 1')
        load_config_cache()
        is_warmed_up = True

    return {'status': 'healthy' if is_warmed_up else 'warming'}

Stage 12: Multi-Region Deployment (Global Scale)

Global Architecture

            ┌──────────────────┐
            │  Route 53 (DNS)  │
            │  Geo-routing     │
            └────────┬─────────┘

      ┌──────────────┼──────────────┐
      │              │              │
      ↓              ↓              ↓
┌─────────────┐ ┌──────────┐ ┌──────────┐
│  US-EAST-1  │ │ EU-WEST-1│ │AP-SOUTH-1│
│ (Primary)   │ │ (Backup) │ │ (Active) │
│             │ │          │ │          │
│ LB + 20 Web │ │LB + 15 W │ │LB + 10 W │
│ Master DB   │ │Replica DB│ │Replica DB│
│ Redis       │ │Redis     │ │Redis     │
└─────────────┘ └──────────┘ └──────────┘

Multi-Region Database Strategy

1. Active-Passive (Disaster Recovery):

US-EAST-1 (Primary):
- Handles 100% writes
- Handles 100% reads

EU-WEST-1 (Passive):
- Receives async replication (lag: 1-5 seconds)
- Handles 0% traffic (warm standby)

Failover:
- Manual failover: 5-10 minutes
- Automatic failover: 1-2 minutes (AWS RDS)
- Data loss: Up to 5 seconds (RPO=5s)

2. Active-Active (Read Replicas):

US-EAST-1 (Primary):
- Handles 100% writes
- Handles 60% reads (US users)

EU-WEST-1 (Read Replica):
- Handles 40% reads (EU users)
- Promotes to master if US fails

Latency improvement:
- US user: 150ms → 20ms
- EU user: 200ms → 15ms

3. Multi-Master (CockroachDB, Cassandra):

US-EAST-1:     EU-WEST-1:     AP-SOUTH-1:
Write + Read   Write + Read   Write + Read
     ↓              ↓              ↓
   ┌───────────────┴─────────────────┐
   │  CockroachDB: Raft consensus,   │
   │  serializable transactions      │
   │  Cassandra: tunable / eventual  │
   │  consistency                    │
   └─────────────────────────────────┘

Benefits:
- No single point of failure
- Write to nearest region (low latency)

Tradeoffs:
- Cassandra: eventual consistency, conflict resolution needed
- CockroachDB: strongly consistent, but cross-region writes pay consensus latency

CDN + Multi-Region

CloudFront with Regional Failover:

Origins:
  - Id: us-origin
    DomainName: us-app.example.com
    Priority: 1
  - Id: eu-origin
    DomainName: eu-app.example.com
    Priority: 2  # Failover

OriginGroups:
  - FailoverCriteria:
      StatusCodes: [500, 502, 503, 504]
    Members:
      - OriginId: us-origin
      - OriginId: eu-origin

# If US origin fails, automatically route to EU

Data Sovereignty and Compliance

GDPR Requirements:

EU user data MUST stay in EU:

User from Germany:
├─ Route53 → EU-WEST-1 region
├─ Data stored in Frankfurt data center
├─ Backups stored in EU
└─ Analytics processed in EU

US user data:
├─ Route53 → US-EAST-1 region
├─ Data stored in Virginia data center
└─ Not subject to GDPR

Implementation:

def get_user_region(ip_address):
    # GeoIP lookup
    country = geoip.country(ip_address)

    if country in ['DE', 'FR', 'GB', 'IT', 'ES']:  # simplified; a real list covers all EU/EEA
        return 'eu-west-1'
    elif country in ['IN', 'SG', 'JP', 'AU']:
        return 'ap-south-1'
    else:
        return 'us-east-1'

def create_user(email, password, ip_address):
    region = get_user_region(ip_address)
    db = get_database(region)

    user = db.create_user(email=email, password=password, region=region)

    # User data never leaves designated region
    return user

Disaster Recovery (DR)

RTO and RPO:

  • RTO (Recovery Time Objective): How long until service restored?
  • RPO (Recovery Point Objective): How much data loss acceptable?

DR Tiers:

Tier 1 - Cold Standby:
- RTO: 24-72 hours
- RPO: 24 hours
- Cost: 10% of primary
- Method: Backup tapes, offsite storage

Tier 2 - Warm Standby:
- RTO: 1-4 hours
- RPO: 1 hour
- Cost: 30% of primary
- Method: Replicated DB, no active servers

Tier 3 - Hot Standby:
- RTO: 5-15 minutes
- RPO: 1 minute
- Cost: 50% of primary
- Method: Active-passive with auto-failover

Tier 4 - Active-Active:
- RTO: 0 seconds (no downtime)
- RPO: 0 seconds (no data loss)
- Cost: 100% of primary (2x infrastructure)
- Method: Multi-region active-active
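
Tier selection is a matter of finding the cheapest tier whose worst-case RTO and RPO still meet the business requirement. A sketch using the numbers above (times in minutes; the tier table is transcribed from the list):

```python
# tier: (worst-case RTO minutes, worst-case RPO minutes, cost as % of primary)
DR_TIERS = {
    1: (72 * 60, 24 * 60, 10),   # Cold standby
    2: (4 * 60, 60, 30),         # Warm standby
    3: (15, 1, 50),              # Hot standby
    4: (0, 0, 100),              # Active-active
}

def cheapest_dr_tier(max_rto_min, max_rpo_min):
    """Lowest-cost tier whose worst-case RTO and RPO meet the requirement."""
    for tier in sorted(DR_TIERS):  # tiers happen to be ordered cheapest-first
        rto, rpo, _ = DR_TIERS[tier]
        if rto <= max_rto_min and rpo <= max_rpo_min:
            return tier
    return None
```

For example, a requirement of "restore within 30 minutes, lose at most 5 minutes of data" forces Tier 3 at 50% of primary cost; relaxing RTO to 4 hours drops you to Tier 2 at 30%.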

AWS Multi-Region DR:

# Automated failover with Route53 health checks
aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "us-east-1",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "Z123",
        "DNSName": "us-load-balancer.aws.com",
        "EvaluateTargetHealth": true
      }
    }
  }, {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "eu-west-1",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "Z456",
        "DNSName": "eu-load-balancer.aws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'

# If US health check fails → DNS automatically points to EU

Multi-Region Cost Analysis

Infrastructure Costs:

Single Region (US-EAST-1):
- Compute: 50 instances × $200/month = $10,000
- Database: $5,000/month
- CDN: $2,000/month
- Total: $17,000/month

Multi-Region (US + EU + ASIA):
- Compute: 100 instances × $200/month = $20,000 (2x)
- Database: $12,000/month (replication + storage)
- CDN: $3,000/month (multi-region)
- Cross-region data transfer: $2,000/month
- Total: $37,000/month (2.2x cost)

Benefits:
- 50-70% latency reduction for international users
- 99.99% availability (vs 99.9% single region)
- Regulatory compliance (GDPR, data residency)

Conclusion: Scaling Journey Summary

Capacity Roadmap

Stage 1-2: Single Server → Separate DB (0-10K Users)

  • Cost: $100-600/month
  • Complexity: Low
  • Team: 1-3 engineers

Stage 3-4: Load Balancer → DB Replication (10K-100K Users)

  • Cost: $1K-4K/month
  • Complexity: Medium
  • Team: 3-10 engineers

Stage 5-6: Caching → CDN (100K-2M Users)

  • Cost: $3K-10K/month
  • Complexity: Medium-High
  • Team: 10-30 engineers

Stage 7-8: Sharding → Message Queues (2M-20M Users)

  • Cost: $10K-50K/month
  • Complexity: High
  • Team: 30-100 engineers

Stage 9-12: Microservices → Multi-Region (20M+ Users)

  • Cost: $50K-500K/month
  • Complexity: Very High
  • Team: 100-1000+ engineers

Key Takeaways

  1. Scale gradually - Don’t over-engineer early. Instagram scaled to 30M users with 3 engineers and PostgreSQL.

  2. Measure everything - You can’t optimize what you don’t measure. Start with basic monitoring from day one.

  3. Eliminate single points of failure - Load balancers, multiple databases, multi-AZ deployments.

  4. Cache aggressively - 80%+ of requests can be served from cache. Redis is your best friend.

  5. Design for failure - Services will fail. Plan for graceful degradation and quick recovery.

  6. Automate operations - Auto-scaling, automated backups, infrastructure as code. Manual processes don’t scale.

  7. Cost optimization - Scaling isn’t just about performance. A 10% efficiency gain at scale saves $50K/year.

  8. Team structure mirrors architecture - Monolith works for small teams. Microservices need organizational boundaries.

Remember: Every large-scale system started as a simple application. Focus on solving user problems first, then scale when necessary. Premature optimization is the root of all evil.