Design Twitch

Twitch is a live streaming platform that serves millions of concurrent viewers watching streamers broadcast gaming, creative content, and entertainment. The system must handle massive-scale live video ingestion, real-time transcoding, global content delivery, sub-second latency chat, and millions of concurrent viewers with diverse network conditions.

Designing Twitch presents unique challenges including ultra-low latency video delivery, real-time transcoding at massive scale, synchronized chat with video streams, and handling viral events that cause sudden traffic spikes.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For a live streaming platform, functional requirements describe what users and streamers can do, while non-functional requirements define system qualities like latency, scalability, and reliability.

Functional Requirements

Core Requirements:

  1. Streamers should be able to broadcast live video via RTMP or WebRTC.
  2. Viewers should be able to watch live streams with adaptive bitrate streaming based on their network conditions.
  3. Viewers should be able to participate in real-time chat synchronized with the video stream.
  4. The system should automatically record broadcasts as VODs (Video on Demand) for later viewing.

Below the Line (Out of Scope):

  • Users should be able to create short clips from live streams or VODs.
  • Streamers should be able to monetize through subscriptions, donations, and bits.
  • Users should be able to follow streamers and receive notifications when they go live.
  • The system should provide stream discovery through categories, games, and personalized recommendations.
  • Users should be able to rate streams and provide feedback.

Non-Functional Requirements

Core Requirements:

  • The system should prioritize low latency (2-5 seconds glass-to-glass latency for live streams).
  • The system should ensure chat messages are delivered in under 1 second.
  • The system should handle massive scale (10M+ concurrent viewers, 100K+ concurrent streams).
  • The system should provide adaptive bitrate streaming to handle varying network conditions.

Below the Line (Out of Scope):

  • The system should maintain 99.95% availability for streaming services.
  • The system should ensure secure content delivery and user authentication.
  • The system should have comprehensive monitoring and alerting for stream health.
  • The system should optimize costs through efficient CDN usage and storage tiering.

Clarification Questions & Assumptions:

  • Platform: Web, mobile apps (iOS/Android), and smart TV apps.
  • Scale: 10 million concurrent viewers, 100,000 concurrent streams at peak times.
  • Video Quality: Support up to 1080p60 (6000 kbps) for partner streamers.
  • Segment Duration: Transcoding segments every 2-6 seconds for HLS/DASH delivery.
  • Geographic Coverage: Global, with CDN edge locations in 200+ points of presence.
  • Chat Volume: Handle 1M+ messages per second during major events.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

For a live streaming platform, we’ll build the design sequentially through the functional requirements: first video ingestion, then transcoding, then delivery to viewers, and finally real-time chat. This logical flow mirrors the actual data path from streamer to viewer.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Streamer: Any user who broadcasts content on the platform. Includes personal information, streaming preferences, partner status, stream key for authentication, and monetization settings.

Viewer: Any user who watches streams. Contains viewing preferences, followed channels, subscription status, chat permissions, and viewing history.

Stream: An active live broadcast session. Records the streamer identity, current viewer count, stream metadata (title, game category, tags), bitrate information, health metrics, and start time.

Chat Message: Individual messages sent in stream chat rooms. Includes the sender identity, message content, timestamp, room/channel identifier, emotes used, and sender badges (subscriber, moderator, etc.).

VOD (Video on Demand): A recorded version of a past broadcast. Contains the original stream reference, video file location, duration, view count, thumbnail URL, and storage tier information.

Segment: Individual video chunks for HLS/DASH streaming. Includes quality level (1080p, 720p, etc.), sequence number, duration (typically 2-6 seconds), and file path.

API Design

Start Stream Endpoint: Used by streamers to initiate a broadcast session. The stream key in the request authenticates the streamer.

POST /streams/start -> Stream
Body: {
  streamKey: string,
  title: string,
  category: string,
  tags: string[]
}

Get Stream Manifest Endpoint: Used by viewers to retrieve the HLS master playlist containing all available quality variants.

GET /streams/:streamId/manifest.m3u8 -> Master Playlist

Send Chat Message Endpoint: Used by viewers to send messages to a stream’s chat room.

POST /chat/messages -> Success/Error
Body: {
  channelId: string,
  message: string
}

Note: The userId is present in the session cookie or JWT and not in the body. The system validates the user’s permissions to send messages in this channel.

Get VOD Endpoint: Allows viewers to access recorded broadcasts with seek capability.

GET /vods/:vodId -> VOD metadata and streaming URL

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Streamers should be able to broadcast live video via RTMP or WebRTC

The core components necessary to fulfill live video ingestion are:

  • Streamer Client: Broadcasting software like OBS (Open Broadcaster Software) or SLOBS. Encodes video using H.264 or HEVC codecs and sends to the platform via RTMP protocol.
  • Ingest Service: Regional edge servers that accept RTMP streams from broadcasters. Deployed in 20+ regions globally for low latency ingestion. Validates stream keys, monitors connection health, and forwards raw streams to the transcoding service.
  • Stream Database: Stores active stream metadata including streamer information, current status, viewer count, and health metrics. Uses PostgreSQL for ACID properties with read replicas for scaling.

Stream Ingestion Flow:

  1. The streamer configures OBS with the RTMP URL and their unique stream key, then starts broadcasting.
  2. DNS-based geographic routing directs the RTMP connection to the nearest regional ingest server.
  3. The Ingest Service validates the stream key against the Stream Database, checking if the user is authorized and not currently streaming from another location.
  4. Upon successful validation, the Ingest Service accepts the stream and begins forwarding the raw video data to the Transcoding Service.
  5. The Stream Database is updated with the new active stream, including start time, current status, and connection metadata.
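
The validation in steps 3-5 can be sketched in Python. This is a minimal in-memory sketch with hypothetical names; a real Ingest Service would query the Stream Database rather than hold state locally.

```python
import time

class IngestValidator:
    """In-memory sketch of stream-key validation (a real Ingest
    Service would back this with the Stream Database)."""

    def __init__(self, valid_keys):
        self.valid_keys = set(valid_keys)
        self.active = {}  # stream_key -> active stream metadata

    def start_stream(self, stream_key, region):
        if stream_key not in self.valid_keys:
            return None  # unknown key: reject the RTMP connection
        if stream_key in self.active:
            return None  # already live from another location
        stream = {"key": stream_key, "region": region,
                  "status": "live", "started_at": time.time()}
        self.active[stream_key] = stream
        return stream

    def end_stream(self, stream_key):
        self.active.pop(stream_key, None)
```

Note that this check-then-set is only safe within a single process; Deep Dive 5 covers making it atomic across servers.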

2. Viewers should be able to watch live streams with adaptive bitrate streaming

We extend our design to support video transcoding and delivery:

  • Transcoding Service: GPU-accelerated clusters that convert the single source stream into multiple quality variants (1080p60, 720p60, 480p, 360p, 160p, audio-only). Uses FFmpeg with NVENC hardware acceleration on NVIDIA T4, A10, or A100 GPUs. Outputs HLS segments (2-6 second chunks) and generates manifest playlists.
  • Origin Servers: Store the transcoded HLS segments and serve manifest files. Maintain a rolling 30-60 second window of live segments. Uses Nginx or S3 for segment storage.
  • Multi-CDN Layer: Global content delivery network with 200+ edge locations. Uses multiple providers (Cloudflare, Akamai, AWS CloudFront, Fastly) for redundancy and cost optimization.
  • Video Player Client: HTML5 video player with HLS.js or dash.js for adaptive bitrate streaming. Monitors bandwidth and automatically switches between quality levels.

Video Delivery Flow:

  1. The Transcoding Service receives the raw stream and encodes it into multiple bitrate variants simultaneously using GPU acceleration.
  2. Each variant is segmented into 2-6 second chunks and written to Origin Servers along with updated HLS playlists.
  3. When a viewer requests to watch a stream, their client fetches the master playlist from the Origin Server (via CDN).
  4. The master playlist lists all available quality variants with their bandwidth requirements and resolutions.
  5. The client selects an appropriate quality based on measured bandwidth and begins fetching video segments from the CDN.
  6. CDN edge servers cache popular segments, serving them directly to nearby viewers without hitting the origin.
  7. The player continuously monitors playback quality and network conditions, switching quality variants as needed for smooth playback.

3. Viewers should be able to participate in real-time chat synchronized with the video stream

We introduce new components to facilitate real-time chat:

  • Chat Service: WebSocket servers that maintain persistent connections with viewers. Each server handles approximately 100,000 concurrent WebSocket connections. Uses sticky sessions to ensure viewers stay connected to the same server.
  • Message Queue (Kafka): Distributed message broker that handles pub/sub for chat rooms. Each channel has a chat room, and messages are published to Kafka topics partitioned by channel ID.
  • Chat Database (Redis): In-memory data store for chat state including recent message history (last 24 hours), user presence, rate limiting counters, and moderation settings (slow mode, emote-only, sub-only).

Real-Time Chat Flow:

  1. When a viewer opens a stream, the client establishes a WebSocket connection to the Chat Service.
  2. The viewer authenticates with their session token and joins the channel’s chat room.
  3. The Chat Service subscribes to the relevant Kafka topic for this channel and maintains a set of WebSocket connections for this room.
  4. When a viewer sends a message, it’s sent via WebSocket to the Chat Service.
  5. The Chat Service validates the message (rate limiting, banned words, user permissions) using Redis.
  6. Valid messages are published to the Kafka topic for this channel, timestamped for synchronization.
  7. All Chat Service instances subscribed to this topic receive the message and fan it out to their local WebSocket connections.
  8. Chat messages are also stored in Redis with a 24-hour TTL for chat replay on VODs.
  9. The system maintains sub-second message delivery even for channels with 100,000+ concurrent viewers.

4. The system should automatically record broadcasts as VODs for later viewing

We add components for VOD generation and storage:

  • VOD Processing Service: Triggered when a stream ends. Concatenates the live HLS segments into complete MP4 files, generates thumbnails, extracts audio tracks, and creates preview clips. Runs as background jobs to avoid impacting live streaming performance.
  • VOD Storage (S3): Tiered object storage with lifecycle policies. Hot tier (first 30 days) uses S3 Standard for fast access. Warm tier (31-90 days) uses S3 Infrequent Access. Cold tier (91-365 days) uses S3 Glacier. Archives or deletes content based on channel settings after 365 days.
  • VOD Database: Stores VOD metadata including title, description, duration, view count, thumbnail URL, storage location, and storage tier. PostgreSQL with indexing on channel_id and recorded_at for efficient queries.

VOD Generation Flow:

  1. When a stream ends, the Ingest Service publishes a stream-ended event to Kafka.
  2. The VOD Processing Service consumes this event and begins the VOD generation workflow.
  3. It collects all HLS segments from the Origin Server for the completed stream.
  4. Using FFmpeg, it concatenates segments into a complete MP4 file with proper faststart encoding for web playback.
  5. Thumbnail images are generated by extracting frames every 10 seconds throughout the video.
  6. The complete VOD and thumbnails are uploaded to S3 in the hot tier.
  7. VOD metadata is inserted into the VOD Database, making it available for viewer discovery.
  8. The VOD can be streamed using the same HLS/DASH delivery infrastructure as live streams.
  9. Chat messages are linked to video timestamps, allowing synchronized chat replay when viewers watch VODs.
  10. After 30 days, S3 lifecycle policies automatically move VODs to cheaper storage tiers unless they’re frequently accessed.
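
Step 4's FFmpeg concatenation can be sketched by building the command line. The flags shown (`-f concat`, `-c copy`, `-movflags +faststart`) are real FFmpeg options; the paths and list-file approach are illustrative.

```python
def build_concat_command(segment_list_path, output_path):
    """Build an FFmpeg command that concatenates HLS segments
    (listed in a concat-demuxer file) into a web-playable MP4.
    '-movflags +faststart' moves the moov atom to the front so
    playback can begin before the full file downloads."""
    return [
        "ffmpeg",
        "-f", "concat",        # use the concat demuxer
        "-safe", "0",          # allow absolute paths in the list file
        "-i", segment_list_path,
        "-c", "copy",          # no re-encode: just remux the segments
        "-movflags", "+faststart",
        output_path,
    ]
```

Because the segments are already encoded, `-c copy` remuxes without touching video data, which keeps VOD generation cheap relative to live transcoding.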

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements and critical technical challenges. These deep dives separate production-ready systems from basic designs.

Deep Dive 1: How do we achieve low latency (2-5 seconds) for live streaming?

Live streaming latency is measured as “glass-to-glass” time - from when the streamer’s camera captures a frame to when a viewer sees it on their screen. Traditional HLS streaming has 10-30 second latency, which is unacceptable for interactive content. We need to optimize every step of the pipeline.

Latency Breakdown:

  • Stream encoding and upload to ingest: 500ms - 1s
  • Transcoding processing: 1-2s
  • Segment creation and origin storage: 500ms
  • CDN propagation and caching: 500ms - 1s
  • Client buffering and playback: 1-2s
  • Total: 3.5-6.5 seconds (target: 2-5 seconds)

Solution: Optimize Segment Duration and Pipeline

Segment Duration Trade-offs: The fundamental tension is between latency and quality. Shorter segments mean lower latency but more overhead and potential quality issues. We use 2-second segments for live streams as a compromise.

  • 6-second segments: Higher latency, but excellent compression efficiency and fewer CDN requests
  • 2-second segments: Lower latency with acceptable compression, at the cost of more frequent manifest updates
  • Sub-1-second segments: Lowest latency but poor compression efficiency and high CDN costs

Pipeline Optimizations:

Ingest Layer:

  • Use regional edge locations for ingestion to minimize upload latency
  • Accept RTMP connections with minimal buffering at the server side
  • Implement connection health monitoring to detect and alert on encoding issues quickly
  • Use UDP-based protocols for ultra-low latency scenarios where some packet loss is acceptable

Transcoding Layer:

  • Use GPU acceleration (NVENC on NVIDIA cards) for 10x faster encoding than CPU
  • Implement parallel encoding of multiple quality variants to avoid sequential processing delays
  • Use low-latency encoding presets that prioritize speed over compression efficiency
  • Configure keyframe intervals to match segment duration (2 seconds = keyframe every 120 frames at 60fps) for instant quality switching
  • Use constant bitrate (CBR) mode for predictable streaming performance

Origin and CDN Layer:

  • Write segments to origin storage immediately upon completion rather than batching
  • Use object storage with multi-region replication for origin redundancy
  • Configure CDN edge servers to cache segments with very short TTL (2-4 seconds) to balance freshness and caching efficiency
  • Implement manifest stitching at the edge to reduce round trips
  • Use HTTP/2 for multiplexed segment requests to reduce connection overhead

Client-Side Optimizations:

  • Start playback with minimal buffering (0.5-1 second buffer instead of 3-5 seconds)
  • Implement aggressive manifest refresh (every 1-2 seconds) to discover new segments quickly
  • Use chunked transfer encoding to begin decoding segments before they’re fully downloaded
  • Prefetch upcoming segments based on current playback position
  • Monitor buffer health and dynamically adjust between latency and stability
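
At its core, the bandwidth-driven quality switching above reduces to picking the highest variant that fits under measured throughput with some safety margin. A minimal sketch, where the 0.8 safety factor and variant structure are assumptions, not player-spec values:

```python
def pick_variant(variants, measured_bps, safety=0.8):
    """Pick the highest-bitrate variant that fits within a safety
    margin of measured bandwidth; fall back to the lowest variant."""
    affordable = [v for v in variants if v["bandwidth"] <= measured_bps * safety]
    if not affordable:
        return min(variants, key=lambda v: v["bandwidth"])
    return max(affordable, key=lambda v: v["bandwidth"])
```

Real players (HLS.js, dash.js) layer buffer-health signals on top of this throughput estimate, stepping down early when the buffer drains.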

WebRTC for Ultra-Low Latency: For scenarios requiring sub-second latency (e.g., game streaming with viewer interaction), RTMP/HLS can’t achieve the target. WebRTC provides 300-500ms latency through:

  • Peer-to-peer connections when possible
  • UDP-based transport with selective retransmission
  • No segment-based delivery - continuous stream
  • Trade-off: More complex infrastructure, limited scalability, requires browser support

We keep RTMP/HLS as the default for scalability and compatibility, offering WebRTC as a premium “low latency mode” for streamers who need it.

Deep Dive 2: How do we scale transcoding to handle 100,000+ concurrent streams?

Transcoding is the most computationally expensive operation in the system. Each stream must be converted into 6-8 quality variants in real-time, which means processing hundreds of thousands of video frames per second across all streams.

Problem: Transcoding Resource Requirements

For a single 1080p60 stream transcoded to 6 variants:

  • Processing: Approximately 0.1-0.2 GPU (T4 equivalent)
  • Each T4 GPU can handle 10-20 concurrent 1080p60 transcodes
  • For 100,000 streams: 5,000-10,000 GPUs required
  • Cost: T4 instances cost roughly $0.35/hour, so 10,000 GPUs = $3,500/hour = $2.5M/month just for compute
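
As a quick check on the arithmetic above (prices are the rough figures from this section, not vendor quotes):

```python
streams = 100_000
transcodes_per_gpu = 10          # conservative end of the 10-20 range
gpus = streams / transcodes_per_gpu
hourly = gpus * 0.35             # ~$0.35/hour per T4 instance (assumed)
monthly = hourly * 24 * 30
print(f"{gpus:.0f} GPUs, ${hourly:,.0f}/hour, ${monthly / 1e6:.2f}M/month")
```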

Solution: Dynamic Transcoding and Resource Optimization

1. Quality Variant Prioritization: Not all streams need all quality variants immediately. We implement dynamic transcoding based on viewer demand:

  • Source Pass-Through: The original quality is always available immediately without transcoding
  • On-Demand Variants: For new or small streams (< 50 viewers), only generate 720p and 480p variants
  • Full Variants: Once a stream reaches 100+ viewers, scale up to all 6-8 quality variants
  • Viewer-Based Scaling: Monitor which quality variants are actually being watched and only maintain those with active viewers

This reduces transcoding load by 40-60% during normal operations while ensuring large streams have full quality options.

2. Transcoding Cluster Architecture:

Job Scheduling:

  • Use Kubernetes for orchestration with GPU support and node pools
  • Implement a bin-packing algorithm to maximize GPU utilization per node
  • Each transcoding pod is assigned a single stream and requests GPU resources
  • Use pod priority classes to ensure large/partner streams get resources first during capacity constraints
  • Implement graceful shutdown - when a pod is terminated, it continues encoding for 30 seconds while a replacement is started elsewhere

Auto-Scaling:

  • Monitor pending transcoding queue depth in Kafka
  • Scale GPU node pools up when queue depth exceeds threshold (e.g., 100 pending jobs)
  • Scale down during off-peak hours (typically 4am-8am in each region)
  • Use spot instances for 60-70% of capacity to reduce costs, with on-demand instances for baseline
  • Implement multi-region fallback - if one region is at capacity, route new streams to another region

Hardware Optimization:

  • Use NVIDIA T4 GPUs for most streams (good price/performance)
  • Use A10 or A100 GPUs for 4K or high-bitrate streams that need more processing power
  • Separate H.264 and HEVC encoding workloads as they have different resource profiles
  • Use CPU-based encoding as fallback when GPU capacity is exhausted (accepts higher latency)

3. Encoding Settings Optimization:

Preset Selection:

  • Use NVENC P4 preset for balance between quality and speed (P1 = fastest but lower quality, P7 = slowest but best quality)
  • Configure rate control for CBR (constant bitrate) to ensure predictable bandwidth usage
  • Set keyframe interval to 2 seconds to enable seamless quality switching
  • Disable B-frames for lower latency at the cost of slightly worse compression

Bitrate Ladder Optimization: Different content types need different bitrate ladders. A static “talking head” stream can use lower bitrates than a fast-motion gaming stream while maintaining the same perceived quality:

  • Standard ladder: 1080p60@6000kbps, 720p60@4500kbps, 720p30@3000kbps, 480p@1500kbps, 360p@900kbps
  • Low-motion optimized: Reduce bitrates by 30% for minimal visual degradation
  • High-motion optimized: Increase bitrates by 20% to maintain quality during action scenes

We use machine learning to analyze stream content complexity and dynamically adjust the bitrate ladder.

4. Segment Storage and Cleanup:

Origin Storage:

  • Write segments to origin storage with 60-second expiration for live streams
  • Use in-memory storage (Redis or Memcached) for active segments to reduce I/O latency
  • Implement background cleanup jobs to delete expired segments
  • Separately store segments for VOD generation with longer retention

Cost Calculation: With these optimizations, transcoding costs can be reduced from $2.5M/month to approximately $1M-1.5M/month while maintaining quality and handling peak loads.

Deep Dive 3: How do we handle chat at scale with 1M+ messages per second?

Real-time chat is a critical feature that needs to scale to support channels with 100,000+ concurrent viewers while maintaining sub-second message delivery.

Problem: Fan-Out at Scale

When a popular streamer with 100,000 concurrent viewers receives a chat message, the system must:

  • Receive and validate the message from the sender
  • Fan out to 100,000 WebSocket connections
  • Deliver within 1 second for a responsive experience

If done naively (each message triggers 100,000 individual sends), the system would quickly become overloaded.

Solution: Distributed Chat Architecture with Message Queuing

1. WebSocket Connection Management:

Connection Pooling:

  • Deploy multiple Chat Service instances, each handling approximately 100,000 WebSocket connections
  • Use Layer 4 load balancing with consistent hashing on channel_id to ensure viewers for the same channel connect to the same server (or subset of servers)
  • Each Chat Service instance maintains an in-memory map of room_id to set of WebSocket connections
  • Memory usage: approximately 10KB per connection, so 100,000 connections = 1GB RAM
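
The channel-affinity routing above can be sketched with rendezvous (highest-random-weight) hashing, one simple way to get consistent-hashing behavior; the server names are hypothetical.

```python
import hashlib

def route_channel(channel_id, servers):
    """Rendezvous hashing: every viewer of a given channel is routed
    to the same Chat Service instance, and removing a server only
    remaps the channels that were on it."""
    def weight(server):
        digest = hashlib.sha256(f"{channel_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=weight)
```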

Connection Lifecycle:

  • When a viewer opens a stream, the client establishes a WebSocket connection to the Chat Service
  • The connection authenticates using a JWT token and requests to join a specific channel’s room
  • The Chat Service adds this WebSocket to its in-memory room participant set
  • Heartbeat messages every 30 seconds maintain the connection and detect disconnects
  • On disconnect, the WebSocket is removed from the room participant set

2. Kafka-Based Message Distribution:

Why Kafka: Traditional pub/sub systems struggle with the fan-out requirements. Kafka provides:

  • Durable message storage (24-hour retention for chat replay)
  • High throughput (millions of messages per second)
  • Horizontal scalability through partitioning
  • Exactly-once semantics to prevent duplicate messages

Topic Structure:

  • Topic: chat.messages
  • Partitions: 100 (partitioned by channel_id hash for even distribution)
  • Replication factor: 3 (for durability and availability)
  • Retention: 24 hours (for VOD chat replay)

Message Flow:

  • When a viewer sends a chat message, it arrives via WebSocket at the Chat Service
  • The Chat Service validates the message (rate limiting, profanity filter, user permissions)
  • The validated message is published to the Kafka topic chat.messages with the channel_id as the key
  • All Chat Service instances are consumers in a consumer group, each consuming from a subset of partitions
  • When a Chat Service receives a message from Kafka, it looks up the channel_id in its local room map
  • If it has WebSocket connections for this channel, it fans out the message to all of them
  • If not, it discards the message (other Chat Service instances will handle it)

This architecture ensures each message is processed by exactly one Chat Service instance per room, avoiding duplicate sends.
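
The partition-keyed fan-out above can be sketched without a real Kafka cluster; the partition count and room map mirror the description, everything else is a stand-in.

```python
from collections import defaultdict

NUM_PARTITIONS = 100

def partition_for(channel_id: str) -> int:
    """Keying by channel_id puts all of a channel's messages on one
    partition, preserving per-channel ordering."""
    return sum(channel_id.encode()) % NUM_PARTITIONS  # stand-in for Kafka's key hash

class ChatInstance:
    def __init__(self):
        self.rooms = defaultdict(list)  # channel_id -> local websocket stubs

    def join(self, channel_id, ws):
        self.rooms[channel_id].append(ws)

    def on_kafka_message(self, channel_id, message):
        # Fan out only to locally connected viewers; other instances
        # handle their own connections for the same channel.
        for ws in self.rooms.get(channel_id, []):
            ws.append(message)  # stand-in for ws.send()
```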

3. Redis for Chat State:

Rate Limiting: To prevent spam, each user is limited to 20 messages per 30 seconds (configurable by channel). We use Redis INCR with TTL:

  • Key: ratelimit:user:{userId}:{channelId}
  • On message: INCR the key and set TTL to 30 seconds if it’s the first message in the window
  • If count exceeds 20, reject the message
  • This provides atomic rate limiting across all Chat Service instances
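
The INCR-with-TTL pattern above, sketched against a tiny in-memory stand-in for Redis so the window semantics are visible (a real deployment would issue `INCR`/`EXPIRE` against a shared cluster):

```python
import time

class FakeRedis:
    """Tiny stand-in for Redis INCR + EXPIRE semantics."""
    def __init__(self):
        self.store = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl):
        count, expires_at = self.store.get(key, (0, 0.0))
        if time.monotonic() >= expires_at:           # window expired: start fresh
            count, expires_at = 0, time.monotonic() + ttl
        count += 1
        self.store[key] = (count, expires_at)
        return count

def allow_message(redis, user_id, channel_id, limit=20, window=30):
    key = f"ratelimit:user:{user_id}:{channel_id}"
    return redis.incr_with_ttl(key, window) <= limit
```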

Chat History: Store the last 100 messages per channel for late joiners:

  • Key: chat:history:{channelId} (Redis sorted set)
  • Score: message timestamp
  • Value: serialized message JSON
  • TTL: 24 hours
  • When a viewer joins, they receive the last 100 messages from Redis before receiving real-time messages

Moderation State:

  • Key: moderation:{channelId} (Redis hash)
  • Fields: slow_mode_seconds, emote_only, sub_only, banned_users, timed_out_users
  • Checked before accepting messages to enforce moderation rules
  • Updated via moderator commands

User Presence: Track which users are active in each channel:

  • Key: presence:{channelId} (Redis set)
  • Members: user IDs
  • Used for displaying viewer count and online status
  • Cleaned up periodically to remove stale entries

4. Handling Viral Moments:

When a major event happens (e.g., a game tournament final), message rates can spike 10-100x:

Scaling Strategies:

  • Kafka partitions distribute load across consumers
  • Add more Chat Service instances to handle increased WebSocket connections
  • Implement message sampling for extremely large channels (e.g., only show 100 messages/second even if 1000 are sent) with a notice to viewers
  • Use Redis Cluster for distributed rate limiting and state management
  • Implement priority tiers - subscribers and moderators always get through, while free users may be rate limited more aggressively

Circuit Breaker: If Kafka or Redis becomes unavailable, implement graceful degradation:

  • Continue accepting messages but don’t persist them
  • Fan out to local WebSocket connections only
  • Display a warning to viewers that chat may be unreliable
  • Automatically recover when dependencies become healthy again

Deep Dive 4: How do we optimize CDN costs while maintaining performance?

CDN bandwidth is one of the largest operational costs for a video streaming platform. With 10 million concurrent viewers watching at an average bitrate of 3.5 Mbps, the egress bandwidth is 35 Tbps (terabits per second), which can cost millions of dollars per month.

Problem: CDN Egress Costs

Typical CDN pricing:

  • $0.02 - $0.08 per GB depending on volume and provider
  • 35 Tbps = 4.375 TB/second = 15,750 TB/hour = 378,000 TB/day
  • At $0.05/GB: 378,000 TB/day × 30 days × 1000 GB/TB × $0.05 = $567M/month

Obviously, this is unsustainable. Real-world costs are lower due to volume discounts, but CDN optimization is still critical.
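
The back-of-envelope above as a check, using the $0.05/GB list-price assumption from the text:

```python
tbps = 35                        # aggregate egress in terabits per second
tb_per_sec = tbps / 8            # bits -> bytes
tb_per_day = tb_per_sec * 3600 * 24
monthly_usd = tb_per_day * 30 * 1000 * 0.05   # 1000 GB/TB at $0.05/GB
print(f"{tb_per_day:,.0f} TB/day, ${monthly_usd / 1e6:.0f}M/month")
```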

Solution: Multi-CDN Strategy with Intelligent Caching

1. Multi-CDN Provider Strategy:

Why Multiple CDNs:

  • Cost optimization: Use the cheapest CDN for each region
  • Performance: Different CDNs perform better in different geographies
  • Redundancy: If one CDN has issues, automatically failover to another
  • Negotiation leverage: Play CDN providers against each other for better pricing

CDN Selection:

  • Primary CDNs: Cloudflare (good global coverage), Akamai (premium quality), AWS CloudFront (integrated with AWS services), Fastly (real-time purging)
  • Each viewer’s request goes through a CDN selector service
  • Selection based on: viewer location, CDN health scores, current CDN load, cost per region
  • Example: Use Cloudflare in North America, Akamai in Europe, CloudFront in Asia-Pacific

2. Cache Hit Ratio Optimization:

The cache hit ratio is the percentage of requests served from edge cache vs. origin. Our target is 95%+.

Segment Caching:

  • HLS segments are cached at CDN edge with a TTL equal to the segment duration (2-6 seconds)
  • For popular streams, the first viewer in a region pulls from origin, subsequent viewers hit the edge cache
  • Cache-Control header: public, max-age=2, s-maxage=2
  • Example: A stream with 10,000 viewers in a single CDN edge location results in 1 origin request per segment, 9,999 cache hits (99.99% hit ratio)

Manifest Caching:

  • Master playlists and variant playlists update frequently (every 2 seconds with new segments)
  • Short TTL (2-4 seconds) balances freshness with caching benefits
  • Manifest requests are smaller (1-10KB) compared to segments (1-10MB), so lower hit ratio is acceptable

VOD Caching:

  • VOD content is static, so we can use much longer cache TTLs (1 hour to 1 day)
  • Popular VODs achieve 99%+ cache hit ratios
  • Use immutable cache headers for segments that won’t change

3. Viewer Aggregation Effect:

The economics improve dramatically with scale:

  • 1 viewer on a stream: 100% of traffic hits origin (0% cache hit)
  • 10 viewers in same region: 10% origin, 90% edge (90% cache hit)
  • 100 viewers in same region: 1% origin, 99% edge (99% cache hit)
  • 1,000 viewers in same region: 0.1% origin, 99.9% edge (99.9% cache hit)

This is why large streams are actually more cost-efficient per viewer than small streams.
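
The aggregation effect in numbers: origin requests per segment stay constant at one per region, while edge hits grow with the audience.

```python
def cache_hit_ratio(viewers_in_region):
    """One origin fetch per segment per region; every other request
    for that segment is an edge cache hit."""
    origin_requests = 1
    edge_hits = viewers_in_region - origin_requests
    return edge_hits / viewers_in_region

for n in (1, 10, 100, 1000):
    print(n, f"{cache_hit_ratio(n):.1%}")
```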

4. Smart Routing and Edge Computing:

Geographic Load Balancing:

  • Route viewers to the nearest CDN edge location using DNS-based geolocation
  • Reduces latency and improves user experience
  • Concentrates viewers by region to maximize cache hit ratios

Edge Computing:

  • Use CDN edge workers (Cloudflare Workers, Lambda@Edge) for request processing
  • Handle manifest manipulation at the edge without hitting origin
  • Implement A/B testing, personalization, and ad insertion at the edge
  • Reduce origin load and improve response times

5. Bandwidth Commitment and Negotiation:

Volume Discounts:

  • Commit to minimum monthly bandwidth (e.g., 1 PB/month) for significant discounts (50-80% off list price)
  • Sign multi-year contracts for better rates
  • Use egress analytics to forecast growth and negotiate proactively

Cost Analysis: With optimizations:

  • 95% cache hit ratio reduces origin load by 20x
  • Volume discounts reduce per-GB cost from $0.05 to $0.01-0.02
  • Multi-CDN strategy saves additional 20-30% through regional optimization
  • Final cost: approximately $50-100M/month (still expensive but sustainable with revenue)

6. Peer-to-Peer (P2P) Delivery:

For future optimization, implement WebRTC-based P2P mesh networks:

  • Viewers with good upload bandwidth become “super peers” that relay segments to nearby viewers
  • Reduces CDN load by 30-50% for popular streams
  • Trade-off: Increased complexity, NAT traversal challenges, potential quality/latency issues
  • Works best as a hybrid model (P2P + CDN fallback)

Deep Dive 5: How do we ensure strong consistency in concurrent operations while maintaining high availability?

Live streaming involves many concurrent operations that require strong consistency: preventing duplicate stream keys, ensuring VOD generation doesn’t start before a stream fully ends, and coordinating transcoding job assignment.

Problem: Distributed State Consistency

Race Condition Examples:

  • Two streams starting with the same stream key simultaneously
  • VOD generation starting while segments are still being written
  • Multiple transcoding workers picking up the same stream job
  • Stream metadata updates from multiple sources (viewer count, health metrics)

Without proper coordination, these race conditions can lead to:

  • Duplicate streams broadcasting simultaneously
  • Corrupted VOD files
  • Wasted transcoding resources
  • Inconsistent viewer counts

Solution: Distributed Locking and Transaction Patterns

1. Stream Key Uniqueness:

When a streamer starts broadcasting, we must ensure atomicity:

  • Check if stream key is valid
  • Check if another stream is already using this key
  • Create a new active stream record
  • All must happen atomically

Using Database Transactions:

  • PostgreSQL with serializable isolation level ensures atomic check-and-set
  • Define a partial unique index on the active_streams table: UNIQUE (stream_key) WHERE status = 'live'
  • This database-level constraint prevents duplicate streams even under high concurrency
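A minimal sketch of this partial unique index, using SQLite as a stand-in for PostgreSQL (the table shape and names are illustrative; the PostgreSQL DDL is shown in the comment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE active_streams (
        id         INTEGER PRIMARY KEY,
        stream_key TEXT NOT NULL,
        status     TEXT NOT NULL  -- 'live' or 'ended'
    )
""")
# Partial unique index: at most one 'live' row per stream key.
# PostgreSQL equivalent:
#   CREATE UNIQUE INDEX one_live_stream_per_key
#   ON active_streams (stream_key) WHERE status = 'live';
conn.execute("""
    CREATE UNIQUE INDEX one_live_stream_per_key
    ON active_streams (stream_key) WHERE status = 'live'
""")

conn.execute("INSERT INTO active_streams (stream_key, status) VALUES ('abc', 'live')")
try:
    # A second concurrent 'live' stream with the same key is rejected by the DB
    # itself, regardless of how many application servers race on the insert.
    conn.execute("INSERT INTO active_streams (stream_key, status) VALUES ('abc', 'live')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# An ended stream with the same key is fine — the index only covers 'live' rows.
conn.execute("INSERT INTO active_streams (stream_key, status) VALUES ('abc', 'ended')")
```

Because the uniqueness check happens inside the database, no application-level locking is needed for this particular race.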

Using Redis Distributed Lock: For operations that span multiple services, use Redis:

  • Acquire lock: SET lock:stream:{streamKey} {nodeId} EX 300 NX
  • The NX flag ensures atomic acquire (only succeeds if key doesn’t exist)
  • EX 300 sets a 5-minute expiration (TTL) in case the holder crashes
  • Only the lock holder can proceed with stream initialization
  • Release lock after stream is fully initialized
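The acquire/release pattern above can be sketched as follows. To keep the example self-contained, the Redis commands are modeled by a tiny in-memory stub; in production you would use a real client such as redis-py, and the release would be a Lua compare-and-delete script so only the lock holder can release:

```python
import time
import uuid

class FakeRedis:
    """Minimal in-memory stand-in for the two Redis operations the lock needs."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ex=None, nx=False):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic() and nx:
            return None  # key exists and is unexpired: NX fails
        self._data[key] = (value, time.monotonic() + (ex or float("inf")))
        return True

    def compare_and_delete(self, key, value):
        """Release only if we still own the lock (a Lua script in real Redis)."""
        entry = self._data.get(key)
        if entry and entry[0] == value and entry[1] > time.monotonic():
            del self._data[key]
            return True
        return False

def acquire_stream_lock(r, stream_key, ttl=300):
    token = str(uuid.uuid4())  # unique per holder, so only we can release
    if r.set(f"lock:stream:{stream_key}", token, ex=ttl, nx=True):
        return token
    return None

r = FakeRedis()
t1 = acquire_stream_lock(r, "abc123")  # first node wins the lock
t2 = acquire_stream_lock(r, "abc123")  # second node is rejected (NX fails)
released = r.compare_and_delete("lock:stream:abc123", t1)
```

The random token matters: a naive DEL on release could delete a lock that has already expired and been re-acquired by another node.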

2. VOD Generation Coordination:

When a stream ends, we need to ensure:

  • All segments are fully written to origin storage
  • No more segments will be written
  • VOD generation starts exactly once
  • Chat messages are fully persisted

Using Kafka Streams:

  • Stream-ended event is published to Kafka
  • VOD Processing Service consumes with exactly-once semantics
  • Use Kafka’s transactional API to ensure each stream-ended event is processed exactly once
  • Consumer offset is only committed after VOD generation completes
  • If processing fails, the event remains in Kafka for retry
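The exactly-once effect can also be approximated on the consumer side with an idempotency guard — a simplified in-process sketch, not the actual Kafka transactional API (the event shape and the processed-ID store are assumptions; in production the store would be durable):

```python
def handle_stream_ended(event, processed_ids, start_vod):
    """Process a stream-ended event at most once, even if Kafka redelivers it."""
    if event["stream_id"] in processed_ids:
        return False  # duplicate delivery: VOD generation already ran
    start_vod(event["stream_id"])
    processed_ids.add(event["stream_id"])  # a durable store in production
    return True

processed, started = set(), []
handle_stream_ended({"stream_id": "s1"}, processed, started.append)
handle_stream_ended({"stream_id": "s1"}, processed, started.append)  # redelivery
# started contains "s1" exactly once: the second delivery is a no-op
```

Kafka guarantees at-least-once delivery by default; combining redelivery with an idempotent handler yields exactly-once effects.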

Using Workflow Orchestration:

  • Use a workflow engine like Temporal or AWS Step Functions
  • Define VOD generation as a durable workflow with state checkpointing
  • Each step (collect segments, concatenate, generate thumbnails, upload to S3) is idempotent
  • If the workflow crashes, it resumes from the last checkpoint
  • Retries are automatic with exponential backoff
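The checkpoint-and-resume idea can be sketched without any real workflow engine (step names follow the list above; the checkpoint store and crash injection are illustrative):

```python
# Each step is idempotent; the checkpoint store records completed steps,
# so a crashed workflow resumes where it left off instead of starting over.
STEPS = ["collect_segments", "concatenate", "generate_thumbnails", "upload_to_s3"]

def run_vod_workflow(stream_id, checkpoints, actions, fail_after=None):
    """Run steps in order, skipping ones already checkpointed for this stream."""
    done = checkpoints.setdefault(stream_id, set())
    for step in STEPS:
        if step in done:
            continue  # idempotent skip on resume
        if fail_after is not None and len(done) >= fail_after:
            raise RuntimeError(f"crash before {step}")
        actions.append(step)  # stand-in for the real side effect
        done.add(step)        # durable checkpoint in a real engine
    return "complete"

checkpoints, actions = {}, []
try:
    run_vod_workflow("s1", checkpoints, actions, fail_after=2)  # crash mid-way
except RuntimeError:
    pass
run_vod_workflow("s1", checkpoints, actions)  # resume: first two steps skipped
# actions lists each step exactly once despite the crash
```

Engines like Temporal persist this "done" set automatically as workflow history; the sketch only shows why idempotent steps plus checkpoints make crashes safe.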

3. Transcoding Job Assignment:

With hundreds of transcoding workers pulling jobs from a queue:

  • Use Kafka consumer groups for automatic job distribution
  • Each stream is published to a partition based on stream_id hash
  • Consumer group ensures each partition is assigned to exactly one consumer
  • If a consumer crashes, Kafka automatically rebalances and assigns its partitions to healthy consumers
  • This provides both distribution and fault tolerance

Alternative: Job Queue with Visibility Timeout:

  • Use AWS SQS with visibility timeout
  • When a worker pulls a job, it becomes invisible to other workers for a timeout period (e.g., 5 minutes)
  • Worker processes the transcoding job
  • If successful, worker deletes the message from queue
  • If worker crashes, message becomes visible again after timeout for retry by another worker
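The visibility-timeout behavior can be modeled in a few lines — this is an in-process toy model of the SQS semantics, not the boto3 API:

```python
import time

class VisibilityQueue:
    """Toy model of an SQS-style queue with a per-message visibility timeout."""
    def __init__(self, timeout=300):
        self.timeout = timeout
        self.messages = {}  # msg_id -> (body, invisible_until)
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self.messages[self._next_id] = (body, 0.0)
        return self._next_id

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, (body, invisible_until) in self.messages.items():
            if invisible_until <= now:
                # Hide from other workers until the timeout elapses.
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)  # worker succeeded: remove for good

q = VisibilityQueue(timeout=300)
q.send("transcode stream 42")
first = q.receive(now=0.0)          # worker A picks up the job
hidden = q.receive(now=10.0)        # worker B sees nothing while A holds it
redelivered = q.receive(now=301.0)  # A never deleted it: job reappears
```

A worker that finishes calls delete(); a worker that crashes simply lets the timeout expire, and the job is retried elsewhere.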

4. Consistent Viewer Count:

Viewer count must be accurate for analytics and stream ranking:

Eventually Consistent Approach:

  • Each CDN edge location tracks viewer connections
  • Periodically (every 5-10 seconds) sends viewer count delta to central aggregation service
  • Aggregation service sums all deltas and updates the total count
  • Accept that viewer count may lag by 5-10 seconds

Stream Processing Approach:

  • Player heartbeat events are sent to Kafka
  • Use Kafka Streams with windowed aggregation (tumbling window of 10 seconds)
  • Count distinct viewer_ids per stream_id within each window
  • Result is published to Redis for real-time access
  • Provides accurate, eventually consistent viewer counts with minimal delay
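A small stand-in for the Kafka Streams aggregation: bucket heartbeats into 10-second tumbling windows and count distinct viewers per stream (the event tuple shape is an assumption):

```python
from collections import defaultdict

WINDOW = 10  # seconds, tumbling

def viewer_counts(heartbeats):
    """heartbeats: iterable of (timestamp, stream_id, viewer_id) tuples.
    Returns {(window_start, stream_id): distinct_viewer_count}."""
    windows = defaultdict(set)
    for ts, stream_id, viewer_id in heartbeats:
        window_start = (ts // WINDOW) * WINDOW  # align to tumbling boundary
        windows[(window_start, stream_id)].add(viewer_id)
    return {key: len(viewers) for key, viewers in windows.items()}

events = [
    (0, "s1", "v1"), (3, "s1", "v2"), (7, "s1", "v1"),  # v1 heartbeats twice
    (12, "s1", "v1"),                                    # next window
]
counts = viewer_counts(events)
# counts == {(0, "s1"): 2, (10, "s1"): 1} — duplicate heartbeats deduplicated
```

The set-per-window gives exact distinct counts; at Twitch scale a sketch structure like HyperLogLog would typically replace the set to bound memory.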

5. Database Replication and Consistency:

Master-Replica Pattern:

  • Write operations (create stream, update status) go to master database
  • Read operations (fetch stream metadata, get VOD list) go to read replicas
  • Replication lag is typically under 100ms for streaming replication
  • Accept eventual consistency for reads (viewer sees status update within 100ms)

For Strong Consistency:

  • Critical reads (authentication, payment processing) must read from master
  • Use read-after-write consistency: route a client’s reads to the master for a short window after that client writes, so it always sees its own updates
  • For PostgreSQL, use synchronous replication to ensure replicas are fully caught up before confirming writes

Deep Dive 6: How do we handle stream health monitoring and automatic recovery?

Live streaming is time-sensitive: if a stream goes down or degrades, it immediately impacts viewer experience. We need real-time monitoring and automatic remediation.

Problem: Stream Failures and Degradation

Common Issues:

  • Streamer’s internet connection drops
  • Ingest server crashes or becomes unresponsive
  • Transcoding worker fails mid-stream
  • CDN edge location has degraded performance
  • Network congestion causes packet loss and buffering

Without monitoring and recovery, these issues result in stream downtime, frustrated viewers, and lost revenue.

Solution: Multi-Layer Monitoring with Automatic Failover

1. Ingest Layer Monitoring:

Connection Health Metrics:

  • Track dropped frames reported by the streamer’s encoder (e.g., OBS)
  • Monitor bitrate stability (significant fluctuations indicate network issues)
  • Measure ingest server CPU and network utilization
  • Track RTMP connection errors and reconnection attempts

Automatic Recovery:

  • If ingest server CPU exceeds 90% for 30 seconds, stop accepting new streams and redirect to other servers
  • If a streamer’s connection drops, the ingest server waits 60 seconds for reconnection before marking the stream as ended
  • Upon reconnection, resume the stream seamlessly without creating a new stream ID
  • Notify streamer via dashboard if persistent connection issues are detected

2. Transcoding Layer Monitoring:

Job Health Metrics:

  • Track transcoding latency (time from receiving raw frame to outputting transcoded segment)
  • Monitor GPU utilization and temperature
  • Detect hung jobs (no segment output for 30+ seconds)
  • Track segment discontinuities and missing segments

Automatic Recovery:

  • If a transcoding job is detected as hung, kill the pod and immediately start a replacement
  • The replacement worker picks up from the last successfully transcoded segment
  • Viewers experience a brief freeze (5-10 seconds) but the stream continues
  • For critical streams (partners, large viewership), maintain a hot standby transcoder that takes over instantly
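The hung-job detection above reduces to comparing each stream's last segment timestamp against a threshold (a minimal sketch; the restart action and data shapes are illustrative):

```python
HUNG_THRESHOLD = 30  # seconds without a new transcoded segment

def find_hung_jobs(last_segment_at, now):
    """last_segment_at: {stream_id: timestamp of last transcoded segment}.
    Returns stream_ids whose workers should be killed and replaced."""
    return [sid for sid, ts in last_segment_at.items()
            if now - ts > HUNG_THRESHOLD]

# s1 emitted a segment 20s ago (healthy); s2 has been silent for 60s (hung).
hung = find_hung_jobs({"s1": 100.0, "s2": 60.0}, now=120.0)
```

In practice a monitoring loop would run this check every few seconds against metrics reported by the transcoding workers, then ask the orchestrator (e.g., Kubernetes) to replace the hung pods.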

3. CDN Layer Monitoring:

Edge Health Metrics:

  • Track response times from each CDN edge location
  • Monitor error rates (4xx, 5xx status codes)
  • Measure cache hit ratios
  • Track viewer-reported buffering events

Automatic Failover:

  • If a CDN edge location has elevated error rates or response times, mark it as degraded
  • DNS-based routing directs new viewers to alternative edge locations
  • Existing viewers’ players automatically retry failed segment requests
  • HLS players have built-in segment retry logic (typically 3 retries with exponential backoff)
  • If retries fail, player falls back to alternative CDN or lower quality variant
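The retry-then-fallback behavior can be sketched like this — the fetch function is injected so the logic is testable without a network, and the URLs and retry counts are illustrative:

```python
import time

def fetch_segment(url, fetch, fallback_urls=(), retries=3,
                  base_delay=0.5, sleep=time.sleep):
    """Try a segment URL with exponential backoff, then fall back to other CDNs."""
    for candidate in (url, *fallback_urls):
        for attempt in range(retries):
            try:
                return fetch(candidate)
            except IOError:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise IOError("all CDNs exhausted for segment")

# Simulate: primary CDN edge always fails, backup succeeds on first try.
calls = []
def flaky_fetch(u):
    calls.append(u)
    if u == "https://cdn-a/seg1.ts":
        raise IOError("edge degraded")
    return b"segment-bytes"

data = fetch_segment("https://cdn-a/seg1.ts", flaky_fetch,
                     fallback_urls=("https://cdn-b/seg1.ts",),
                     sleep=lambda s: None)
```

Real HLS players implement roughly this loop internally; the point is that the primary edge gets a bounded number of attempts before the player switches away.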

4. Stream Quality Monitoring:

Viewer Experience Metrics:

  • Track video startup time (time from request to first frame)
  • Monitor buffering ratio (% of viewing time spent buffering)
  • Measure quality switches (up/down) frequency
  • Track viewer drop-off rate (viewers leaving within first 30 seconds often indicates quality issues)

Adaptive Quality Adjustments:

  • If many viewers are buffering, automatically reduce the source stream bitrate recommendation (notify streamer)
  • If a specific quality variant has high error rates, temporarily disable it
  • Use viewer feedback to identify and isolate problematic CDN edge locations

5. Alert System:

Severity Levels:

  • P0 (Critical): Stream completely down, affects 1000+ viewers or partner streamer
  • P1 (High): Significant degradation, high buffering rate, affects 100+ viewers
  • P2 (Medium): Minor issues, single component degraded but no viewer impact
  • P3 (Low): Informational, metrics outside normal range but no action needed

Alert Routing:

  • P0: Page on-call engineer immediately via PagerDuty, create incident channel
  • P1: Send alert to Slack channel, assign to on-call engineer
  • P2: Create JIRA ticket for follow-up during business hours
  • P3: Log for analytics and trend analysis
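The severity table and routing rules above can be captured as a small classification function (thresholds follow the definitions above; the input flags and routing strings are illustrative):

```python
def classify_alert(stream_down, affected_viewers, is_partner, component_degraded):
    """Map an incident to a severity level per the table above."""
    if stream_down and (affected_viewers >= 1000 or is_partner):
        return "P0"  # complete outage at scale or for a partner
    if affected_viewers >= 100:
        return "P1"  # significant viewer-facing degradation
    if component_degraded:
        return "P2"  # component degraded, no viewer impact yet
    return "P3"      # informational only

ROUTING = {
    "P0": "page on-call via PagerDuty, create incident channel",
    "P1": "Slack alert, assign to on-call engineer",
    "P2": "JIRA ticket for business hours",
    "P3": "log for analytics and trend analysis",
}
```

Keeping the classification in one pure function makes the escalation policy easy to test and audit.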

Automated Remediation:

  • P0 alerts trigger automatic failover (switch to backup ingest/transcoding/CDN)
  • P1 alerts trigger partial failover (reduce load on affected component)
  • P2 alerts log and wait for pattern confirmation before action
  • Human intervention is only needed if automatic recovery fails

6. Post-Incident Analysis:

After any incident:

  • Automatically collect logs from all involved services
  • Generate timeline of events from distributed traces
  • Calculate viewer impact (number affected, duration)
  • Create blameless postmortem document
  • Identify action items to prevent recurrence
  • Update runbooks and alerts based on learnings

Step 4: Wrap Up

In this design, we proposed a comprehensive system for a live streaming platform like Twitch. If there is extra time at the end of the interview, here are additional points to discuss:

Additional Features:

  • Clips: Allow viewers to create 30-60 second clips from live streams or VODs, which are separately transcoded and stored as shareable content.
  • Subscriptions and Monetization: Payment processing through Stripe or PayPal, subscription tiers (Tier 1, 2, 3), bits/donations, revenue sharing, payout processing.
  • Follow System: Users follow streamers and receive push notifications when they go live using APNs and FCM.
  • Stream Discovery: Elasticsearch-powered search and browse by category, game, language, and tags. Machine learning for personalized recommendations using collaborative filtering and content-based filtering.
  • Moderation Tools: Chat moderation (bans, timeouts, slow mode), automated spam detection, content moderation for stream video using ML models.
  • Advanced Analytics: Real-time dashboards for streamers showing viewer count, chat activity, revenue, stream health metrics, and retention graphs.

Scaling Considerations:

  • Horizontal Scaling: All services are stateless (except WebSocket chat servers which use sticky sessions) to enable horizontal scaling.
  • Database Sharding: Shard by geographic region or channel_id for PostgreSQL tables to distribute load.
  • Caching Layers: Multi-tier caching with Redis for hot data, edge caching at CDN, and client-side caching.
  • Message Queue Scaling: Kafka with 100+ partitions allows parallel processing across many consumers.
  • Auto-Scaling: Use Kubernetes HPA (Horizontal Pod Autoscaler) for automatic scaling based on CPU, memory, and custom metrics (queue depth, viewer count).

Error Handling:

  • Network Failures: HLS players have built-in retry logic. Implement exponential backoff for segment requests.
  • Service Failures: Use circuit breakers (Hystrix, resilience4j) to prevent cascading failures. Fail fast and fallback to degraded experience.
  • Database Failures: Automatic failover to read replicas. Use connection pooling with health checks.
  • Third-Party API Failures: Cache recent responses from external APIs (game metadata, payment providers). Serve stale cached data or a degraded experience when the upstream is unavailable.
  • Transcoding Failures: Automatic job retry with exponential backoff. Maintain hot standby for critical streams.

Security Considerations:

  • Stream Key Protection: Treat stream keys as secrets. Rotate them periodically. Use RTMPS (RTMP over TLS) for ingest so keys are never sent in plaintext.
  • Content Security: Implement DRM for premium content. Use signed URLs for CDN access to prevent hotlinking.
  • Authentication: JWT tokens for API access. OAuth integration for third-party apps. Rate limiting to prevent abuse.
  • DMCA Compliance: Automated content ID matching to detect copyrighted content. Takedown procedures.
  • Privacy: Encrypt PII at rest and in transit. GDPR compliance for EU users. User data deletion workflows.

Cost Optimization:

  • Transcoding: Use spot instances for 60-70% of GPU capacity. Dynamic variant generation based on viewer demand. Reduce bitrates for low-motion content.
  • CDN: Multi-CDN strategy with cost-based routing. Optimize cache hit ratios. Volume commitments and negotiation. Consider P2P delivery for popular content.
  • Storage: Lifecycle policies to move VODs to cheaper tiers. Delete old VODs based on channel settings. Use efficient codecs (HEVC/AV1) for archival.
  • Monitoring: Sample metrics rather than tracking every event. Use aggregation windows. Archive detailed logs to cheap storage.

Monitoring and Analytics:

  • Key Metrics: Concurrent viewers, concurrent streams, viewer growth rate, stream uptime, transcoding latency, CDN cache hit ratio, buffering rate, video startup time.
  • Business Metrics: Revenue per viewer, subscription conversion rate, average watch time, viewer retention, creator payouts.
  • Real-Time Dashboards: Grafana dashboards for operations. Custom dashboards for streamers showing their channel analytics.
  • A/B Testing: Framework for testing pricing, UI changes, recommendation algorithms. Statistical significance testing.

Future Improvements:

  • Ultra-Low Latency: WebRTC-based delivery for sub-second latency. Trade-off is infrastructure complexity and cost.
  • AI-Powered Features: Automatic highlight generation, content moderation, personalized recommendations using deep learning.
  • Interactive Streaming: Polls, predictions, channel points, interactive overlays that viewers can influence.
  • Co-Streaming: Multiple streamers broadcasting together with synchronized streams.
  • 4K and HDR: Support for higher quality formats as bandwidth and device capabilities improve.
  • AV1 Codec: Next-generation codec with roughly 30% better compression than HEVC (about 50% better than H.264). Reduces bandwidth costs but requires significantly more encoding compute.

Congratulations on getting this far! Designing Twitch is a complex system design challenge that requires balancing low latency, massive scale, real-time interactions, and cost efficiency. The key is to start with core functionality, then systematically address each non-functional requirement with proven distributed systems patterns.


Summary

This comprehensive guide covered the design of a live streaming platform like Twitch, including:

  1. Core Functionality: Live video ingestion via RTMP, real-time transcoding to multiple bitrates, adaptive bitrate streaming via HLS/DASH, real-time chat with WebSockets and Kafka, automatic VOD generation.
  2. Key Challenges: Low latency video delivery (2-5 seconds), scaling transcoding to 100,000+ concurrent streams, chat at scale (1M+ messages/second), CDN cost optimization, strong consistency in distributed operations, stream health monitoring and recovery.
  3. Solutions: Optimized transcoding pipeline with GPU acceleration, dynamic quality variant generation, geographically distributed ingest, multi-CDN strategy with intelligent caching, Kafka-based chat architecture, distributed locking with Redis, comprehensive monitoring with automatic failover.
  4. Scalability: Horizontal scaling at every layer, Kubernetes orchestration, auto-scaling based on demand, database sharding, multi-region deployments, efficient caching strategies.

The design demonstrates how to build a production-ready live streaming platform that handles massive scale while maintaining low latency, high availability, and cost efficiency. Every component is designed for horizontal scaling and fault tolerance, ensuring the system can grow with user demand.