Design Reddit

Reddit is a social news aggregation, content rating, and discussion platform where users submit content to communities (subreddits), vote on submissions, and engage in threaded discussions. Designing Reddit presents unique challenges including distributed voting systems, nested comment threading, content ranking algorithms, large-scale moderation, and real-time updates. This document outlines a production-grade architecture capable of supporting 500M monthly active users with millions of subreddits, billions of posts and comments, and real-time interactions.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For content platforms like Reddit, we need to carefully consider user interactions, community management, and content discovery mechanisms.

Functional Requirements

Core Requirements:

  1. Users should be able to create and join subreddits (topic-based communities) with different visibility settings.
  2. Users should be able to create posts (text, link, image, video) and comment on posts with nested threading.
  3. Users should be able to upvote and downvote posts and comments, affecting their visibility and ranking.
  4. The system should provide multiple ranking algorithms (Hot, Best, Top, New, Controversial, Rising) for content discovery.
  5. Moderators should be able to manage their communities with automated and manual moderation tools.

Below the Line (Out of Scope):

  • Users should be able to send private messages and participate in chat.
  • Users should be able to give and receive awards (Reddit Gold, Silver, Platinum).
  • Users should be able to save posts and comments for later viewing.
  • Users should be able to follow other users and receive personalized recommendations.
  • Users should be able to schedule posts and set up recurring threads.

Non-Functional Requirements

Core Requirements:

  • The system should prioritize low latency for feed loading (< 200ms p99) and voting operations (< 50ms p99).
  • The system should ensure strong consistency for content creation and eventual consistency for vote counts and rankings.
  • The system should be able to handle massive write throughput with 5 billion votes per day and 500 million comments per day.
  • The system should provide 99.99% uptime with graceful degradation during outages.

Below the Line (Out of Scope):

  • The system should implement robust spam detection and prevention mechanisms.
  • The system should ensure data security and privacy compliance (GDPR, CCPA).
  • The system should provide comprehensive monitoring and observability.

Clarification Questions & Assumptions:

  • Scale: 500M monthly active users, 100M daily active users, 50M posts per day, 500M comments per day, 5B votes per day.
  • Communities: 1M active subreddits with varying sizes from small niche communities to massive default subreddits.
  • Peak Load: 500K concurrent users during major events or viral content surges.
  • Comment Threading: Support infinite nesting depth with lazy loading for deep threads.
  • Vote Fuzzing: Display obfuscated vote counts so that vote manipulators cannot confirm whether their votes registered.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

For a complex social platform like Reddit, we’ll build the design incrementally, addressing each functional requirement while ensuring the architecture can scale. We’ll start with the core content and voting mechanisms, then layer in ranking algorithms, moderation, and real-time features.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Subreddit: A community focused on a specific topic. Contains metadata including name, description, type (public/private/restricted), subscriber count, custom rules, and settings. Each subreddit has a hierarchy of users with different permission levels: admins, moderators, contributors, and subscribers.

User: Any registered user on the platform. Contains profile information, karma score (aggregate reputation), post and comment history, subreddit memberships, and preferences. Users build reputation through upvotes on their contributions.

Post: A submission to a subreddit. Can be text-based, link, image, video, or poll. Contains title, content, author, timestamps (created, edited), vote counts, comment count, ranking scores (hot, controversy), and flags (NSFW, spoiler, deleted).

Comment: A response to a post or another comment. Supports nested threading with infinite depth. Contains content, author, timestamps, vote counts, depth level, and a materialized path for efficient tree traversal.

Vote: A user’s upvote or downvote on a post or comment. Stores the relationship between user, content, and vote value (1 for upvote, -1 for downvote). Critical for ranking and karma calculation.

Moderation Action: A record of moderator actions within a subreddit. Includes action type (remove, approve, ban), target content or user, reason, and timestamp. Provides transparency and audit trail for community governance.

API Design

Create Subreddit Endpoint: Used by users to create new communities with custom settings and rules.

POST /subreddits -> Subreddit
Body: {
  name: string,
  title: string,
  description: string,
  type: "public" | "private" | "restricted",
  settings: object
}

Create Post Endpoint: Used by users to submit content to a subreddit.

POST /posts -> Post
Body: {
  subredditId: string,
  title: string,
  content: string,
  postType: "text" | "link" | "image" | "video",
  url?: string,
  isNsfw?: boolean
}

Create Comment Endpoint: Used by users to comment on posts or reply to other comments.

POST /comments -> Comment
Body: {
  postId: string,
  parentId?: string,
  content: string
}

Vote Endpoint: Used by users to upvote or downvote content.

POST /votes -> Success/Error
Body: {
  contentId: string,
  contentType: "post" | "comment",
  voteValue: 1 | -1 | 0
}

Get Feed Endpoint: Used by users to retrieve ranked content for a subreddit or their homepage.

GET /feed?subreddit={name}&sort={algorithm}&time={period} -> Post[]
Query Params:
  subreddit: string (optional, defaults to front page)
  sort: "hot" | "best" | "top" | "new" | "controversial" | "rising"
  time: "hour" | "day" | "week" | "month" | "year" | "all"
  page: number

Get Comments Endpoint: Used by users to retrieve comment threads for a post.

GET /posts/:postId/comments?sort={algorithm} -> Comment[]
Query Params:
  sort: "best" | "top" | "new" | "controversial"
  depth: number (for pagination)

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to create and join subreddits with different visibility settings

The core components for subreddit management are:

  • API Gateway: Entry point for all client requests, handling authentication, rate limiting, and request routing to appropriate microservices.
  • Subreddit Service: Manages CRUD operations for subreddits, membership management (subscribers, moderators), and subreddit settings. Validates community rules and handles access control.
  • PostgreSQL Database: Stores subreddit metadata including name, description, type, created timestamp, subscriber count, and settings as JSONB. Indexed by name and subscriber count for efficient queries.
  • Redis Cache: Caches frequently accessed subreddit metadata and member counts for fast retrieval.

Subreddit Creation Flow:

  1. User submits a create subreddit request through the client, providing name, description, and settings.
  2. The API Gateway authenticates the request and forwards it to the Subreddit Service.
  3. The Subreddit Service validates the name is unique and available, then creates an entry in the PostgreSQL database.
  4. The service initializes the user as the subreddit admin and returns the new subreddit object.
  5. Subreddit metadata is cached in Redis with a reasonable TTL.

2. Users should be able to create posts and comments with nested threading

We extend the architecture to support content creation:

  • Post Service: Handles post creation, updates, and deletion. Manages post metadata including title, content, media URLs, and cross-posting logic. Stores posts in PostgreSQL with sharding by subreddit_id for horizontal scaling.
  • Comment Service: Manages nested comment creation and retrieval using a specialized data model for tree structures. Implements materialized path pattern for efficient traversal of comment hierarchies.
  • Cassandra Database: Stores comments with a schema optimized for write-heavy workloads and nested structure retrieval. Partitioned by post_id with clustering by path for ordered access.
  • Media Storage: Object storage (S3) for uploaded images and videos, with CDN for fast content delivery.

Post Creation Flow:

  1. User creates a post with title, content, and optional media attachments.
  2. The API Gateway routes the request to the Post Service.
  3. For media posts, files are uploaded to S3 and the service stores the URLs.
  4. The Post Service creates a record in PostgreSQL with initial vote counts of zero.
  5. The post is indexed for search and added to the subreddit’s feed cache.

Comment Creation Flow:

  1. User submits a comment on a post or as a reply to an existing comment.
  2. The Comment Service determines the comment’s depth and generates a materialized path (e.g., “1.3.7” for a deeply nested reply).
  3. The comment is stored in Cassandra, partitioned by post_id and clustered by path for efficient retrieval.
  4. Vote counters are initialized in Redis for real-time vote tracking.
  5. A notification is sent to the parent comment’s author or post author.

3. Users should be able to upvote and downvote posts and comments

We introduce voting infrastructure with high throughput requirements:

  • Voting Service: Processes upvotes and downvotes with validation and deduplication. Implements rate limiting to prevent abuse and vote brigading detection.
  • Redis (Voting Layer): Stores real-time vote counters and user voting history. Uses sorted sets for ranking and hash maps for tracking individual user votes.
  • Cassandra (Vote Persistence): Stores historical voting data for durability and analytics. Votes are written asynchronously from Redis to Cassandra.
  • Kafka Message Queue: Buffers vote events for asynchronous processing, ensuring no votes are lost during high traffic periods.

Voting Flow:

  1. User clicks upvote or downvote on content, sending a request to the Voting Service.
  2. The service validates the user hasn’t exceeded rate limits (100 votes per minute).
  3. It checks Redis for any existing vote by this user on this content.
  4. The vote is recorded in Redis, updating counters atomically.
  5. A vote event is published to Kafka for batch writing to Cassandra.
  6. The updated vote count triggers score recalculation for ranking algorithms.

4. The system should provide multiple ranking algorithms for content discovery

We add sophisticated ranking and feed generation:

  • Feed & Ranking Service: Generates personalized and subreddit-specific feeds using various ranking algorithms. Pre-computes and caches popular feeds for performance.
  • Ranking Algorithm Engine: Implements Hot (time-decay with logarithmic voting), Best (Wilson score confidence interval), Top (net votes by time period), Controversial (polarized content), New (chronological), and Rising (fast-growing) algorithms.
  • Redis Sorted Sets: Stores pre-computed rankings for each subreddit and algorithm combination. Keys like “hot:r/programming” contain sorted post IDs by score.
  • Background Workers: Continuously recalculate ranking scores for active posts (less than 48 hours old) every 5 minutes.

Feed Generation Flow:

  1. User requests a feed for a subreddit with a specific sorting algorithm (e.g., Hot).
  2. The Feed Service checks Redis for a cached sorted set of post IDs.
  3. If cached, it retrieves the top N posts and hydrates with metadata from PostgreSQL and vote counts from Redis.
  4. If not cached, it calculates scores on-demand using the algorithm and caches the result.
  5. Post data is enriched with user-specific information (has user voted, is saved, etc.).
  6. The feed is returned to the client with pagination support.

5. Moderators should be able to manage their communities with automated and manual moderation tools

We complete the core system with moderation capabilities:

  • Moderation Service: Implements automated rule engine (AutoModerator) and manual moderation actions. Processes flagged content and manages ban lists.
  • AutoModerator Engine: Evaluates YAML-based rules against incoming content, applying actions like remove, filter to queue, or approve based on conditions.
  • ML Spam Detection: Uses machine learning models trained on historical removed content to predict spam probability and flag suspicious posts.
  • Mod Queue Database: Stores flagged content awaiting moderator review in PostgreSQL with indexes for efficient queue management.
  • Mod Actions Log: Audit trail of all moderator actions for transparency and accountability.

Automated Moderation Flow:

  1. User submits a post or comment, which is published to a Kafka moderation topic.
  2. Moderation workers consume events and load subreddit rules (cached in memory).
  3. The AutoModerator engine evaluates conditions like keyword matching, account age, karma thresholds, and domain filters.
  4. If rules are violated, appropriate actions are executed (remove, filter, send message).
  5. ML models predict spam probability; high-confidence spam is auto-removed.
  6. Borderline cases are added to the mod queue for human review.
  7. All actions are logged to the mod actions table.

Step 3: Design Deep Dive

With the core functional requirements satisfied, let’s dig into critical architectural challenges and optimizations.

Deep Dive 1: How do we implement Reddit’s Hot ranking algorithm efficiently at scale?

The Hot algorithm is Reddit’s signature ranking mechanism that surfaces trending content by balancing vote score with recency. The challenge is calculating and updating scores for millions of posts efficiently.

The Hot Algorithm:

The Hot score uses a logarithmic scale for votes to prevent old highly-voted posts from dominating, combined with a time component that favors recent content. The formula combines the sign of the vote score and the base-10 logarithm of its magnitude with the post’s age in seconds since Reddit’s launch epoch (December 8, 2005), divided by 45,000 (approximately 12.5 hours).

The score is calculated as: log base 10 of the absolute vote score (minimum 1) multiplied by the sign of the score, plus the timestamp in seconds from Reddit’s launch date divided by 45,000. This creates a time decay where a post needs exponentially more votes to stay at the same ranking as it ages.
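The formula above can be sketched as a small function. This is a minimal illustration, not Reddit's production code; the exact anchor date only shifts every score by the same constant, so any fixed epoch works.

```python
import math
from datetime import datetime, timezone

# Fixed reference date (Reddit's launch); an assumption for illustration.
EPOCH = datetime(2005, 12, 8, tzinfo=timezone.utc)

def hot_score(upvotes: int, downvotes: int, created: datetime) -> float:
    """Hot ranking: logarithmic vote weight plus linear time decay."""
    score = upvotes - downvotes
    order = math.log10(max(abs(score), 1))  # 10 vs 100 vs 1000 votes differ by 1 each
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (created - EPOCH).total_seconds()
    return round(sign * order + seconds / 45000, 7)
```

Because 45,000 seconds is worth one log unit, a post needs roughly 10x the net votes of a post submitted 12.5 hours later to hold the same rank.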

Implementation Strategy:

Rather than calculating Hot scores on every feed request, we use a pre-computation strategy with background workers. Every 5 minutes, workers recalculate Hot scores for posts less than 48 hours old since older posts rarely appear in Hot feeds.

The updated scores are stored in a “hot_score” column in the posts table and simultaneously updated in Redis sorted sets keyed by subreddit. This allows us to retrieve the top posts with a simple sorted set range query without real-time computation.

For extremely popular subreddits with millions of posts, we maintain separate Redis sorted sets that are continuously updated as votes come in, rather than bulk updates. This ensures the most active communities always have up-to-date rankings.

Caching Strategy:

We cache the top 1000 posts per subreddit per ranking algorithm in Redis sorted sets. When a user requests a feed, we simply query the sorted set with a range operation to get the top 25 posts. This provides sub-millisecond response times for the most common queries.

Cache invalidation happens through vote events. When a post receives votes, we update its score in the sorted set asynchronously. If a post rises into the top 1000, it’s added to the cache. This event-driven approach keeps caches fresh without constant recalculation.

Deep Dive 2: How do we implement the Best algorithm using Wilson score confidence interval?

The Best algorithm solves a statistical problem: given limited voting data, what’s the probability this content is actually good? A post with 10 upvotes and 0 downvotes should rank differently than one with 1000 upvotes and 100 downvotes, even though both have 100% upvote ratios.

Wilson Score Approach:

The Wilson score confidence interval provides a lower bound estimate of the “true” quality of content. It calculates a confidence interval around the upvote ratio, accounting for sample size. Content with more votes has a tighter confidence interval, so we can be more certain of its quality.

The algorithm considers the total number of votes, the proportion of upvotes, and a z-score for the desired confidence level (typically 95%, z=1.96). The formula produces a score between 0 and 1, where higher values indicate higher-quality content based on voting patterns.
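The standard Wilson lower-bound formula, sketched below, captures this. It is a direct transcription of the textbook interval, not Reddit's exact implementation:

```python
import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score confidence interval for the upvote
    ratio; z = 1.96 corresponds to a 95% confidence level."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n  # observed upvote ratio
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)
```

Note how it resolves the example from earlier: 10 upvotes with no downvotes scores around 0.72, while 1000 upvotes against 100 downvotes scores around 0.89, so the larger sample ranks higher despite its lower raw ratio.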

Real-Time Calculation:

Unlike Hot scores that are pre-computed, Best scores are calculated on-demand during feed generation. This is feasible because the calculation is relatively inexpensive compared to database queries. We fetch the top posts by a combination of net score and recency, then apply the Wilson score calculation to rank them.

The Voting Service maintains real-time upvote and downvote counts in Redis for each post and comment. When generating a Best feed, we retrieve these counts and calculate Wilson scores in the application layer before sorting and returning results.

For comment sorting within threads, Best is the default algorithm. Comments are retrieved from Cassandra with their vote counts, Wilson scores are calculated, and comments at each thread level are sorted accordingly before being returned to the client.

Deep Dive 3: How do we handle high-throughput voting with 5 billion votes per day?

With 5 billion votes per day, that’s approximately 57,870 votes per second on average, with significantly higher peaks during viral events. Traditional database writes would be prohibitively expensive and slow.

Two-Tier Voting Architecture:

We use a two-tier system: Redis for real-time vote counting and Cassandra for durable storage. This separates the hot path (user voting) from the cold path (persistence and analytics).

When a user votes, the Voting Service performs several Redis operations atomically: check if the user has already voted on this content, update or create the user’s vote record, increment or decrement vote counters for the content, and update the user’s karma score. All of this happens in milliseconds.

The Redis vote structure uses multiple data types efficiently. A hash map stores individual user votes with keys like “vote:user:789:post:12345” with values 1 or -1. Separate counters track aggregate scores with keys like “vote:post:12345:score”, “vote:post:12345:upvotes”, and “vote:post:12345:downvotes”.
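The vote-transition logic those keys support can be sketched as a pure function, using a plain dict as a stand-in for Redis. In production the same updates would run atomically (for example in a Lua script or a MULTI/EXEC block); the key names follow the examples above.

```python
def apply_vote(store: dict, user_id: str, content_id: str, new_vote: int) -> None:
    """Apply a vote transition (1 = up, -1 = down, 0 = retract) to counters
    mirroring the Redis keys described above. `store` stands in for Redis."""
    vote_key = f"vote:user:{user_id}:post:{content_id}"
    old_vote = store.get(vote_key, 0)
    if new_vote == old_vote:
        return  # idempotent: repeating the same vote changes nothing
    up_key = f"vote:post:{content_id}:upvotes"
    down_key = f"vote:post:{content_id}:downvotes"
    score_key = f"vote:post:{content_id}:score"
    # Bool arithmetic adjusts each counter by -1, 0, or +1.
    store[up_key] = store.get(up_key, 0) + (new_vote == 1) - (old_vote == 1)
    store[down_key] = store.get(down_key, 0) + (new_vote == -1) - (old_vote == -1)
    store[score_key] = store.get(score_key, 0) + new_vote - old_vote
    if new_vote == 0:
        store.pop(vote_key, None)
    else:
        store[vote_key] = new_vote
```

Handling the up-to-down transition as a single delta of -2 on the score (rather than two separate writes) is what keeps the counters consistent when a user flips their vote.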

Asynchronous Persistence:

Vote events are published to Kafka topics for asynchronous processing. Consumer workers batch votes and write them to Cassandra in bulk operations every 30 seconds. This reduces write load on Cassandra by several orders of magnitude while maintaining durability through Kafka’s persistent log.

If Redis fails or is restarted, we can rebuild vote counters from Cassandra’s historical data. The system maintains eventual consistency between the two stores: Redis is authoritative for current counts during normal operation, while Cassandra holds the durable historical record.

Vote Fuzzing:

To deter vote manipulation, Reddit doesn’t show exact vote counts. The displayed counts have noise added based on a fuzzing factor (typically 5%). Each user sees slightly different numbers, cached per session, making it impossible for manipulators to confirm whether their individual votes registered.

The true vote count is used for ranking calculations, but when serving post or comment data to clients, we apply fuzzing to the displayed numbers. This discourages vote manipulation while maintaining accurate content ranking.
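One way to implement per-session fuzzing deterministically is to hash the (session, content) pair into a noise factor, so the same viewer always sees the same number while different viewers see different ones. The exact mechanics here are assumptions for illustration:

```python
import hashlib

def fuzzed_count(true_count: int, session_id: str, content_id: str,
                 fuzz_factor: float = 0.05) -> int:
    """Deterministic noise within ±fuzz_factor of the true count, stable
    per (session, content) pair. Sketch only; not Reddit's actual scheme."""
    digest = hashlib.sha256(f"{session_id}:{content_id}".encode()).digest()
    # Map the first 4 bytes of the hash onto a factor in [-1.0, 1.0].
    unit = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF * 2 - 1
    return true_count + round(true_count * fuzz_factor * unit)
```

Hashing avoids storing per-session noise explicitly; the "cached per session" behavior falls out of determinism.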

Rate Limiting and Validation:

The Voting Service implements strict rate limiting at 100 votes per minute per user. Before processing any vote, we check Redis to ensure the user hasn’t exceeded this limit. Suspicious voting patterns trigger additional checks.

We also implement vote brigading detection using machine learning models that identify coordinated voting patterns. Users caught manipulating votes may be shadowbanned, where their votes are accepted but don’t actually affect scores, preventing them from knowing they’ve been detected.

Deep Dive 4: How do we efficiently store and retrieve nested comments with infinite depth?

Reddit’s comment system supports unlimited nesting depth, creating complex tree structures that can be thousands of comments deep. Traditional relational approaches with recursive queries don’t scale well for this use case.

Materialized Path Pattern:

We use the materialized path pattern where each comment stores a string representing its position in the tree. For example, a top-level comment might have path “1”, its first reply has path “1.3”, and a reply to that has path “1.3.7”. This encoding makes tree traversal efficient.

When creating a new comment, the Comment Service determines its parent’s path (or generates a new root path for top-level comments), appends the new comment’s ID, and stores this as the comment’s path. The depth is also stored as an integer for quick filtering.

Cassandra Schema Optimization:

Comments are stored in Cassandra partitioned by post_id and clustered by path in ascending order. This means all comments for a post are co-located on the same nodes, and they’re stored in tree traversal order on disk.

To retrieve all comments for a post, we simply query by post_id, and Cassandra returns them pre-sorted by path. To retrieve a specific comment subtree (all replies under a comment), we use a range query where path is greater than or equal to the parent’s path but less than the next sibling.

For example, to get all replies under comment with path “1.3”, we query for paths >= “1.3” and < “1.4”. This efficiently fetches the entire subtree in sorted order without recursive queries or multiple round trips.

Lazy Loading and Pagination:

For posts with thousands of comments, we don’t load everything at once. The initial request fetches top-level comments (depth = 0) sorted by the selected algorithm (Best, Top, New, Controversial). For each top-level comment, we also fetch its top 3 immediate replies.

When users click “load more comments”, we fetch additional comments at that nesting level with the same parent. For extremely deep threads, we use “continue this thread” links that load a new page starting from that point in the hierarchy.

Comment Sorting:

Comments at each level of the tree can be sorted independently. We retrieve comments from Cassandra with their vote counts from Redis, apply the selected sorting algorithm (typically Wilson score for Best), and sort comments at each depth level before returning to the client.

This allows users to see the “best” replies to each comment, not just the best top-level comments, creating a richer discussion experience where quality rises throughout the tree.
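The per-level sorting described above amounts to grouping a flat comment list by parent, sorting each sibling group by its score, and emitting the tree depth-first. A minimal sketch, assuming each comment carries a precomputed score (e.g. a Wilson lower bound):

```python
from collections import defaultdict

def sort_thread(comments: list[dict]) -> list[dict]:
    """Order a flat comment list depth-first, sorting siblings at every level
    by score descending. Each dict needs id, parent_id (None at top level),
    and score."""
    children = defaultdict(list)
    for c in comments:
        children[c["parent_id"]].append(c)
    for group in children.values():
        group.sort(key=lambda c: c["score"], reverse=True)
    out: list[dict] = []
    def walk(parent_id):
        for c in children[parent_id]:
            out.append(c)
            walk(c["id"])
    walk(None)
    return out
```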

Deep Dive 5: How do we scale subreddit moderation for millions of communities?

With over 1 million active subreddits, centralized moderation is impossible. Reddit’s solution is distributed community moderation with powerful automated tools.

AutoModerator Rule Engine:

Each subreddit can configure custom moderation rules in a YAML-like format. Rules specify conditions (domain filters, keyword matching, account age thresholds, karma requirements) and actions (remove, approve, filter to mod queue, send message to user).

When content is created, it’s published to a Kafka “moderation” topic. Worker processes consume these events, load the subreddit’s rules from cache (or database), and evaluate each rule sequentially. Rules are stored as JSONB in PostgreSQL for flexibility.

The rule engine supports complex logic including regex pattern matching, user flair requirements, post flair validation, and time-based rules. Conditions can be combined with AND/OR operators for sophisticated filtering.
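A toy version of the rule engine makes the first-match evaluation concrete. The rule shape below is loosely modeled on AutoModerator configs; the field names are assumptions, not Reddit's actual schema:

```python
import re

# Hypothetical rules: all conditions in "if" must hold (AND semantics).
RULES = [
    {"if": {"title_regex": r"(?i)free crypto", "max_account_age_days": 7},
     "action": "remove"},
    {"if": {"min_karma": 10}, "action": "approve"},
]

def evaluate(content: dict, rules: list[dict]) -> str:
    """Return the action of the first rule whose conditions all match,
    else 'filter' (send to the mod queue)."""
    for rule in rules:
        cond = rule["if"]
        ok = True
        if "title_regex" in cond:
            ok &= re.search(cond["title_regex"], content["title"]) is not None
        if "max_account_age_days" in cond:
            ok &= content["account_age_days"] <= cond["max_account_age_days"]
        if "min_karma" in cond:
            ok &= content["karma"] >= cond["min_karma"]
        if ok:
            return rule["action"]
    return "filter"
```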

ML-Based Spam Detection:

In addition to rule-based moderation, we employ machine learning models trained on historical moderation actions. Features include account age, total karma, posting frequency, text content analysis (toxicity scores), link domains, and user behavior patterns.

The model outputs a spam probability score. Content with scores above 0.9 is automatically removed and logged. Content with scores between 0.5 and 0.9 is filtered to the mod queue for manual review. This catches spam that evades rule-based filters while minimizing false positives.
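The threshold routing reduces to a three-way decision, shown here as a sketch using the thresholds above:

```python
def route_by_spam_score(score: float) -> str:
    """Route content on the model's spam probability: above 0.9 auto-remove,
    0.5 to 0.9 to the mod queue, below 0.5 publish."""
    if score > 0.9:
        return "auto_remove"
    if score >= 0.5:
        return "mod_queue"
    return "publish"
```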

Models are trained separately for different content types (posts vs comments) and periodically retrained on recent moderation decisions to adapt to evolving spam tactics.

Mod Queue Management:

Content flagged by AutoModerator or ML models is added to the mod queue, stored in a PostgreSQL table indexed by subreddit_id, status (pending/approved/removed), and flagged timestamp. Moderators access this queue through a dedicated interface.

When moderators review content, they can approve (making it visible), remove (hiding it), or take additional actions like banning the user. All moderator actions are logged to a separate mod_actions table for transparency and potential appeals.

The mod queue implements priority sorting where content with higher spam scores, more user reports, or from new accounts appears first. This helps moderators focus on the most problematic content.

Scaling Moderation Across Communities:

Each subreddit has its own mod team and rules, making moderation naturally distributed. Large subreddits often have dozens of moderators, while smaller communities may have just one or two. The system scales horizontally since subreddits operate independently.

For site-wide rule violations (illegal content, harassment), administrators have global moderation powers. These actions are processed through a separate high-priority pipeline with human review for serious cases.

Deep Dive 6: How do we implement real-time comment updates for live discussions?

During major events like breaking news or live sports, Reddit threads can receive hundreds of comments per minute. Users expect to see new comments appear without manually refreshing.

WebSocket Architecture:

When users open a post, the client establishes a WebSocket connection to a dedicated WebSocket server cluster, separate from the REST API servers to isolate connection management from request processing.

Upon connection, the client subscribes to the specific post’s channel. The WebSocket server subscribes to a Redis Pub/Sub channel with a key like “post:12345:comments”. All WebSocket servers for a post subscribe to the same channel, enabling message fanout.

When a new comment is created through the Comment Service, after storing it in Cassandra, the service publishes an event to the Redis Pub/Sub channel. This event includes the comment ID, parent ID, author username, content, and timestamp.

Message Fanout:

Redis Pub/Sub handles message distribution to all subscribed WebSocket servers. Each server maintains a registry of which client connections are interested in which posts. When receiving a message, the server pushes it to all connected clients subscribed to that post.

The client receives the comment data and dynamically injects it into the DOM at the appropriate position in the comment tree, based on the parent ID. The new comment appears with a subtle animation to draw attention without disrupting reading.
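The per-server registry and fanout step can be sketched in-process. The Redis Pub/Sub hop between servers is elided; `publish` plays the role of the handler that fires when a Pub/Sub message for a post channel arrives. Names and shapes are illustrative:

```python
from collections import defaultdict

class CommentFanout:
    """Registry a WebSocket server keeps: which connections watch which posts."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # post_id -> set of connections

    def subscribe(self, post_id, conn):
        self.subscribers[post_id].add(conn)

    def unsubscribe(self, post_id, conn):
        self.subscribers[post_id].discard(conn)

    def publish(self, post_id, comment: dict) -> int:
        """Push a new comment to every connection watching the post;
        returns how many clients were notified."""
        conns = self.subscribers.get(post_id, set())
        for conn in conns:
            conn.send(comment)  # conn is anything exposing a send() method
        return len(conns)
```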

Scaling WebSocket Connections:

WebSocket connections are stateful and resource-intensive. Each server can handle approximately 50,000 concurrent connections with 64GB RAM. We horizontally scale by adding more WebSocket servers and using load balancers to distribute connections.

Redis Pub/Sub naturally scales across server instances since all servers subscribe to the same channels. There’s no need for sticky sessions or complex state management.

For mobile apps or environments where WebSockets aren’t available, we provide a fallback polling mechanism where clients request new comments every 10 seconds. This is less efficient but ensures functionality in all environments.

Connection Management:

WebSocket connections are monitored for activity. If a client doesn’t send any messages (pings or subscription updates) for 5 minutes, the server closes the connection to free resources. Clients automatically reconnect if they’re still viewing the post.

When a client navigates away from a post, it sends an unsubscribe message before closing the connection. This allows the server to clean up subscriptions immediately rather than waiting for timeout.

Deep Dive 7: How do we implement full-text search across billions of posts and comments?

Reddit’s search functionality needs to index and query billions of text documents with sub-second latency, supporting filters by subreddit, time period, and sort order.

Elasticsearch Architecture:

We use Elasticsearch as our search engine, with documents representing posts and comments. Each document contains fields for ID, type (post or comment), subreddit, author, title, content, score, created timestamp, and comment count.

The index uses the English analyzer for text fields, which handles stemming (running → run), stop word removal, and tokenization. The title field has a “keyword” subfield for exact matching, useful for finding specific posts.

Indexing Pipeline:

When posts or comments are created, events are published to a Kafka “search_indexing” topic. Dedicated Elasticsearch indexer workers consume these events and index documents in Elasticsearch. This asynchronous approach prevents search indexing from slowing down content creation.

Documents are indexed with a 30-second refresh interval, providing near real-time search. For critical use cases requiring immediate indexing, we can force a refresh, though this is avoided for performance reasons.

Sharding Strategy:

For very popular subreddits with millions of posts, we use separate shards. This allows search queries scoped to a single subreddit to only hit relevant shards, improving performance. Smaller subreddits share shards.

Each shard has 2 replicas for high availability and read scaling. If a node fails, replica shards on other nodes continue serving requests without disruption.

Query Optimization:

Search queries use Elasticsearch’s bool query combining must (required conditions), should (optional scoring factors), and filter (non-scored conditions). The title field is boosted 3x relative to content, making title matches appear higher in results.

For time-based filtering, we use range queries on the created timestamp field. For subreddit filtering, we use term queries on the subreddit keyword field. These filters are cached by Elasticsearch for performance.

Common queries like “hot posts in r/programming” are cached at the application layer with a 5-minute TTL. This dramatically reduces Elasticsearch load for popular searches.

Scaling Search:

Our Elasticsearch cluster consists of 15 nodes with 5 primary shards and 2 replicas per shard. This provides high throughput and redundancy. We use separate indexes per month for time-based queries, allowing us to archive old indexes to cheaper storage.

For autocomplete functionality (suggesting subreddit names as users type), we use a separate completion suggester index with edge n-grams, providing instant suggestions without full-text search overhead.

Step 4: Wrap Up

In this chapter, we designed a comprehensive system architecture for Reddit, handling billions of content interactions daily. If there’s extra time, here are additional topics to discuss:

Additional Features:

  • Awards and Gilding: A virtual currency system where users can purchase coins and award posts/comments. This involves payment processing integration, coin balance management (stored in PostgreSQL with ACID transactions), and award transaction logging. Awards provide benefits like Reddit Premium subscriptions and bonus coins.

  • Private Messaging and Chat: Direct communication between users using a separate messaging service. Messages are stored in Cassandra partitioned by conversation ID for efficient retrieval. Real-time delivery uses WebSockets similar to comment updates.

  • Cross-Posting: Sharing a post to multiple subreddits while maintaining a link to the original. The system stores cross-post relationships and aggregates vote counts across all instances while showing the original context.

  • User Profiles and Karma: Aggregate reputation scores calculated from upvotes received on posts and comments. Karma is stored in Redis for real-time updates and periodically synced to PostgreSQL. User profiles display post history, comment history, and saved content.

  • Content Recommendations: Machine learning models that suggest subreddits based on browsing history and analyze content similarity for personalized feeds. This involves a separate recommendation service with collaborative filtering algorithms.

Scaling Considerations:

  • Database Sharding: PostgreSQL posts table is sharded by subreddit_id, distributing load across multiple database servers. Each shard handles a subset of subreddits, with popular subreddits potentially getting dedicated shards.

  • Read Replicas: PostgreSQL master-replica setup with PgBouncer connection pooling. Read queries are distributed across replicas while writes go to the master. This dramatically improves read throughput for metadata queries.

  • Table Partitioning: The posts table is partitioned by created_at timestamp with monthly partitions. Old partitions can be archived to cheaper storage, and queries filtering by time are much faster since they only scan relevant partitions.

  • CDN for Media: All uploaded images and videos are stored in S3 and served through CloudFront CDN. This reduces origin server load and provides low-latency media delivery globally.

  • Multi-Region Deployment: Deploy the entire stack in multiple geographic regions (US, Europe, Asia) with geographic load balancing routing users to the nearest region. Cross-region data replication ensures eventual consistency globally.
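The subreddit_id sharding above can be sketched as a routing function. The modulo scheme is illustrative; a production deployment would typically use a lookup table so hot subreddits can be moved to dedicated shards without rehashing everything (the override map below is hypothetical):

```python
NUM_SHARDS = 20  # matches the PostgreSQL sizing later in this chapter

# Hypothetical overrides for very hot communities that have been
# moved onto dedicated shards.
DEDICATED_SHARDS: dict[int, int] = {}

def shard_for(subreddit_id: int) -> int:
    """Pick the PostgreSQL shard holding a subreddit's posts."""
    if subreddit_id in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[subreddit_id]
    return subreddit_id % NUM_SHARDS
```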

Error Handling and Resilience:

  • Circuit Breakers: Protect against cascading failures by stopping requests to failing services. If the Comment Service is down, the circuit breaker fails comment requests fast instead of letting them time out, returning a cached response or a graceful error.

  • Graceful Degradation: During partial outages, disable non-critical features while maintaining core functionality. For example, if Elasticsearch is down, disable search but keep posting and commenting working.

  • Rate Limiting: Implement token bucket rate limiting per user and per IP address to prevent abuse. Different limits apply to different operations: 100 votes/minute, 10 posts/hour, 100 comments/hour.

  • Retry Logic with Backoff: For transient failures, implement exponential backoff retries. However, be careful with retries on write operations to avoid duplicate content creation. Use idempotency keys for safe retries.
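The token bucket mentioned above can be sketched as follows; a per-user limit such as 100 votes/minute translates to a capacity of 100 with a refill rate of 100/60 tokens per second. This in-memory version is a sketch only:

```python
import time

class TokenBucket:
    """In-memory token bucket. Production deployments typically keep
    bucket state in Redis so limits hold across all API servers."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# votes_limiter = TokenBucket(capacity=100, refill_per_sec=100 / 60)
```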

Monitoring and Observability:

  • Key Metrics: Track votes per second, comment creation latency (p50, p95, p99), cache hit rates for Redis, hot score calculation time, search query latency, and mod queue depth per subreddit.

  • Distributed Tracing: Use OpenTelemetry or similar to trace requests across microservices. This helps identify bottlenecks when a single user request touches multiple services (API Gateway → Post Service → Vote Service → Notification Service).

  • Alerting: Set up alerts for critical conditions: vote counter drift between Redis and Cassandra, high mod queue wait times (> 1 hour), Elasticsearch indexing lag (> 5 minutes), abnormal voting patterns indicating brigading, and service error rates exceeding thresholds.

  • Real-Time Dashboards: Operations teams need visibility into system health with dashboards showing request rates, error rates, latency distributions, database connection pool usage, Kafka consumer lag, and cache hit rates.

Security Considerations:

  • Authentication and Authorization: Use JWT tokens for API authentication with short expiration times. Implement role-based access control (RBAC) for moderator actions with fine-grained permissions.

  • Content Validation: Sanitize all user-generated content to prevent XSS attacks. Validate URLs to prevent malicious redirects. Scan uploaded files for malware before storing in S3.

  • DDoS Protection: Use CloudFlare or AWS Shield to protect against volumetric DDoS attacks. Implement application-layer rate limiting to prevent resource exhaustion from coordinated attacks.

  • Data Privacy: Encrypt sensitive data at rest (user emails, passwords) and in transit (TLS for all connections). Implement data retention policies and support GDPR data export/deletion requests.

Cost Optimization:

  • Use Spot Instances: Run background workers (ranking calculation, moderation workers, search indexers) on spot instances to reduce costs by 70%. These can tolerate interruptions since work is idempotent.

  • Cold Storage: Archive posts older than 1 year to S3 Glacier or equivalent cold storage. They’re rarely accessed but must be retrievable for compliance. This saves 90% on storage costs for historical data.

  • Aggressive Caching: With proper cache warming and invalidation strategies, reduce database queries by 90%. Every cache hit saves an expensive database query, dramatically reducing infrastructure costs.

  • Media Compression: Automatically compress uploaded images using WebP format and videos using H.265 codec. This reduces storage and bandwidth costs by 40-60% while maintaining visual quality.

Future Enhancements:

  • Native Video Platform: Move from hosting videos externally to a Reddit-native video service with adaptive bitrate streaming, reducing dependence on third-party platforms.

  • Enhanced Live Features: Improve real-time commenting for breaking news and live events with features like live chat modes, temporary high-frequency updates, and comment streaming.

  • Improved Recommendation Engine: Develop sophisticated ML models for subreddit discovery based on browsing patterns, content similarity, and collaborative filtering. Use graph neural networks to model community relationships.

  • Community Points and Blockchain: Implement decentralized community rewards using blockchain technology, giving users ownership of their reputation and enabling new community governance models.

  • Edge Computing: Deploy ranking algorithms and feed generation closer to users using edge computing platforms like CloudFlare Workers, reducing latency by eliminating data center round trips.

Capacity Planning:

For our target load of 500M monthly active users, the infrastructure requirements are substantial. With 50M posts per day (580 posts/second average), 500M comments per day (5,787 comments/second average), and 5B votes per day (57,870 votes/second average), we need significant capacity.
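These per-second averages follow directly from the daily volumes (peak load, often a few multiples of average, is not modeled here):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily = {
    "posts": 50_000_000,
    "comments": 500_000_000,
    "votes": 5_000_000_000,
}

for name, count in daily.items():
    print(f"{name}: ~{count / SECONDS_PER_DAY:,.0f}/second average")
# posts: ~579/second, comments: ~5,787/second, votes: ~57,870/second
```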

Infrastructure Sizing:

  • API Servers: 200 instances with auto-scaling based on CPU and request rate. Each instance handles approximately 500 requests per second, providing 100K total QPS capacity.

  • PostgreSQL: 20 shards for posts and subreddit data, each with 2 read replicas (60 total database servers). Each shard handles 50,000 subreddits with PgBouncer connection pooling.

  • Cassandra: 50 nodes across 3 data centers with replication factor 3. Each node handles billions of comments with tunable consistency (typically QUORUM for reads and writes).

  • Redis: 10-node cluster (5 masters, 5 replicas) for vote counters and caching, plus a separate 6-node cluster for Pub/Sub. Each master handles 100K operations per second.

  • Elasticsearch: 15 nodes with 5 primary shards and 2 replicas per shard. Each node indexes 1,000 documents per second and serves 5,000 search queries per second.

  • Kafka: 12 brokers (3 per availability zone across 4 zones) with 30-day retention. Handles 100K messages per second for vote events, moderation events, and search indexing.

Congratulations on getting this far! Designing Reddit touches on numerous distributed systems challenges: high-write-throughput systems, sophisticated ranking algorithms, nested data structures, distributed moderation, and real-time updates. The key is balancing consistency with performance, choosing the right database for each use case, and leveraging caching aggressively.


Summary

This comprehensive guide covered the design of a social news aggregation platform like Reddit, including:

  1. Core Functionality: Subreddit management, content creation (posts and comments), voting systems, ranking algorithms, and moderation tools.

  2. Key Challenges: High-throughput voting (57,870 votes/second), nested comment threading with infinite depth, multiple ranking algorithms, large-scale distributed moderation, and real-time comment updates.

  3. Solutions: Two-tier voting architecture (Redis + Cassandra), materialized path pattern for comments, pre-computed ranking scores with background workers, rule-based and ML-powered moderation, WebSocket-based real-time updates, and Elasticsearch for full-text search.

  4. Scalability: Database sharding by subreddit, aggressive multi-layer caching, asynchronous processing with Kafka, horizontal scaling of all services, and CDN for media delivery.

The design demonstrates how to build a write-heavy social platform with complex ranking requirements, maintaining sub-second latency while handling billions of interactions daily across millions of independent communities.