Design Yelp

Yelp is a local business review and discovery platform that connects consumers with local businesses through user-generated reviews, ratings, photos, and recommendations. With over 244 million reviews and 33 million unique monthly visitors, Yelp helps users discover restaurants, salons, mechanics, and other local services based on location, ratings, and detailed reviews.

Designing Yelp presents unique challenges including efficient geospatial search at scale, handling high read/write ratios, ranking and filtering millions of reviews, detecting fake reviews and spam, managing photo storage and moderation, providing real-time updates for popular businesses, and building sophisticated recommendation engines.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the architecture, let’s clearly define the functional and non-functional requirements. For a platform like Yelp, we need to balance discovery features, user-generated content, and data integrity.

Functional Requirements

Core Requirements:

  1. Users should be able to search for businesses by location, category, name, and apply filters (rating, price, open now, etc.).
  2. Users should be able to view detailed business information including reviews, ratings, photos, hours, and contact details.
  3. Users should be able to write reviews, rate businesses (1-5 stars), upload photos, and mark reviews as helpful.
  4. Users should be able to check in to businesses and earn badges/achievements.
  5. Business owners should be able to claim their business, update information, respond to reviews, and view analytics.

Below the Line (Out of Scope):

  • Users should be able to make reservations at restaurants through Yelp.
  • Users should be able to order food delivery through Yelp.
  • Users should be able to request quotes from service businesses (contractors, repair services).
  • Business owners should be able to run advertising campaigns (Yelp Ads).
  • Users should be able to create and share collections/lists of businesses.
  • Integration with third-party reservation systems (OpenTable, Resy).

Non-Functional Requirements

Core Requirements:

  • The system should prioritize low latency for search queries (< 200ms for geospatial search results).
  • The system should handle high read traffic (90:1 read to write ratio) with millions of concurrent users.
  • The system should keep reviews and ratings consistent; eventual consistency with < 5 second propagation delay is acceptable.
  • The system should detect and filter fake reviews and spam using ML models.
  • The system should scale to handle hundreds of millions of businesses and billions of reviews globally.

Below the Line (Out of Scope):

  • The system should comply with data privacy regulations (GDPR, CCPA) and handle user data responsibly.
  • The system should provide 99.9% availability with multi-region failover.
  • The system should support real-time analytics for business owners.
  • The system should have comprehensive monitoring and alerting infrastructure.

Clarification Questions & Assumptions:

  • Scale: 200 million businesses globally, 244 million reviews, 33 million monthly active users.
  • Geographic Coverage: Primarily US, Canada, and select international markets.
  • Search Volume: 100 million searches per day, with peaks during lunch/dinner hours.
  • Review Volume: ~1 million new reviews per day.
  • Photo Volume: 10 million photos uploaded per month.
  • Read:Write Ratio: Approximately 90:1 (users browse much more than they contribute).
  • Search Radius: Default 5 mile radius, expandable to 25+ miles for sparse areas.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

We’ll build the system incrementally, addressing each functional requirement sequentially:

  1. Business search with geospatial queries
  2. Business information display with reviews and ratings
  3. Review creation, photo uploads, and helpful votes
  4. Check-ins and gamification
  5. Business owner dashboard and review responses

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Business: Represents a local business with comprehensive information including name, category, address, coordinates (lat/long), hours of operation, price range, phone number, website, and aggregate metrics (average rating, review count, photo count). Also includes verification status and claim status.

Review: User-generated content about a business including star rating (1-5), review text, timestamp, helpful vote count, funny vote count, cool vote count, and moderation status. Reviews are immutable once posted but can be flagged for moderation.

User: Represents a Yelp user with profile information (name, location, profile photo, review count, photo count, friend count, elite status). Includes reputation metrics like helpful votes received, review distribution, and badges earned.

Photo: Images uploaded by users or business owners. Contains image URL (multiple resolutions), caption, business association, uploader, timestamp, and moderation status. Photos go through ML-based content moderation before being publicly visible.

Check-in: Records when a user visits a business, including timestamp and optional sharing preferences. Used for gamification (badges, mayorships) and to verify review authenticity.

Rating Aggregation: Pre-computed aggregate statistics for each business including overall rating (weighted average), rating distribution (5-star, 4-star, etc.), filtered ratings (by date, user characteristics), and trend analysis.

API Design

Search Businesses Endpoint: Core discovery API that handles geospatial search with filtering.

GET /businesses/search -> SearchResponse
Query Parameters: location, term, categories, radius, price, open_now, rating, sort_by, offset, limit
Response: businesses, total, region center

Get Business Details Endpoint: Retrieves comprehensive information about a specific business.

GET /businesses/:businessId -> BusinessDetails
Response: business, reviews, photos, hours, attributes, coordinates

Get Reviews Endpoint: Retrieves reviews for a business with pagination and filtering.

GET /businesses/:businessId/reviews -> ReviewResponse
Query Parameters: sort, language, offset, limit
Response: reviews, total, rating_distribution

Create Review Endpoint: Allows users to submit a review for a business.

POST /reviews -> Review
Body: businessId, rating, text, photos

Reviews go through spam detection and content moderation before being published.

Upload Photo Endpoint: Handles photo uploads for businesses or reviews.

POST /photos -> Photo
Body: businessId, image, caption

Returns presigned URL for direct S3 upload, then triggers async processing.

Mark Review Helpful Endpoint: Allows users to vote on review helpfulness.

POST /reviews/:reviewId/vote -> Success/Error
Body: vote_type (useful, funny, cool)

Check-in Endpoint: Records a user visit to a business.

POST /checkins -> Checkin
Body: businessId, latitude, longitude

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to search for businesses by location, category, and filters

Core components for geospatial search:

  • Web/Mobile Client: User interface for search, discovery, and interaction.
  • API Gateway: Single entry point for all client requests, handles authentication (JWT), rate limiting, request routing, and API versioning.
  • Search Service: Orchestrates search queries, handles query parsing, applies business logic, and aggregates results from multiple data sources.
  • Elasticsearch Cluster: Primary search engine with geo_point data type for geospatial queries. Indexes business data with location coordinates, categories, ratings, and text fields for full-text search.
  • Business Database (PostgreSQL): Source of truth for business information. Stores comprehensive business data with relational structure.
  • Redis Cache: In-memory cache for hot businesses, recent searches, and frequently accessed data. Implements cache-aside pattern with TTL.
  • CDN (CloudFront): Caches static assets (business photos, logos, maps) and API responses for anonymous users.

Search Flow:

  1. User enters search query (e.g., “pizza near me”) in the client app.
  2. Client sends GET request to the search endpoint with location and term parameters.
  3. API Gateway validates the request, checks rate limits, and routes to Search Service.
  4. Search Service parses the query, identifies intent, and constructs an Elasticsearch query with geo_distance filtering, boolean queries for multiple filters, and function_score for ranking based on rating, review count, and distance.
  5. Elasticsearch returns ranked business IDs and basic metadata (avg 50-100ms).
  6. Search Service enriches results by checking Redis cache for business details. On cache miss, fetches from PostgreSQL and updates cache.
  7. Search Service applies final business logic (personalization, promoted listings) and returns results.
  8. Client displays results with business cards (name, rating, price, distance, photo).
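Step 4 above can be sketched as the query body the Search Service might send to Elasticsearch. The field names (`location`, `name_and_categories`, `rating`, `review_count`) are assumptions about the index mapping, but `geo_distance`, `function_score`, `field_value_factor`, and `gauss` decay are standard Elasticsearch constructs:

```python
# Sketch of the Elasticsearch query body for a search like "pizza near me".
# Field names are assumed; the query combines a geo_distance filter with a
# function_score that boosts by rating, review count, and proximity.
def build_search_query(lat, lon, term, radius_miles=5, min_rating=None):
    filters = [{"geo_distance": {"distance": f"{radius_miles}mi",
                                 "location": {"lat": lat, "lon": lon}}}]
    if min_rating is not None:
        filters.append({"range": {"rating": {"gte": min_rating}}})
    return {
        "query": {
            "function_score": {
                "query": {"bool": {
                    "must": [{"match": {"name_and_categories": term}}],
                    "filter": filters,
                }},
                "functions": [
                    {"field_value_factor": {"field": "rating", "factor": 1.2}},
                    {"field_value_factor": {"field": "review_count",
                                            "modifier": "log1p"}},
                    {"gauss": {"location": {
                        "origin": {"lat": lat, "lon": lon},
                        "scale": "2mi"}}},
                ],
                "score_mode": "sum",
            }
        }
    }
```

The `gauss` decay function smoothly penalizes distance rather than cutting off hard at the radius, which tends to produce more natural rankings.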

2. Users should be able to view detailed business information including reviews and ratings

Additional components needed:

  • Business Service: Manages business information, handles updates from business owners, and coordinates with other services.
  • Review Service: Manages review lifecycle including creation, retrieval, voting, and moderation.
  • Rating Aggregation Service: Pre-computes and maintains aggregate rating statistics for businesses.
  • PostgreSQL Database: Stores businesses, reviews, users, photos, check-ins with proper relational structure and indexes.

Business Details Flow:

  1. User taps on a business from search results, client requests business details by ID.
  2. API Gateway routes to Business Service.
  3. Business Service checks Redis cache for business details (hot businesses cached).
  4. On cache miss, fetches from PostgreSQL including business information, rating aggregation, photos, and operating hours.
  5. Business Service makes parallel requests to Review Service for top reviews (sorted by helpfulness) and Photo Service for featured photos.
  6. Results are aggregated, cached in Redis (15 minute TTL), and returned to client.
  7. Client renders business page with all information.
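Steps 3-6 are the classic cache-aside pattern. A minimal sketch, using an in-memory dict with TTLs to stand in for Redis, and a hypothetical `fetch_from_db` loader in place of the PostgreSQL query:

```python
import time

# Cache-aside sketch for business details: check the cache first, and on a
# miss fetch from the database and populate the cache with a TTL.
_cache = {}  # key -> (expires_at, value)

def get_business_details(business_id, fetch_from_db, ttl_seconds=900):
    key = f"business:{business_id}:details"
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                      # cache hit
    value = fetch_from_db(business_id)     # cache miss: go to PostgreSQL
    _cache[key] = (time.time() + ttl_seconds, value)
    return value
```

The 900-second TTL mirrors the 15-minute figure above; in production the TTL would live in Redis itself (`SET key value EX 900`) rather than application memory.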

3. Users should be able to write reviews, rate businesses, upload photos, and vote on reviews

Additional components:

  • Photo Service: Handles photo uploads, processing, storage, and moderation.
  • Spam Detection Service: ML-based service to detect fake reviews, spam, and policy violations.
  • Content Moderation Service: Automated and human-in-the-loop content moderation.
  • S3 Object Storage: Stores original and processed photos (multiple resolutions).
  • SQS Queues: Asynchronous processing queues for photo processing, spam detection, and rating updates.

Review Creation Flow:

  1. User writes review with rating and optional photos in client app.
  2. If photos included, client uploads directly to S3 using presigned URLs from Photo Service.
  3. Client submits review via POST to the reviews endpoint.
  4. Review Service validates input (rating 1-5, text length, user hasn’t reviewed this business).
  5. Review Service publishes review to SQS queue for async processing.
  6. Multiple workers consume from queue: Spam Detection Service runs ML model to detect fake reviews (analyze text patterns, user history, timing), Content Moderation Service checks for profanity, hate speech, personal information. If photos attached, Photo Service triggers image processing (resize, compress, moderate).
  7. If review passes all checks, it’s persisted to PostgreSQL and marked as “published”.
  8. Rating Aggregation Service is notified to recalculate business rating (async, eventual consistency).
  9. Elasticsearch index is updated with new review count and rating.
  10. Business owner is notified of new review via Notification Service.

Helpful Vote Flow:

  1. User taps “helpful” button on a review.
  2. Client sends POST to the vote endpoint with review ID and vote type.
  3. Review Service validates user hasn’t already voted on this review.
  4. Vote is recorded in database, review’s helpful count is incremented.
  5. Review ranking is updated (reviews with more helpful votes rank higher).

4. Users should be able to check in to businesses

Check-in Flow:

  1. User taps “Check In” button on business page.
  2. Client sends location coordinates with check-in request.
  3. Business Service validates user is within reasonable distance (~100m) of business location.
  4. Check-in is recorded in database with timestamp.
  5. User’s check-in count for this business is incremented.
  6. Gamification Service evaluates if user earned any badges (e.g., “Regular” after 5 check-ins).
  7. Check-in data is used to verify review authenticity (users who checked in have higher trust score).
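The distance validation in step 3 is a straightforward Haversine calculation. A sketch, using the conventional 6,371 km Earth radius:

```python
import math

# Haversine great-circle distance in meters, used to verify the user is
# within ~100 m of the business before accepting a check-in.
def distance_m(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def checkin_allowed(user_lat, user_lon, biz_lat, biz_lon, max_m=100):
    return distance_m(user_lat, user_lon, biz_lat, biz_lon) <= max_m
```

In practice the threshold would likely be tuned per business type (a stadium needs a larger radius than a coffee shop) and padded for GPS error.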

5. Business owners should be able to claim businesses and respond to reviews

Additional components:

  • Business Owner Dashboard: Separate web application for business owners.
  • Analytics Service: Provides insights on views, clicks, calls, direction requests.

Review Response Flow:

  1. Business owner logs into dashboard, sees new reviews.
  2. Owner writes response to a review.
  3. Response goes through light moderation, then is published.
  4. Response is stored in database, linked to original review.
  5. User who wrote review is notified of owner response.

Step 3: Design Deep Dive

With the core functionality established, let’s dive deep into the critical technical challenges that define Yelp’s architecture.

Deep Dive 1: How do we implement efficient geospatial search at scale?

Geospatial search is the heart of Yelp - users need to find businesses near them quickly and accurately. With 200 million businesses globally and 100 million daily searches, we need an optimized approach.

Challenge: Traditional SQL databases with plain lat/long columns have no index structure that understands distance. A naive query must evaluate the Haversine formula per row, and even a bounding-box prefilter over separate B-tree indexes leaves a large candidate set to scan. For 200M businesses, this is prohibitively slow.

Solution: Elasticsearch with Geo Queries + Quad Tree

Elasticsearch Geo Capabilities:

Elasticsearch provides first-class geospatial support through the geo_point data type and specialized geo queries. The geo_distance query finds documents within a specified distance from a point by combining category matching with geographic filtering within a radius. The geo_bounding_box query finds documents within a rectangular area (useful for map viewport searches). The geohash grid aggregation groups businesses by geographic tiles for heat maps and density analysis.

How Elasticsearch Implements Geospatial Search:

Internally, Elasticsearch uses a BKD tree (Block K-Dimensional tree), which is an evolution of K-D trees optimized for disk-based storage. For geospatial data, latitude and longitude are encoded into a single long value using a space-filling curve (similar to Geohash). The BKD tree recursively partitions space into smaller blocks. Each block is stored in a compact binary format on disk. Search starts at the root and prunes entire subtrees that can’t contain results. Typical query time is O(log N + K) where K is the number of results.

Quad Tree for In-Memory Optimization:

For extremely hot searches and real-time filtering of active businesses, we maintain an in-memory quad tree. A quad tree recursively divides 2D space into four quadrants. Each node represents a geographic region and contains a boundary (lat/long min/max), a list of business IDs in this region (up to a threshold, e.g., 50), and four child nodes (NE, NW, SE, SW) if threshold exceeded.

Quad Tree Search Algorithm: The search algorithm starts at the root node and checks if the node’s boundary intersects with the search circle. If not, it prunes that branch. If the node is a leaf, it checks each business in the node to see if the distance to the search point is within the radius, adding matching businesses to results. Otherwise, it recursively searches each child node.
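The structure and search algorithm described above can be sketched directly. This version uses plain coordinate units for distance (production code would use haversine) and a small capacity to keep the example short:

```python
# Minimal quad tree: leaves hold up to `capacity` points; once exceeded, a
# node splits into four quadrants. Radius search prunes quadrants whose
# bounding box cannot intersect the query circle.
class QuadTree:
    def __init__(self, x_min, y_min, x_max, y_max, capacity=50):
        self.bounds = (x_min, y_min, x_max, y_max)
        self.capacity = capacity
        self.points = []       # (x, y, business_id)
        self.children = None   # four QuadTree nodes once split

    def insert(self, x, y, business_id):
        x_min, y_min, x_max, y_max = self.bounds
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y, business_id))
                return True
            self._split()
        return any(c.insert(x, y, business_id) for c in self.children)

    def _split(self):
        x_min, y_min, x_max, y_max = self.bounds
        mx, my = (x_min + x_max) / 2, (y_min + y_max) / 2
        self.children = [
            QuadTree(x_min, y_min, mx, my, self.capacity),
            QuadTree(mx, y_min, x_max, my, self.capacity),
            QuadTree(x_min, my, mx, y_max, self.capacity),
            QuadTree(mx, my, x_max, y_max, self.capacity),
        ]
        for p in self.points:
            for c in self.children:
                if c.insert(*p):
                    break
        self.points = []

    def search(self, cx, cy, radius):
        x_min, y_min, x_max, y_max = self.bounds
        # Prune: distance from the circle centre to the nearest box point.
        nx = min(max(cx, x_min), x_max)
        ny = min(max(cy, y_min), y_max)
        if (nx - cx) ** 2 + (ny - cy) ** 2 > radius ** 2:
            return []
        if self.children is None:
            return [bid for (x, y, bid) in self.points
                    if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
        out = []
        for c in self.children:
            out.extend(c.search(cx, cy, radius))
        return out
```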

Hybrid Approach: Use Elasticsearch for complex queries with multiple filters (category, rating, price, text search). Use in-memory Quad Tree for simple proximity searches on hot data (recently updated businesses, currently open). The query optimizer decides which data structure to use based on query complexity.

Caching Strategy: Cache keys are computed by hashing location, radius, filters, and sort parameters. Cache values contain the list of business IDs plus metadata. TTL is 5 minutes for anonymous users and 1 minute for logged-in users (for personalization). Common searches like “pizza 94102” or “restaurants near Union Square” have cache hit rates of 80%+, reducing Elasticsearch load significantly.
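The cache-key computation above might look like the following sketch. Rounding coordinates to roughly three decimal places (about 100 m) is an assumption added here so that nearby users share cache entries:

```python
import hashlib, json

# Normalise search parameters, then hash them so equivalent queries map to
# the same cache key. Coordinate rounding lets nearby users share entries.
def search_cache_key(lat, lon, radius, filters, sort_by):
    normalised = {
        "lat": round(lat, 3),
        "lon": round(lon, 3),
        "radius": radius,
        "filters": sorted(filters.items()),
        "sort": sort_by,
    }
    payload = json.dumps(normalised, sort_keys=True).encode()
    return "search:" + hashlib.sha256(payload).hexdigest()[:16]
```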

Database Sharding: For the PostgreSQL database, we shard by geographic region (city or metro area). For example, US-West-1 covers California, Nevada, and Arizona, while US-East-1 covers New York, New Jersey, and Pennsylvania. Each shard is a separate PostgreSQL instance with read replicas. The Search Service routes queries to appropriate shards based on search location. For cross-shard searches (rare, like “pizza in USA”), we use a scatter-gather approach.

Deep Dive 2: How do we rank and filter millions of reviews effectively?

With 244 million reviews on the platform, displaying the most relevant reviews to users is critical for discovery and trust. Users shouldn’t have to scroll through hundreds of reviews to find useful information.

Challenge: Reviews vary widely in quality, length, and helpfulness. Older reviews may be outdated (restaurant changed chef, store under new management). Some reviews are fake or incentivized. We need to balance recency, helpfulness, and diversity.

Solution: Multi-Signal Review Ranking Algorithm

Review Ranking Signals:

Helpful Votes (Primary Signal): Users can mark reviews as “Useful”, “Funny”, or “Cool”. We apply weighted scoring where Useful equals 1.0, Funny equals 0.7, and Cool equals 0.5. We apply time decay so votes in last 30 days are worth more than votes from 2 years ago. The helpful score formula combines these weighted votes multiplied by a time decay factor.

Reviewer Credibility: Elite status (Yelp’s trusted reviewers) receives a 1.5x multiplier. Review count uses logarithmic scaling to prevent gaming. We consider historical helpful vote ratio (percentage of reviewer’s reviews marked helpful), account age (older accounts get slight boost), and check-in verification (reviews from users who checked in get boost).

Review Quality Metrics: Reviews between 100-500 words get a boost (too short lacks detail, too long rarely read). Photos attached add points to the score. Reviews that receive responses from owners get a slight boost (indicates engagement). Reviews edited multiple times may indicate lower quality.

Recency with Decay: Recent reviews (under 30 days) get 1.0x multiplier. Reviews 30-90 days old get 0.9x, 90-180 days get 0.8x, 180-365 days get 0.7x, 1-2 years get 0.5x, and reviews over 2 years old get 0.3x. Decay slower for businesses with few reviews (under 10).

Review Diversity: Ensure top reviews represent different rating levels (not all 5-star or all 1-star). Use clustering to identify different topics/aspects mentioned. Promote reviews covering different aspects (food, service, ambiance).

Ranking Formula: The final score combines helpful score (40% weight), credibility score (30% weight), quality score (20% weight), and recency score (10% weight), then multiplied by a diversity bonus.
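The ranking formula and the recency table above can be sketched together. The component scores are assumed to be normalised to [0, 1] upstream:

```python
# 40/30/20/10 weighted blend of the four ranking signals, multiplied by a
# diversity bonus, plus the step-wise recency decay from the table above.
WEIGHTS = {"helpful": 0.40, "credibility": 0.30, "quality": 0.20, "recency": 0.10}

def recency_multiplier(age_days):
    for limit, mult in [(30, 1.0), (90, 0.9), (180, 0.8), (365, 0.7), (730, 0.5)]:
        if age_days < limit:
            return mult
    return 0.3  # reviews over 2 years old

def review_rank_score(helpful, credibility, quality, recency, diversity_bonus=1.0):
    base = (WEIGHTS["helpful"] * helpful
            + WEIGHTS["credibility"] * credibility
            + WEIGHTS["quality"] * quality
            + WEIGHTS["recency"] * recency)
    return base * diversity_bonus
```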

Implementation:

Reviews are pre-ranked and cached. When a new review is posted or voted on, we trigger an async ranking update. We store top 20 review IDs per business in a Redis sorted set with the key pattern business:businessId:top_reviews. The score is the final ranking score with a TTL of 1 hour.

For dynamic filtering (e.g., “show only 1-star reviews”), we fall back to database query but still use pre-computed scores.

The database schema includes a reviews table with columns for review ID, business ID, user ID, rating (1-5 constrained), text, timestamps, vote counts (helpful, funny, cool), ranking score (pre-computed for sorting), spam score (from ML model), moderation status, photo indicator, and owner response indicator. A unique constraint prevents users from reviewing the same business multiple times. Indexes on business ID with ranking score descending (partial index filtering for approved status), business ID with creation date descending, and business ID with rating and ranking score enable efficient queries.

Deep Dive 3: How do we compute and update business ratings efficiently?

Every new review changes a business’s average rating, but we can’t afford to recalculate 200M business ratings on every review submission. We need efficient incremental updates.

Challenge: With 1M new reviews per day (approximately 12 reviews/second sustained, peaking at 100+ reviews/second), the simple approach of calculating average ratings from all reviews is expensive for businesses with 10,000+ reviews. The rating must be updated within 5 seconds for good user experience (eventual consistency acceptable).

Solution: Weighted Average with Incremental Updates

Weighted Rating Formula (Bayesian Average):

To prevent new businesses with only a few 5-star reviews from ranking above established businesses, we use a weighted average. The formula is: weighted_rating = (C × m + R × v) / (m + v), where R is the average rating for the business, v is the number of reviews for the business, m is the smoothing weight in virtual reviews (e.g., 5), and C is the global average rating across all businesses (e.g., 3.7).

This formula adds “virtual reviews” at the global average, preventing extreme ratings for businesses with few reviews.
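The formula is a one-liner; the defaults below use the m = 5 and C = 3.7 figures from the text:

```python
# Bayesian weighted rating: m "virtual reviews" at the global average C
# pull businesses with few reviews toward 3.7.
def weighted_rating(avg_rating_r, review_count_v, m=5, global_avg_c=3.7):
    return (global_avg_c * m + avg_rating_r * review_count_v) / (m + review_count_v)
```

A business with zero reviews sits exactly at the global average, and as v grows the weighted rating converges to the business's own average R.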

Incremental Update Algorithm:

Instead of recalculating from scratch, we maintain running totals. The business_rating_aggregates table stores business ID, total rating sum (sum of all ratings), total review count, counts for each rating level (1-5 stars), weighted average (cached result), last updated timestamp, and version number for optimistic locking.

When a new review is posted, we update the aggregates table by incrementing the total rating sum, incrementing the total review count, incrementing the specific rating count, recalculating the weighted average, updating the timestamp, and incrementing the version number. We use optimistic locking by checking the version number matches the expected value.
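The incremental update with optimistic locking can be sketched in memory as below; in production this would be a single `UPDATE ... WHERE version = :expected` statement against PostgreSQL, with the worker retrying on a zero-row result:

```python
import dataclasses

# In-memory stand-in for the business_rating_aggregates row. apply_review
# succeeds only when the caller's expected version matches (optimistic lock).
@dataclasses.dataclass
class RatingAggregate:
    rating_sum: int = 0
    review_count: int = 0
    star_counts: tuple = (0, 0, 0, 0, 0)  # 1-star .. 5-star
    weighted_avg: float = 3.7
    version: int = 0

def apply_review(agg, rating, expected_version, m=5, global_avg=3.7):
    if agg.version != expected_version:
        return False  # version mismatch: caller retries with fresh state
    counts = list(agg.star_counts)
    counts[rating - 1] += 1
    agg.rating_sum += rating
    agg.review_count += 1
    agg.star_counts = tuple(counts)
    avg = agg.rating_sum / agg.review_count
    agg.weighted_avg = (global_avg * m + avg * agg.review_count) / (m + agg.review_count)
    agg.version += 1
    return True
```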

Async Processing with SQS: When a review is created, we publish a message to the review-rating-updates SQS queue. The Rating Worker pulls messages in batches (up to 10 messages at once), updates the business_rating_aggregates table, updates the Elasticsearch index with the new rating, and invalidates the Redis cache for this business. If the update fails due to a version mismatch (rare race condition), we retry with exponential backoff.

Handle High-Traffic Businesses:

For extremely popular businesses (e.g., famous restaurants with 10,000+ reviews), we use additional optimization. We batch updates by accumulating multiple review submissions over 5 seconds, apply updates in a single transaction, and use database row-level locking to prevent conflicts.

Rating Distribution Caching:

The rating distribution (how many 1-star, 2-star, etc.) is displayed prominently on business pages. We cache this in Redis with the key pattern business:businessId:rating_dist. The value contains counts for each star rating, the weighted average, and the total count. TTL is 5 minutes.

Deep Dive 4: How do we detect and prevent fake reviews and spam?

Fake reviews and spam erode user trust and damage the platform’s credibility. Yelp must aggressively detect and filter fraudulent content.

Challenge: Competitors may post fake negative reviews. Businesses may post fake positive reviews or incentivize users. Sophisticated attackers use multiple accounts, varied language, and timing. Manual moderation doesn’t scale (1M reviews/day).

Solution: Multi-Layer ML-Based Spam Detection

Layer 1: Rule-Based Filtering (Fast, High Precision)

Catch obvious spam immediately by checking for duplicate content (exact or near-duplicate text from same user or different users), account age (new accounts under 7 days posting multiple reviews flagged for review), review velocity (user posting 10+ reviews in 1 hour), suspicious patterns (reviews containing URLs, phone numbers, email addresses), keyword matching (promotional language like “call now”, “visit website”), and relationship detection (user reviewing multiple businesses with same owner).

The rule-based filter assigns a score from 0-100. Account age under 7 days adds 30 points. Recent reviews over 10 in last hour adds 50 points. Finding similar reviews adds 60 points. Promotional content adds 40 points. Scores over 70 indicate high spam probability.
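The scoring rules above, plus the routing thresholds described later in the infrastructure section (reject over 70, approve under 20, otherwise escalate to the ML layer), can be sketched as:

```python
# Rule-based spam scoring with the point values from the text, capped at
# 100, and the three-way routing decision applied to the result.
def rule_spam_score(account_age_days, reviews_last_hour,
                    similar_review_found, has_promo_content):
    score = 0
    if account_age_days < 7:
        score += 30
    if reviews_last_hour > 10:
        score += 50
    if similar_review_found:
        score += 60
    if has_promo_content:
        score += 40
    return min(score, 100)

def rule_decision(score):
    if score > 70:
        return "reject"    # obvious spam, dropped synchronously
    if score < 20:
        return "approve"
    return "ml_layer"      # medium scores go on to the ML model
```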

Layer 2: ML-Based Detection (Deep Learning)

Train a gradient boosting model (XGBoost or LightGBM) with comprehensive features.

User Features: Account age in days, total reviews posted, average review length, review frequency (reviews per month), helpful votes received per total reviews, elite status (binary), profile completeness (has photo, friends, etc.), check-in history (count per business), and social graph metrics (friend count, friend-of-friend connections).

Review Features: Text length (characters, words), sentiment score (using NLP model), linguistic features (readability score, spelling errors, grammar errors), rating deviation from business average, time since last review from this user, photos attached (binary), specific keywords (TF-IDF vectors), and writing style consistency with user’s previous reviews.

Business Features: Total review count, recent review velocity (sudden spike indicates campaign), owner response rate, average rating change after this review, and category (some categories more prone to fake reviews, e.g., dentists, locksmiths).

Relational Features: Connection between reviewer and business owner (same IP, same location, mutual friends), reviewer’s history with competitors of this business, and clustering to identify groups of users with similar review patterns.

The model predicts spam probability. If probability exceeds 0.7, reject the review. If probability is between 0.4 and 0.7, send to human review. If below 0.4, approve. Model performance targets 85% precision (85% of flagged reviews are actually spam) and 78% recall (78% of spam reviews are caught), with an F1 score of 0.81.

Layer 3: Behavioral Analysis (Graph-Based)

Build a graph of relationships: User to Business (reviewed), User to User (friends), Business to Owner, and Device to User (device fingerprint).

Detect suspicious patterns including ring networks (groups of users reviewing same set of businesses), Sybil attacks (multiple accounts from same device/IP), and review farms (businesses receiving reviews from same group of users).

Use graph algorithms including community detection (Louvain algorithm) to find clusters, PageRank to identify trustworthy users, and random walk sampling to detect anomalies.

Layer 4: Human Moderation (Final Layer)

Reviews with medium spam scores (0.4-0.7) go to human moderators. The moderation queue is prioritized by potential impact (popular businesses first). Moderators see review text, user history, model’s confidence scores, and similar reviews. They make decisions to Approve, Reject, or Request More Information from user. Feedback from moderators is used to retrain the ML model.

Infrastructure:

The moderation pipeline works as follows:

  1. The review submission flows through the API Gateway to the Review Service, which publishes to the SQS review-moderation queue.
  2. Multiple workers consume from the queue. The Rule-Based Filter (under 5ms, synchronous) immediately rejects high scores (over 70), approves low scores (under 20), and sends medium scores (20-70) to the ML layer.
  3. The ML Service fetches features from the database/cache and runs model inference (~50ms). Reviews scoring over 0.7 are rejected; under 0.4 are approved.
  4. Reviews scoring 0.4-0.7 go to the Human Queue in the Moderation Dashboard.
  5. Finally, the review status is updated in the database.

Continuous Improvement:

A/B test new model versions (champion vs. challenger). Monitor precision/recall metrics in real-time. Perform weekly model retraining with new labeled data. Use adversarial training to simulate attacks and improve robustness.

Deep Dive 5: How do we build a recommendation engine for businesses?

Beyond search, Yelp provides personalized recommendations to help users discover new businesses they’ll love.

Challenge: Cold start problem for new users with no history. Sparsity as most users only review a tiny fraction of businesses. Real-time updates so recommendations should reflect recent user behavior. Scalability with billions of potential user-business pairs.

Solution: Hybrid Recommendation System

Approach 1: Collaborative Filtering (CF)

Find similar users based on review history, then recommend businesses those similar users liked.

Matrix Factorization with ALS (Alternating Least Squares):

We create a sparse User-Business Rating Matrix where most entries are empty. We factorize this into two matrices: Users (n_users × k) and Businesses (n_businesses × k), where k is the number of latent factors (e.g., 100). The predicted rating for a (user, business) pair is the dot product of that user's factor vector and that business's factor vector.

Use Apache Spark MLlib for distributed training with rank 100 latent factors, 10 iterations, regularization parameter 0.1, and cold start strategy of dropping unknown users/items.

Challenges with Pure CF: Cold start for new users/businesses with no embeddings. Doesn’t capture business attributes (category, location). Computationally expensive for real-time serving.

Approach 2: Content-Based Filtering

Recommend businesses similar to ones the user liked, based on business attributes.

Feature Engineering: Category vector (one-hot encoding of primary category), price level (1-4), location embedding (geohash or normalized lat/long), attributes (parking, wifi, outdoor seating) as binary vector, average rating, and popularity (log of review count).

Similarity Computation: For each business, compute an embedding vector. Find similar businesses using cosine similarity. The user profile is computed as a weighted average of businesses they liked (rated 4-5). Compute similarity between user profile and all business features. Return top-K businesses with highest similarity scores.
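The content-based steps above can be sketched in pure Python: build a user profile as a rating-weighted average of liked businesses' feature vectors, then rank candidates by cosine similarity:

```python
import math

# Cosine similarity between a user profile vector and candidate business
# feature vectors; the profile is a weighted average of liked businesses.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def user_profile(liked_vectors, weights):
    dims = len(liked_vectors[0])
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(liked_vectors, weights)) / total
            for i in range(dims)]

def recommend(profile, candidates, k=3):
    # candidates: {business_id: feature_vector}
    scored = sorted(candidates.items(),
                    key=lambda kv: cosine(profile, kv[1]), reverse=True)
    return [bid for bid, _ in scored[:k]]
```

At Yelp scale the brute-force scan over all candidates would be replaced by an approximate nearest-neighbor index, as discussed for the two-tower model below.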

Approach 3: Hybrid Model (Best of Both Worlds)

Combine CF and content-based with learned weights. The final score is a weighted combination of collaborative filtering score, content-based score, and popularity score, where the weights sum to 1. Learn optimal weights using gradient descent on historical data (maximize click-through rate).

Deep Learning Approach: Two-Tower Model

Modern architecture used by Google, YouTube, and Pinterest. The User Tower takes user ID, history, and location through an embedding layer and dense layers with ReLU activation to produce a user embedding (128 dimensions). The Business Tower takes business ID, category, and location through an embedding layer and dense layers with ReLU activation to produce a business embedding (128 dimensions). The dot product of these embeddings produces the predicted rating.

Training: Use all historical ratings as positive examples. Sample random user-business pairs as negative examples. Objective is to maximize score for positive pairs and minimize for negative pairs. Loss function is binary cross-entropy or pairwise ranking loss.

Serving: Pre-compute business embeddings for all businesses (batch job, daily). Store business embeddings in a vector database (Faiss, Milvus, or Pinecone). At serving time, compute user embedding in real-time (under 5ms), perform approximate nearest neighbor (ANN) search in vector database, and return top-K businesses with highest dot product. ANN search using HNSW (Hierarchical Navigable Small World) index is sub-millisecond.

Cold Start Solutions:

For new users, use location plus demographics to assign to user cluster, show popular businesses in their area, and after first few interactions, update profile.

For new businesses, use content features (category, location, price), bootstrap with similar established businesses, and gradually incorporate collaborative signals as reviews accumulate.

Real-Time Personalization:

Store recent user interactions in Redis with the key pattern user:userId:recent_activity. The value is a sorted set of business ID, timestamp, and action type. TTL is 30 days.

Recompute user embeddings incrementally as they interact, using exponential decay for older interactions.
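The incremental update can be sketched as an exponential moving average, so each new interaction pulls the profile toward the item while older interactions fade (the decay factor is a hypothetical tuning knob):

```python
import numpy as np

def update_user_embedding(current: np.ndarray, item_emb: np.ndarray,
                          decay: float = 0.9) -> np.ndarray:
    """Exponential moving average: with decay=0.9, an interaction's
    influence halves roughly every 6-7 subsequent interactions."""
    return decay * current + (1.0 - decay) * item_emb
```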

A/B Testing:

10% of users see ML recommendations. 10% see rule-based recommendations (control). 80% see winning model (from previous test).

Metrics include click-through rate (CTR), conversion rate (clicked to called/checked in), user engagement (time on platform), and revenue impact (for businesses).
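Deterministic hash-based bucketing is one common way to implement this split; the experiment name and bucket labels below are illustrative:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str = "recs_v2") -> str:
    """Hash (experiment, user_id) into [0, 100) so each user lands in
    a stable bucket. Splits mirror the text: 10% ML, 10% rule-based
    control, 80% previous winner."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    p = int(digest, 16) % 100
    if p < 10:
        return "ml_recommendations"
    if p < 20:
        return "rule_based_control"
    return "winning_model"
```

Seeding the hash with the experiment name keeps buckets independent across experiments, so a user in the ML arm of one test is not systematically in the ML arm of the next.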

Deep Dive 6: How do we handle photo storage, processing, and moderation?

Photos are crucial for Yelp: businesses with photos get 35% more clicks. But storing and serving billions of photos efficiently requires careful design.

Challenge: Over 10M photos uploaded per month (approximately 4 photos/second sustained). Each photo needs multiple resolutions (thumbnail, medium, high-res). Content moderation for inappropriate images. Global CDN delivery for low latency. Cost optimization (storage is expensive at scale).

Solution: Multi-Stage Photo Pipeline

Upload Flow:

Client Request: User selects photo in mobile app. Client requests presigned upload URL from Photo Service. Include metadata: business_id, caption, image dimensions.

Presigned URL Generation: Generate unique photo ID. Create key path for S3 storage. Create presigned URL (valid for 15 minutes) using S3 client with put_object permission. Return presigned URL and photo ID to client.

Direct Upload to S3: Client uploads directly to S3 using presigned URL (bypasses API servers). S3 stores original high-resolution image. S3 event notification triggers Lambda function.
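A sketch of the server side of this flow, assuming S3 and boto3; the date-partitioned key layout is an illustrative convention, not Yelp's actual scheme:

```python
import uuid
from datetime import datetime, timezone

def make_photo_key(business_id: str) -> tuple[str, str]:
    """Generate a unique photo ID and its S3 key path for the original
    upload. Date partitioning keeps listing and lifecycle rules simple."""
    photo_id = uuid.uuid4().hex
    now = datetime.now(timezone.utc)
    key = f"photos/original/{now:%Y/%m/%d}/{business_id}/{photo_id}.jpg"
    return photo_id, key

def make_presigned_url(s3_client, bucket: str, key: str) -> str:
    """Presigned PUT URL, valid 15 minutes; client uploads directly,
    bypassing the API servers."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=15 * 60,
    )
```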

Async Processing Pipeline:

Step 1: Image Processing (Lambda + SQS) Download original image from S3. Generate multiple resolutions: thumbnail (150x150 square crop), small (300x300), medium (600x600), large (1200x1200), and original. Optimize and compress each resolution using quality 85 WebP format. Upload each processed version to S3. Extract metadata including dimensions, file size, EXIF data, and dominant colors.
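The resize/crop arithmetic can be sketched independently of the imaging library (with Pillow, the box below would feed `Image.crop` before `Image.resize` and a WebP save at quality 85):

```python
# Resolution ladder from the text (square edge length in pixels)
RESOLUTIONS = {"thumbnail": 150, "small": 300, "medium": 600, "large": 1200}

def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Largest centered square (left, top, right, bottom); the crop is
    then resized down to each target resolution."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```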

Step 2: Content Moderation

Phase A: Automated ML Moderation Use pre-trained computer vision models: load a ResNet50 backbone for image classification, preprocess each image (resize, center crop, normalize), and run inference to detect inappropriate content. Custom classifier heads produce NSFW and violence scores; mark the photo as safe only if both scores fall below their thresholds.

Use AWS Rekognition or Google Cloud Vision API for explicit content detection, violence/gore detection, text in images (detect spam/ads), and celebrity recognition (privacy concerns).

Phase B: Human Moderation (for ambiguous cases) Photos with moderate scores (0.3-0.7) sent to human moderators. Show image plus context (business type, uploader history). Moderator marks as: Approve, Reject, Request Review. Turnaround time: under 1 hour for 99% of photos.
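The two-phase routing reduces to a small threshold function; the 0.3/0.7 cut-offs come from the text, while the label names are illustrative:

```python
def route_photo(nsfw: float, violence: float,
                low: float = 0.3, high: float = 0.7) -> str:
    """Route a photo based on its worst model score: clearly safe
    photos auto-approve, clearly bad ones auto-reject, and the
    ambiguous middle band goes to a human moderator."""
    worst = max(nsfw, violence)
    if worst < low:
        return "approved"
    if worst > high:
        return "rejected"
    return "human_review"
```

Tightening the band shrinks the human-moderation queue at the cost of more automated mistakes at the edges.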

Update Database: Insert photo record with all URLs (original, thumbnail, medium, large), dimensions, file size, caption, moderation status, and timestamp into the photos table.

CDN Distribution:

Photos cached globally via CloudFront with S3 bucket as origin. Cache TTL is 30 days (photos rarely change). Cache key pattern is size/photoId.webp. Apply Gzip/Brotli compression for faster transfer. Use lazy loading on client: load thumbnails first, high-res on demand.

Cost Optimization:

Use S3 Intelligent Tiering which moves infrequently accessed photos to cheaper storage classes. Older photos (over 2 years, under 10 views) moved to S3 Glacier Deep Archive (80% cost savings). Aggressive WebP compression reduces storage by 30% vs JPEG. Delete original uploads after processing (only keep processed versions).

Estimated Costs: For 10 million photos per month with 1 MB average size, S3 Storage costs approximately $230, CloudFront costs $750 for 1 billion requests, and Lambda processing costs $833 for 10M invocations of 5 seconds each. Total estimated cost is around $2,000/month.
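The arithmetic behind these estimates, using approximate published AWS rates (actual bills vary with region, tiering, and negotiated pricing):

```python
photos = 10_000_000   # uploads per month
avg_mb = 1.0          # average photo size

# S3 standard storage for one month of new uploads (~$0.023/GB-month)
storage_gb = photos * avg_mb / 1024
s3_cost = storage_gb * 0.023

# CloudFront request pricing (~$0.0075 per 10k HTTPS requests)
cf_requests = 1_000_000_000
cf_cost = cf_requests / 10_000 * 0.0075

# Lambda: 10M invocations x 5 s at 1 GB (~$0.0000166667 per GB-second)
lambda_cost = photos * 5 * 1 * 0.0000166667
```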

Deep Dive 7: How do we provide real-time updates for popular businesses?

Popular restaurants receive hundreds of reviews, photos, and check-ins per day. Users expect to see the latest information immediately.

Challenge: High-traffic businesses change rapidly (new reviews, rating updates). Cache invalidation must work at scale. Consistency matters: all users should see the same information within an acceptable window (around 5 seconds).

Solution: Event-Driven Cache Invalidation

Architecture:

Review Service publishes to Kafka Topic: business-updates. A Kafka Consumer Group consumes these events. The consumer group splits into three parallel consumers: one invalidates cache (Redis), one updates Elasticsearch index, and one notifies clients (WebSocket).

Kafka Topic Structure:

The business-updates topic contains messages with event_type (e.g., review_created), business_id, timestamp, and payload containing details like review_id, rating, and whether photos are attached.
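An illustrative business-updates message (field names follow the text, values are made up); keying by business_id keeps all events for one business on the same partition and therefore in order:

```python
import json
import time

event = {
    "event_type": "review_created",
    "business_id": "biz_8731",
    "timestamp": int(time.time() * 1000),
    "payload": {"review_id": "rev_5521", "rating": 4, "has_photos": True},
}

# Serialize for the producer; with kafka-python this would be
# producer.send("business-updates", key=key, value=value)
key = event["business_id"].encode()
value = json.dumps(event).encode()
```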

Cache Invalidation Strategy:

Pattern 1: Time-Based Expiration (TTL) Hot businesses (over 100 reviews/month) have TTL of 5 minutes. Warm businesses (10-100 reviews/month) have TTL of 30 minutes. Cold businesses (under 10 reviews/month) have TTL of 2 hours.

Pattern 2: Event-Driven Invalidation When review posted, invalidate specific cache keys including business details, top reviews, and rating distribution. Invalidating search results containing this business is complex, so we primarily rely on TTL for those.

Pattern 3: Write-Through Cache For critical data like rating aggregates, update the database then immediately update cache (write-through). The cached data includes computed rating summary with appropriate TTL.
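Patterns 1 and 3 reduce to a few lines; plain dicts stand in for PostgreSQL and Redis here, and the TTL tiers are the ones from the text:

```python
def cache_ttl_seconds(reviews_per_month: int) -> int:
    """TTL tiers: hot businesses 5 min, warm 30 min, cold 2 h."""
    if reviews_per_month > 100:
        return 5 * 60
    if reviews_per_month >= 10:
        return 30 * 60
    return 2 * 3600

def write_through_rating(db: dict, cache: dict, business_id: str,
                         total_stars: float, review_count: int) -> None:
    """Write-through for rating aggregates: persist to the database
    first, then immediately refresh the cached summary."""
    db[business_id] = (total_stars, review_count)
    cache[business_id] = {
        "avg": round(total_stars / review_count, 2),
        "count": review_count,
    }
```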

Real-Time Notifications via WebSocket:

For users actively viewing a business page, the client establishes a WebSocket connection. The client sends a subscribe action with the business_id. The server pushes updates to subscribed clients. On receiving updates about new reviews or rating changes, the client refreshes the appropriate sections.

WebSocket Server Implementation: Create WebSocket server on port 8080. Maintain a map of business_id to set of WebSocket connections (subscriptions). Set up Kafka consumer with group ID for WebSocket notifications. Subscribe to business-updates topic. For each message, extract the business_id and send the update to all subscribed clients for that business. Handle new WebSocket connections by allowing clients to subscribe to specific businesses. On WebSocket close, remove from all subscriptions.
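The subscription bookkeeping at the heart of that server can be sketched without a real WebSocket library; connection objects here are just hashable stand-ins:

```python
from collections import defaultdict

class SubscriptionRegistry:
    """In-memory map of business_id -> connected clients, mirroring
    the subscribe / publish / close handling described above."""

    def __init__(self):
        self._subs = defaultdict(set)

    def subscribe(self, business_id, ws):
        self._subs[business_id].add(ws)

    def unsubscribe_all(self, ws):
        # Called when a WebSocket connection closes
        for clients in self._subs.values():
            clients.discard(ws)

    def publish(self, business_id, message, send=lambda ws, msg: None):
        # Called from the Kafka consumer for each business-updates event;
        # returns the number of clients notified
        clients = self._subs.get(business_id, ())
        for ws in clients:
            send(ws, message)
        return len(clients)
```

For cross-server fan-out, each instance would hold only its local connections and the Redis-backed subscription store would route events to the right instance.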

Scalability: Run multiple WebSocket server instances behind load balancer. Use sticky sessions (same user always routed to same server). Store subscriptions in Redis for cross-server communication. Limit concurrent WebSocket connections per server (e.g., 10,000).

Step 4: Wrap Up

In this comprehensive design, we’ve architected a production-grade system for Yelp that handles billions of reviews, millions of concurrent users, and complex recommendation algorithms. Let’s recap the key architectural decisions and discuss additional considerations.

Core Architecture Summary:

  1. Geospatial Search: Elasticsearch with BKD trees and in-memory quad trees for sub-200ms query latency across 200M businesses.

  2. Review Ranking: Multi-signal algorithm combining helpfulness votes, reviewer credibility, recency, and quality metrics with pre-computed scores cached in Redis.

  3. Rating Aggregation: Incremental updates using running totals with weighted Bayesian averages, processed asynchronously via SQS queues.

  4. Spam Detection: Multi-layer approach with rule-based filtering, ML models (XGBoost), graph analysis, and human moderation achieving 85% precision and 78% recall.

  5. Recommendations: Hybrid system combining collaborative filtering (matrix factorization), content-based filtering, and two-tower neural networks with vector similarity search.

  6. Photo Pipeline: Direct S3 uploads with async processing (resizing, compression, WebP conversion), ML-based content moderation, and global CDN distribution.

  7. Real-Time Updates: Event-driven architecture with Kafka, intelligent cache invalidation, and WebSocket notifications for live business pages.

Scaling Considerations:

Geographic Sharding: Shard PostgreSQL databases by metro area (NYC, SF, LA, etc.). Each shard has read replicas for query scaling. Cross-shard queries are rare but handled via scatter-gather.

Elasticsearch Scaling: Deploy 50+ node cluster with index-per-region strategy. Use hot-warm-cold architecture: recent data on SSDs, old data on HDDs. Implement reindex pipeline for schema changes without downtime.

Cache Hierarchies: L1 is application cache (in-memory, per instance). L2 is Redis cluster (distributed cache). L3 is CDN (CloudFront for API responses and media).

Async Processing: Use Kafka for event streaming (review updates, check-ins). Use SQS for task queues (photo processing, spam detection). Use Temporal for workflow orchestration (multi-step review moderation).

Database Optimizations: Connection pooling (PgBouncer) to handle 10,000+ concurrent connections. Read-heavy optimizations include materialized views and covering indexes. Partition reviews table by creation month.

Additional Features for Discussion:

Reservations Integration: Partner with OpenTable, Resy for reservation APIs. Display real-time availability. Track conversion funnel: view to click reserve to complete.

Business Analytics Dashboard: Real-time metrics include page views, calls, direction requests. Trend analysis covers rating over time and review sentiment. Competitive benchmarking compares to similar businesses.

Advanced Search Features: Natural language queries like “best tacos open now”. Voice search optimization for mobile. Visual search: upload food photo, find similar restaurants.

Machine Learning Enhancements: Review summarization using GPT-4 to generate TL;DR from reviews. Sentiment analysis to extract specific aspects (food, service, ambiance). Demand prediction to predict busy times using historical check-in data. Churn prediction to identify businesses at risk of losing customers.

Monetization Systems: Sponsored search results (clearly marked). Enhanced business profiles (more photos, videos). Retargeting campaigns based on user behavior. Request-a-Quote paid leads for service businesses.

Trust & Safety: Two-factor authentication for business owners. Blockchain-based review verification (experimental). Reputation management tools for businesses. Transparency reports showing filtered review counts.

Mobile Optimizations: Offline mode to cache recent searches and favorites. Progressive image loading with blur placeholder to low-res to high-res. Prefetching to predict user’s next action and preload data. Push notifications for new reviews on bookmarked businesses.

Monitoring & Observability:

Key Metrics: Search latency P50, P95, P99 response times. Cache hit rate targeting over 80% for business details. ML model performance: precision, recall, F1 for spam detection. Conversion rates: search to view to action (call, directions, reservation). System health: error rates, CPU, memory, disk I/O.

Tools: Prometheus plus Grafana for metrics visualization. Datadog or New Relic for distributed tracing. ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation. PagerDuty for on-call alerting.

SLIs/SLOs: Search availability: 99.9% uptime. Search latency: P95 under 200ms. Review submission success rate: 99.5%. Photo upload success rate: 99%.

Disaster Recovery:

Multi-region deployment: US-West, US-East, EU. Database backups: hourly incremental, daily full. Point-in-time recovery: restore to any point in last 30 days. Chaos engineering: regularly test failure scenarios (Chaos Monkey). Runbooks with documented procedures for common incidents.

Security Considerations:

API rate limiting: 1000 requests/hour per user, 100/hour per IP. DDoS protection: CloudFlare or AWS Shield. SQL injection prevention: parameterized queries, ORM. XSS prevention: Content Security Policy headers. Sensitive data encryption: at rest (AES-256) and in transit (TLS 1.3). Regular security audits and penetration testing.

Compliance:

GDPR: right to access, delete, and export data. CCPA: opt-out of data selling (Yelp doesn’t sell). COPPA: age verification (13+ requirement). PCI DSS: payment processing compliance (if handling payments).

Future Improvements:

AR features: point camera at storefront, see reviews overlay. Social features: follow friends, see their reviews in feed. Video reviews: short-form video content (TikTok-style). AI assistant: conversational search (“find me a romantic restaurant”). Predictive ordering: suggest dishes based on preferences.

Designing Yelp at scale requires balancing complex technical challenges with user experience and business needs. The key is starting with solid fundamentals (efficient search, reliable review system) and layering in advanced features (ML recommendations, real-time updates) as the platform grows. By focusing on data quality, spam prevention, and performance optimization, we can build a system that users trust and businesses rely on.

Congratulations on making it through this comprehensive system design! This architecture can handle hundreds of millions of users, billions of reviews, and petabytes of photos while maintaining sub-second response times and high availability.