Design Ticketmaster
Ticketmaster is an event ticketing platform that allows users to discover, browse, and purchase tickets for concerts, sports events, theater shows, and other live entertainment. The platform provides real-time seat selection, secure payment processing, and ticket delivery through multiple channels including email, PDF, and mobile wallet integration.
Designing a production-grade ticketing system like Ticketmaster presents unique challenges: preventing double-booking under extreme concurrency, handling massive traffic spikes during major on-sales, implementing fair virtual queuing for high-demand events, detecting and preventing bot activity and fraud, processing secure high-value transactions, and providing real-time seat availability updates with interactive seat maps.
Step 1: Understand the Problem and Establish Design Scope
Before diving into the design, it’s crucial to define the functional and non-functional requirements. For ticketing platforms, functional requirements focus on the user journey from discovery to ticket delivery, while non-functional requirements address the system’s ability to handle extreme load and maintain data consistency.
Functional Requirements
Core Requirements:
- Users should be able to search and browse events by location, artist, venue, date, and category.
- Users should be able to view real-time seat maps with availability and select specific seats.
- Users should be able to hold selected seats temporarily while completing their purchase.
- Users should be able to complete secure payment and receive tickets via email, PDF, or mobile wallet.
- For high-demand events, users should enter a virtual waiting room and be admitted fairly.
- Users should be able to transfer or resell tickets on a managed marketplace.
Below the Line (Out of Scope):
- Users should be able to set price alerts for events.
- Venue managers should be able to create and configure events with custom seating charts.
- Users should be able to purchase insurance for their tickets.
- Users should be able to upgrade their seats after purchase.
- Event organizers should be able to generate reports and analytics.
Non-Functional Requirements
Core Requirements:
- The system must guarantee strong consistency for seat inventory to prevent any double-booking.
- The system should handle massive traffic spikes (50 million concurrent users during major on-sales).
- The system should process ticket purchases with low latency (seat lock < 100ms p99, end-to-end purchase < 5s p95).
- The system should ensure high availability during critical on-sale periods (99.99% uptime).
- The system should implement effective bot detection and fraud prevention mechanisms.
Below the Line (Out of Scope):
- The system should comply with data privacy regulations (GDPR, CCPA).
- The system should support multi-currency transactions globally.
- The system should provide disaster recovery with RPO < 1 hour.
- The system should implement comprehensive audit logging for compliance.
Clarification Questions & Assumptions:
- Platform: Web and mobile apps for consumers, separate portal for venues.
- Scale: 500 million registered users, 10 million daily active users, peaks of 50 million concurrent users during major events.
- Events: 100,000 active events at any time, average venue capacity of 20,000 seats.
- Peak Throughput: 100,000 tickets sold per minute during major on-sales.
- Hold Duration: Seats held for 5-10 minutes during checkout process.
- Payment Processing: Third-party processors (Stripe, Braintree) used for PCI compliance.
Step 2: Propose High-Level Design and Get Buy-in
Planning the Approach
For ticketing systems, we’ll build the design sequentially through the user journey: event discovery, seat selection, purchase flow, and ticket delivery. We’ll pay special attention to inventory management and queuing, as these are the most critical and complex components.
Defining the Core Entities
To satisfy our key functional requirements, we’ll need the following entities:
Event: Represents a scheduled performance or game. Includes event name, venue, date and time, on-sale date, status (scheduled, on-sale, sold-out, cancelled), pricing configuration by section, and event metadata.
Venue: Physical location where events take place. Contains venue name, address, total capacity, seating chart configuration (sections, rows, seats), and visual seat map data for rendering.
Seat: Individual seat within a venue for a specific event. Includes event ID, section, row, seat number, price tier, base price, current price (dynamic pricing), status (available, held, sold, reserved), hold expiration timestamp, and order ID if sold.
Fare: Price estimate or quote for specific seats before purchase. Contains seat IDs, prices, fees, total amount, and expiration time.
Order: A completed purchase transaction. Records buyer information, event ID, seat IDs, payment details, total amount paid, order status (pending, completed, cancelled, refunded), and timestamps.
Ticket: The deliverable proof of purchase. Contains ticket ID, order ID, event details, seat information, QR code data, barcode, delivery method, and transfer history.
Queue: Virtual waiting room position for high-demand events. Includes user ID, event ID, queue position, entry timestamp, estimated wait time, and access token when admitted.
API Design
Event Search Endpoint: Allows users to search for events by various criteria including location, date, artist, and category.
GET /events?location={city}&date={date}&category={category} -> Event[]
Get Seat Map Endpoint: Retrieves the seat map configuration with real-time availability for an event.
GET /events/{eventId}/seatmap -> SeatMap
Response: {
eventId, eventName, seatMapConfig, availabilityBitmap, sectionPricing
}
Hold Seats Endpoint: Attempts to lock selected seats for a user during the checkout process. Returns success if seats were available and locked, or failure if already taken.
POST /seats/hold -> HoldResult
Body: {
eventId: string,
seatIds: string[]
}
Response: {
success: boolean,
holdId: string,
heldUntil: timestamp,
seats: Seat[]
}
Purchase Tickets Endpoint: Completes the purchase flow with payment processing. Confirms seat ownership and creates order.
POST /orders -> Order
Body: {
holdId: string,
paymentInfo: PaymentDetails
}
Join Queue Endpoint: Adds user to virtual waiting room for high-demand events.
POST /queue/{eventId}/join -> QueuePosition
Response: {
position: number,
queueLength: number,
estimatedWaitSeconds: number,
queueToken: string
}
Queue Status Endpoint: Checks user’s current position in queue or admission status.
GET /queue/{eventId}/status -> QueueStatus
Response: {
status: "queued" | "admitted",
position?: number,
accessToken?: string
}
List Ticket for Resale Endpoint: Allows ticket owners to list tickets on the resale marketplace with price validation.
POST /resale/listings -> ResaleListing
Body: {
ticketId: string,
askingPrice: number
}
High-Level Architecture
Let’s build up the system sequentially, addressing each functional requirement:
1. Users should be able to search and browse events
The core components for event discovery are:
- Client Applications: Web and mobile apps that provide the user interface for browsing events. Built with responsive design to support various screen sizes.
- API Gateway: Central entry point that handles authentication, rate limiting, request routing, and DDoS protection. Forwards requests to appropriate backend services.
- Event Service: Manages event catalog, search functionality, and event metadata. Stores event data in PostgreSQL for structured queries and Elasticsearch for fast full-text search. Implements caching in Redis for frequently accessed events.
- CDN: Content Delivery Network (CloudFlare, Akamai) caches static assets including images, seat map graphics, and API responses for read-heavy endpoints.
Event Search Flow:
- User enters search criteria (location, date, artist) in the client app, which sends a GET request to /events.
- API Gateway authenticates the request and applies rate limiting before forwarding to the Event Service.
- Event Service queries Elasticsearch for fast full-text search across event names, artists, and venues, applying filters for location and date.
- Results are enriched with availability status from the Inventory Service and returned to the client.
- Popular searches are cached in Redis with a TTL of 5 minutes to reduce database load.
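The caching step in this flow follows the cache-aside pattern. Below is a minimal in-memory sketch of that pattern; `TTLCache` is a stand-in for Redis (the real service would call GET/SETEX on a Redis client), and the function and parameter names are illustrative, not the actual service code:

```python
import time

class TTLCache:
    """Minimal in-memory stand-in for Redis GET/SETEX used in cache-aside."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

def search_events(query, cache, backend_search, ttl=300):
    """Cache-aside: serve popular searches from cache, fall back to the search index."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    results = backend_search(query)   # e.g. the Elasticsearch query
    cache.setex(query, ttl, results)  # 5-minute TTL, as in the flow above
    return results
```

A repeated query within the TTL is served entirely from cache, which is what shields the database during popular searches.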
2. Users should be able to view real-time seat maps and select seats
We extend the design to support interactive seat selection:
- Inventory Service: Central component managing real-time seat availability. Uses PostgreSQL for persistent storage with row-level locking to prevent race conditions. Maintains a Redis bitmap cache for fast availability checks (20,000 seats checked in under 1ms).
- WebSocket Service: Maintains persistent connections with clients to push real-time seat availability updates. When seats are sold or released, broadcasts updates to all connected clients viewing that event.
Seat Map Flow:
- User views an event and requests the seat map by calling /events/{eventId}/seatmap.
- Event Service retrieves venue configuration from the database, which includes SVG paths and coordinates for rendering the visual seat map.
- Inventory Service queries Redis for the availability bitmap, where each bit represents one seat (1 = available, 0 = unavailable).
- The response includes seat map geometry, availability data, and pricing information grouped by section.
- Client establishes WebSocket connection to receive real-time updates as other users purchase tickets.
- When seats change status, PostgreSQL triggers a notification that the WebSocket Service broadcasts to connected clients.
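The availability bitmap described above is straightforward to sketch: one bit per seat, so a 20,000-seat venue fits in 2.5 KB and any seat can be checked in O(1). This is an illustrative in-memory version (`SeatBitmap` is a hypothetical name); in production the same bit operations would run via Redis SETBIT/GETBIT:

```python
class SeatBitmap:
    """One bit per seat: 1 = available, 0 = unavailable (mirrors the Redis bitmap)."""
    def __init__(self, seat_count):
        self.bits = bytearray((seat_count + 7) // 8)  # 20,000 seats -> 2,500 bytes

    def set_available(self, seat_index, available):
        byte, bit = divmod(seat_index, 8)
        if available:
            self.bits[byte] |= (1 << bit)
        else:
            self.bits[byte] &= ~(1 << bit)

    def is_available(self, seat_index):
        byte, bit = divmod(seat_index, 8)
        return bool(self.bits[byte] & (1 << bit))
```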
3. Users should be able to hold seats temporarily during checkout
Critical component for preventing double-booking:
- Locking Mechanism: Uses pessimistic locking with PostgreSQL's SELECT FOR UPDATE NOWAIT to provide fail-fast semantics. When a user attempts to hold seats, the database locks those specific rows.
- Hold Expiration: A background cleanup job runs every 30 seconds to release seats whose hold TTL has expired. This ensures seats don't remain locked if users abandon checkout.
Seat Hold Flow:
- User selects seats in the UI and clicks "Continue to Checkout", sending a POST request to /seats/hold.
- API Gateway forwards the request to the Inventory Service.
- Inventory Service begins a database transaction and attempts to acquire row-level locks on the specified seats with SELECT FOR UPDATE NOWAIT.
- If any seat is already locked or sold, the transaction fails immediately and returns an error to the user.
- If all seats are available, the service updates their status to "held", sets the held_by field to the user ID, and sets held_until to the current time plus 10 minutes.
- A hold record is created for tracking, and the Redis availability cache is invalidated for this event.
- The service returns a holdId and expiration timestamp to the client, which displays a countdown timer.
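The flow above can be sketched in memory to show the key property: the hold is all-or-nothing and fails fast, mirroring the NOWAIT semantics. This is an illustrative simulation, not the real Inventory Service; `SeatInventory` and `HoldConflict` are hypothetical names, and in production both passes below run inside one PostgreSQL transaction:

```python
import time
import uuid

class HoldConflict(Exception):
    """Raised when any requested seat is already held or sold (NOWAIT semantics)."""

class SeatInventory:
    """In-memory sketch of the all-or-nothing, fail-fast seat hold."""
    def __init__(self, seat_ids):
        # status is one of "available", "held", "sold"
        self.seats = {s: {"status": "available", "held_by": None, "held_until": None}
                      for s in seat_ids}

    def hold_seats(self, seat_ids, user_id, ttl_seconds=600):
        now = time.time()
        # First pass: fail fast if any seat is unavailable (no partial holds).
        for s in seat_ids:
            if self.seats[s]["status"] != "available":
                raise HoldConflict(f"seat {s} is not available")
        # Second pass: mark every seat held for this user with a 10-minute TTL.
        hold_id = str(uuid.uuid4())
        for s in seat_ids:
            self.seats[s].update(status="held", held_by=user_id,
                                 held_until=now + ttl_seconds)
        return hold_id
```

Note that a conflicting request leaves the untouched seats available: there is never a partially completed hold.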
4. Users should be able to complete payment and receive tickets
We add payment processing and ticket generation:
- Booking Service: Orchestrates the end-to-end purchase workflow. Implements a state machine pattern to handle the multi-step process: validate hold, confirm payment, finalize order, generate tickets.
- Payment Service: Integrates with third-party payment processors (Stripe, Braintree) for PCI-compliant payment processing. Implements idempotency to prevent duplicate charges. Handles refunds and chargebacks.
- Fraud Service: Real-time fraud detection using machine learning models. Analyzes device fingerprints, purchase patterns, payment history, and velocity limits. Returns risk score and recommended action (approve, challenge, block).
- Ticket Delivery Service: Generates tickets with QR codes and barcodes after successful payment. Creates PDF tickets, sends email confirmations, and generates mobile wallet passes (Apple Wallet, Google Pay).
- Notification Service: Sends transactional emails and push notifications for order confirmations, ticket delivery, and important updates.
Purchase Flow:
- User completes the payment form and submits, sending a POST request to /orders with the holdId and payment information.
- Booking Service validates that the hold is still active and hasn't expired.
- Fraud Service analyzes the transaction and returns a risk score. If high risk, additional verification (CAPTCHA, 2FA) may be required.
- If fraud check passes, Booking Service calls Payment Service to process the transaction. Payment Service charges the card using the external payment gateway.
- Upon successful payment, Booking Service updates seat status from “held” to “sold” in a database transaction, creates an order record, and stores payment confirmation.
- Ticket Delivery Service generates unique tickets with encrypted QR codes and rotating barcodes (changes every 30 seconds for anti-fraud).
- Tickets are delivered via email (PDF attachment) and available for download as mobile wallet passes.
- Notification Service sends confirmation email with ticket details and event information.
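The state machine pattern mentioned for the Booking Service can be reduced to a transition table over the order statuses listed earlier (pending, completed, cancelled, refunded). The sketch below is illustrative; the names `ORDER_TRANSITIONS` and `advance_order` are assumptions, not the service's actual API:

```python
# Allowed transitions in the order state machine the Booking Service enforces.
ORDER_TRANSITIONS = {
    "pending":   {"completed", "cancelled"},
    "completed": {"refunded"},
    "cancelled": set(),
    "refunded":  set(),
}

def advance_order(current_status, new_status):
    """Move an order to a new status, rejecting illegal transitions."""
    if new_status not in ORDER_TRANSITIONS.get(current_status, set()):
        raise ValueError(f"illegal transition {current_status} -> {new_status}")
    return new_status
```

Centralizing the table makes illegal jumps (e.g. refunding an order that never completed) impossible to express anywhere in the workflow.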
5. For high-demand events, users should enter a virtual waiting room
Critical for managing extreme traffic spikes:
- Queue Service: Implements virtual waiting room using Redis sorted sets. When users attempt to access a high-demand event, they’re placed in a FIFO queue with their entry timestamp as the score. Controls admission rate to prevent overwhelming the Inventory Service.
- Queue Admission Worker: Background process that continuously admits users from the queue at a controlled rate (e.g., 500 users per second). Generates access tokens for admitted users with a TTL of 30 minutes.
Virtual Queue Flow:
- When major event tickets go on sale, Queue Service is activated for that event.
- User attempts to access event page, but instead of showing seat map, they’re redirected to virtual waiting room.
- Client sends a POST request to /queue/{eventId}/join, and Queue Service adds them to a Redis sorted set with a timestamp score.
- User receives their queue position and estimated wait time. Client polls /queue/{eventId}/status every 5 seconds to get updates.
- Queue Admission Worker runs continuously, checking the active user count. If below the capacity limit (e.g., 10,000 concurrent users), it pops the next users from the queue.
- For each admitted user, worker generates an access token stored in Redis with 30-minute TTL and sends notification via WebSocket.
- Client receives admission notification and is redirected to seat selection page with access token.
- All subsequent requests include access token, which Inventory Service validates before allowing seat holds.
- If user doesn’t complete purchase within 30 minutes, their token expires and they must rejoin the queue.
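The queue and admission logic above can be sketched with a min-heap standing in for the Redis sorted set (entry time plus jitter as the score). This is an illustrative simulation under that assumption; `VirtualQueue` and the helper names are hypothetical:

```python
import heapq
import random
import time

class VirtualQueue:
    """In-memory sketch of the Redis sorted-set waiting room."""
    def __init__(self):
        self._heap = []       # (score, user_id); score = entry time + jitter
        self._entered = set()

    def join(self, user_id):
        if user_id in self._entered:
            return  # joining twice doesn't improve your position
        # Small random jitter (0-1 ms) breaks ties among same-millisecond arrivals.
        score = time.time() + random.random() / 1000.0
        heapq.heappush(self._heap, (score, user_id))
        self._entered.add(user_id)

    def admit(self, capacity, active_count):
        """Admit up to (capacity - active_count) users, lowest score first."""
        slots = max(0, capacity - active_count)
        admitted = []
        while slots > 0 and self._heap:
            _, user_id = heapq.heappop(self._heap)
            admitted.append(user_id)
            slots -= 1
        return admitted

def estimated_wait_seconds(position, admission_rate_per_sec):
    """The wait estimate returned by the status endpoint."""
    return position / admission_rate_per_sec
```

In production each admitted user would also get an access token written to Redis with a 30-minute TTL, which is omitted here.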
6. Users should be able to transfer or resell tickets
Final component for secondary market:
- Resale Marketplace Service: Manages ticket listings, price validation, and transfers. Enforces anti-scalping rules (e.g., maximum 120% of face value). Tracks ticket ownership and transfer history. Calculates commission fees on resale transactions.
Resale Flow:
- Ticket owner lists a ticket by calling /resale/listings with the ticketId and asking price.
- Resale Service validates ownership by checking order records and verifies the ticket status is "active".
- Price validation ensures asking price doesn’t exceed configured maximum (e.g., 120% of base price) to prevent scalping.
- If valid, service creates listing record, marks original ticket as “listed_for_resale”, and makes it visible in marketplace search results.
- Buyer purchases resale ticket through standard payment flow with Payment Service.
- Upon successful payment, Resale Service updates original ticket status to “transferred”, records transfer history, and generates new ticket for buyer with unique QR code.
- Seller receives payout minus commission (e.g., 15% platform fee), processed through Payment Service.
- Both buyer and seller receive confirmation notifications with transfer details.
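The two pieces of arithmetic in this flow, the anti-scalping price cap and the commission payout, are simple enough to show directly. The 120% cap and 15% fee are the example figures from the text; the function names are illustrative:

```python
def validate_asking_price(asking_price, face_value, max_markup=1.20):
    """Anti-scalping check: asking price may not exceed 120% of face value."""
    return asking_price <= face_value * max_markup

def seller_payout(sale_price, commission_rate=0.15):
    """Seller receives the sale price minus the platform commission."""
    return round(sale_price * (1 - commission_rate), 2)
```

So a $100 face-value ticket can be listed for at most $120, and a $100 sale pays the seller $85 after the 15% fee.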
Step 3: Design Deep Dive
With core functionality established, let’s address the critical architectural challenges and non-functional requirements that make this system production-ready at scale.
Deep Dive 1: How do we prevent double-booking under extreme concurrency?
The absolute highest priority requirement is preventing double-booking. A single instance of selling the same seat twice erodes customer trust and creates legal liability. Under normal load, row-level database locks are sufficient, but during peak on-sales with millions of concurrent requests, we need multiple layers of protection.
Challenge: When 50,000 users simultaneously try to purchase the same seat during a Taylor Swift concert on-sale, simple application-level checks create race conditions. Between reading availability and writing the purchase, another transaction could complete. Even with database transactions, without proper isolation levels, phantom reads can occur.
Solution: Multi-Layer Consistency Strategy
Layer 1: PostgreSQL Row-Level Locking
The foundation is PostgreSQL’s row-level locking with SELECT FOR UPDATE NOWAIT. When the Inventory Service attempts to hold seats, it locks those specific rows. The NOWAIT option provides fail-fast behavior—if the seat is already locked by another transaction, it immediately returns an error rather than blocking. This prevents cascading delays and provides instant feedback to users.
The seat inventory table uses a composite unique constraint on (event_id, section, row, seat_number) to enforce uniqueness at the database level. Each seat row includes a version number that increments on every update, enabling optimistic concurrency detection if needed.
Layer 2: Database Partitioning by Event
Major events get dedicated table partitions. For example, Taylor Swift's concert would have its own partition. This reduces lock contention by isolating high-traffic events from normal events. Partition pruning also improves query performance, since the database only scans relevant partitions.
Layer 3: Redis Cache Invalidation
While PostgreSQL provides strong consistency, querying it for every seat availability check would create unacceptable latency. We maintain a Redis bitmap cache where each bit represents one seat. Before attempting to hold seats, we do a fast check against this bitmap. If it shows unavailable, we immediately reject without hitting the database.
The critical detail: cache invalidation happens synchronously within the same database transaction that updates seat status. After updating PostgreSQL, but before committing, we delete the Redis cache key. This ensures cache never shows a seat as available after it’s sold. We accept the cache occasionally showing unavailable seats that are actually available (conservative approach), but never the reverse.
Layer 4: Background Hold Cleanup
A background worker runs every 30 seconds to release expired holds. It finds seats where status is “held” and held_until timestamp is in the past, atomically updating them back to “available”. This handles cases where users abandon checkout or the application crashes before releasing the hold. Cleanup happens in small batches to avoid creating lock contention.
Layer 5: Distributed Transaction Idempotency
The Booking Service generates a unique idempotency key for each purchase attempt (derived from the holdId). Payment Service and Inventory Service check this key before processing. If the same request arrives twice (network retry, user double-click), the second request returns the cached result instead of attempting a duplicate purchase.
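The idempotency check in Layer 5 can be sketched as a decorator over any side-effecting call. This is an illustrative pattern, not the services' actual code; in production the `cache` would be a shared store (e.g. Redis) rather than a local dict:

```python
import functools

def idempotent(cache):
    """Return the cached result for a repeated idempotency key instead of re-executing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(idempotency_key, *args, **kwargs):
            if idempotency_key in cache:
                return cache[idempotency_key]    # duplicate request: no re-execution
            result = fn(idempotency_key, *args, **kwargs)
            cache[idempotency_key] = result
            return result
        return wrapper
    return decorator
```

Wrapping a charge function this way means a network retry with the same key returns the original result without charging the card twice.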
Why Not Optimistic Locking?
Optimistic locking (version checks without row locks) works well for low-contention scenarios but fails under heavy load. When thousands of users target the same seats, optimistic transactions would retry repeatedly, causing thundering herd effects and poor user experience. Pessimistic locking with NOWAIT provides deterministic behavior: exactly one transaction succeeds, others fail immediately with clear feedback.
Deep Dive 2: How do we handle 50 million concurrent users during major on-sales?
When Taylor Swift announces a tour, millions of fans flood the system simultaneously. Normal capacity might handle 100,000 concurrent users, but on-sale events create 500x spikes. We can’t simply scale every component 500x for 1% of the year—the cost would be prohibitive.
Challenge: Uncontrolled access overwhelms every system component. API servers hit CPU limits. Database connection pools exhaust. Payment providers throttle us. Seat lock operations time out. The system degrades completely, and no one gets tickets despite available inventory.
Solution: Virtual Waiting Room with Controlled Admission
Queue Architecture: The Queue Service acts as a protective buffer before the Inventory Service. It uses Redis sorted sets to maintain queue order. When a user accesses a high-demand event, they’re added to the queue with their entry timestamp as the score. Sorted sets provide O(log N) insertion and O(log N) retrieval, efficiently handling millions of queue positions.
Fair FIFO Ordering: The timestamp score ensures first-come, first-served fairness. To prevent ties (many users arriving in the same millisecond), we add small random jitter (0-1ms) to each score. This randomizes tie-breaking while maintaining overall fairness. Alternatively, we could use sequential counter values for perfect fairness, but timestamps are simpler and “fair enough” for this use case.
Controlled Admission Rate: A background worker continuously monitors the queue and active user count. The target is to maintain exactly 10,000 concurrent active users in the seat selection flow. This number is calibrated based on system capacity: database connection limits, API server CPU, payment processing throughput, and network bandwidth.
Every second, the worker calculates available slots (10,000 - current_active_count) and admits the next batch of users from the queue. For example, if 8,000 users are active, it admits 2,000 more. Each admitted user receives an access token with a 30-minute TTL stored in Redis.
Token Validation: All requests to Inventory Service require a valid access token. The service checks Redis for token existence and validity before processing seat holds. Invalid or expired tokens are rejected immediately. This prevents queue jumping and ensures only admitted users can purchase tickets.
Queue Position Updates: Clients poll the queue status endpoint every 5 seconds to update the user's position. The service returns the current position, total queue length, and estimated wait time (calculated as position / admission_rate). If a user's position is 500,000 and the admission rate is 500 users/second, the estimated wait is 1,000 seconds (roughly 17 minutes).
Graceful Degradation: If Redis fails, the Queue Service falls back to probabilistic admission: randomly admit X% of requests based on current load. This maintains system protection while degrading the fairness guarantee. Health checks alert operators immediately to restore full queue functionality.
WebSocket for Real-Time Updates: Instead of polling, clients can establish WebSocket connections for real-time position updates and admission notifications. The Queue Service publishes events to a Kafka topic, which WebSocket Service consumes and pushes to connected clients. This reduces API load and provides better user experience.
Deep Dive 3: How do we implement dynamic pricing based on demand?
Ticketmaster uses dynamic pricing to maximize revenue while maintaining fairness. Prices adjust in real-time based on demand signals: sale velocity, remaining inventory, search volume, and queue length.
Pricing Algorithm Components:
Demand Signal Collection: The Pricing Service continuously monitors multiple indicators. Sale velocity measures tickets sold per minute in each section over the last 5 minutes. Inventory percentage tracks remaining available seats in each section. Search volume counts how many users have viewed seats in each section. Queue length indicates overall demand pressure. Time to event affects urgency (event in 2 days with high inventory triggers discounts).
Price Multiplier Calculation: Each demand signal contributes to a price multiplier. The base multiplier starts at 1.0 (no adjustment). High sale velocity (>50 tickets/min) increases the multiplier up to 1.5x. Low inventory (<20% remaining) adds a scarcity premium up to 2.5x. High search volume indicates interest and can add 0.3x. Long queues (>10,000 users) suggest high demand and add up to 0.4x. Near event dates with high inventory trigger discounts (0.85x minimum).
Bounded Price Adjustments: To prevent extreme swings, multipliers are bounded between 0.8x (20% discount) and 3.0x (200% premium). Final prices are rounded to the nearest $0.50 for psychological pricing. Prices never change more than 10% in a single update to avoid shocking users mid-checkout.
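The signal combination and bounding described above can be sketched as follows. The exact weights here are illustrative assumptions (only the thresholds, the 0.8x-3.0x bounds, the $0.50 rounding, and the 10% per-update cap come from the text):

```python
def price_multiplier(sale_velocity, inventory_pct, queue_length):
    """Combine demand signals into a bounded multiplier (weights are illustrative)."""
    m = 1.0
    if sale_velocity > 50:                    # hot section: >50 tickets/min
        m += min(0.5, (sale_velocity - 50) / 200)
    if inventory_pct < 0.20:                  # scarcity premium as inventory drops
        m += (0.20 - inventory_pct) * 7.5
    if queue_length > 10_000:                 # long queue signals demand pressure
        m += min(0.4, queue_length / 1_000_000)
    return max(0.8, min(3.0, m))              # bounded: 20% discount .. 200% premium

def apply_price(base_price, current_price, multiplier):
    """Round to the nearest $0.50 and cap movement at 10% per update."""
    target = round(base_price * multiplier * 2) / 2
    cap = current_price * 0.10
    return max(current_price - cap, min(current_price + cap, target))
```

The per-update cap means a seat priced at $100 moves to at most $110 this cycle even if the demand multiplier would justify $300; subsequent updates walk it up gradually.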
Update Frequency: A background job recalculates prices for all sections every 60 seconds during active sales. For each section, it calculates demand signals, derives the multiplier, and updates all available seats in that section with a single bulk UPDATE statement. The job runs continuously for events on active sale and pauses when events sell out or aren’t actively selling.
User Experience Considerations: Once a user receives a fare estimate, that price is locked for their session (10 minutes). This prevents the frustrating experience of prices changing mid-checkout. The holdId links to the quoted price, which Booking Service honors even if current prices are higher. This creates trust while still capturing dynamic pricing benefits across the broader user base.
A/B Testing Framework: The Pricing Service includes a feature flag system for testing pricing strategies. Different user cohorts see different algorithms. Conversion rates, revenue per seat, and user satisfaction scores are tracked. Winning strategies are gradually rolled out to larger percentages of traffic.
Deep Dive 4: How do we detect and prevent bot activity and fraud?
Bots and scalpers are sophisticated adversaries. Simple CAPTCHAs are defeated by solving services. We need multi-layered defense across the entire purchase flow.
Device Fingerprinting: The Fraud Service generates a fingerprint from browser characteristics: user agent, screen resolution, timezone, installed fonts, WebGL capabilities, canvas rendering signature, audio context properties, and available sensors. This creates a unique identifier for each device without cookies or storage.
The system tracks how many accounts use the same fingerprint. Normal users have one account per device. Fraudsters running bot farms often use the same fingerprint across many accounts. If a fingerprint is associated with more than 5 accounts, all transactions from that device receive elevated scrutiny.
Behavioral Analysis: Legitimate users browse events, view seat maps, and spend time considering options before purchasing. Bots navigate directly to seat selection and complete checkout in seconds. The Fraud Service tracks interaction patterns: time on page, mouse movements, scroll behavior, and touch events. Purchases completed in less than 2 seconds from page load are flagged as suspicious.
Velocity Limits: The service tracks purchase attempts per user per event within a rolling 1-hour window. More than 10 attempts suggest automated activity. Similarly, it tracks failed payment attempts per IP address. More than 3 failed cards from the same IP within an hour indicates card testing (trying stolen cards to find valid ones).
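A rolling-window velocity limit like the one described can be sketched with a timestamp deque per key (user-event pair or IP address). This is an illustrative in-memory version; `VelocityLimiter` is a hypothetical name, and a production version would keep the windows in Redis so all instances share them:

```python
import time
from collections import defaultdict, deque

class VelocityLimiter:
    """Rolling-window counter, e.g. max 10 purchase attempts per user/event per hour."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps of recent attempts

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        q = self.events[key]
        while q and q[0] <= now - self.window:  # drop attempts outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False                        # over the limit: flag as automated
        q.append(now)
        return True
```

The same structure serves both checks in the text: attempts per user per event (limit 10) and failed cards per IP (limit 3), just with different keys and limits.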
Payment Analysis: The service maintains a database of stolen card BIN numbers (first 6 digits) and last 4 digits from chargeback reports. New purchases are checked against this database. Additionally, it looks for suspicious patterns: mismatched billing address and card location, high-value first purchase on new account, and rapid successive purchases across multiple events.
Machine Learning Risk Scoring: All the signals feed into a machine learning model (trained on historical fraud cases) that outputs a risk score from 0 to 1. The model considers device fingerprint, behavioral signals, velocity limits, payment indicators, account age, purchase history, and interaction patterns.
Risk-Based Actions: Low risk (score < 0.3): Approve automatically. Medium risk (0.3-0.6): Send to manual review queue. Medium-high risk (0.6-0.85): Challenge with additional verification (CAPTCHA, SMS confirmation, 3D Secure authentication). High risk (> 0.85): Block transaction and notify security team.
Continuous Learning: Fraud patterns evolve constantly. The system implements a feedback loop where manual review decisions and chargeback outcomes retrain the model. Fraudsters adapt quickly, so the model retrains daily with new data. A/B testing ensures new model versions improve detection without increasing false positives.
Deep Dive 5: How do we ensure no purchase requests are lost during system failures?
The Booking Service orchestrates a complex multi-step workflow: validate hold, assess fraud, process payment, update inventory, generate tickets, and send notifications. If the service crashes mid-workflow, we must ensure the purchase completes correctly.
Challenge: Simple request-response patterns fail for long-running, multi-step workflows. If Payment Service charges the card successfully but Booking Service crashes before confirming the seat, the user is charged without receiving tickets. Manual reconciliation is expensive and erodes trust.
Solution: Durable Workflow Orchestration
Message Queue for Purchase Requests: When a user submits a purchase, instead of processing synchronously, the Booking Service enqueues the request to a durable message queue (Kafka, AWS SQS, or RabbitMQ). The queue provides durability guarantees: messages are persisted to disk and replicated across brokers. If the Booking Service crashes, messages remain in the queue for another instance to process.
Consumer Groups and Partitioning: Multiple Booking Service instances consume from the queue as a consumer group. Each message (purchase request) is processed by exactly one consumer. Kafka partitions enable parallel processing: requests for different events can be processed concurrently, while requests for the same event are processed sequentially to avoid contention.
Saga Pattern for Distributed Transactions: The purchase workflow is implemented as a saga: a sequence of local transactions with compensating actions for rollback. Steps include: validate hold, assess fraud risk, charge payment, confirm booking, generate tickets, send notifications. Each step is a separate transaction that can succeed or fail independently.
If any step fails after payment succeeded, the saga executes compensating transactions: refund payment, release seat hold, and notify user of failure. The saga state is persisted at each step, so if the service crashes, the next instance continues from the last checkpoint.
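The saga skeleton above, local transactions paired with compensating actions, run in order and undone in reverse on failure, can be sketched generically. This is an illustrative orchestrator, far simpler than a real one (no persistence or retries); `run_saga` and `SagaFailed` are hypothetical names:

```python
class SagaFailed(Exception):
    """Raised after all completed steps have been compensated."""

def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, undo in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception as exc:
            for undo in reversed(completed):  # compensating transactions
                undo()
            raise SagaFailed(str(exc)) from exc
```

For the purchase workflow, a failure at "confirm booking" after "charge payment" succeeded would trigger the refund compensation, leaving no charged-but-ticketless user. A production version would also persist the saga state after each step so a restarted worker can resume mid-saga.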
Workflow Orchestration with Temporal: For complex workflows, we use Temporal (or AWS Step Functions, Uber’s Cadence). Temporal provides durable execution: workflow code is persisted at each step. If a worker crashes, another worker resumes from the last checkpoint. Timeouts are handled declaratively: if payment takes more than 30 seconds, the workflow automatically retries or compensates.
Activities (individual steps) are idempotent: calling them multiple times with the same input produces the same result. This is critical for retry safety. For example, if “charge payment” activity is called twice with the same idempotency key, Payment Service charges once and returns the cached result for the duplicate request.
Dead Letter Queue for Unrecoverable Failures: After maximum retries (e.g., 3 attempts), failed purchases are moved to a Dead Letter Queue (DLQ) for manual investigation. Operations team monitors DLQ depth and investigates failures: Was it a transient error (retry manually)? A bug (fix and redeploy)? An invalid request (notify user)? Persistent DLQ depth triggers alerts.
Idempotency Keys: Every purchase request includes an idempotency key (derived from holdId or generated by client). All downstream services (Payment, Inventory, Ticket Delivery) check this key before processing. If they’ve already processed this key, they return the cached result instead of executing again. Idempotency ensures network retries and workflow retries don’t create duplicate charges or tickets.
Deep Dive 6: How do we scale geographically for global events?
Ticketmaster operates globally with regional performance requirements. Users in Europe shouldn’t wait for requests to round-trip to US data centers.
Multi-Region Deployment: The system deploys in multiple geographic regions (US East, US West, Europe, Asia Pacific). Each region has a complete stack: API Gateway, application services, databases, and caches. Users are routed to the nearest region via DNS-based geographic routing (e.g., Route 53 geolocation routing or Cloudflare load balancing with geo-steering).
Regional Database Strategy: Event and venue data is replicated across all regions for low-latency reads. PostgreSQL streaming replication or AWS Aurora Global Database provides near-real-time replication with <1 second lag. Reads are served from regional replicas while writes go to the primary region.
For seat inventory, consistency is critical. We use a single authoritative region per event (typically where the venue is located). All seat hold and purchase requests are routed to that region’s Inventory Service. This sacrifices some latency for absolute consistency. For a London concert, European users have low latency, while US users accept higher latency for this critical operation.
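The routing split above can be made concrete with a small sketch. The region names and event-to-region mapping are illustrative; in practice the home region would be stored with the event record:

```python
# Routing sketch: each event has one authoritative home region for inventory
# writes, while browsing reads go to the caller's nearest regional replica.
EVENT_HOME_REGION = {                      # illustrative mapping
    "event-london-o2": "eu-west",
    "event-nyc-msg": "us-east",
}

def route_inventory_write(event_id: str) -> str:
    """Seat holds and purchases always go to the event's home region."""
    return EVENT_HOME_REGION[event_id]

def route_read(user_region: str) -> str:
    """Event browsing is served from the user's nearest regional replica."""
    return user_region

# A US user browsing a London show reads locally but writes to eu-west.
assert route_read("us-east") == "us-east"
assert route_inventory_write("event-london-o2") == "eu-west"
```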
CDN for Static Assets: Seat maps, event images, and venue photos are stored in object storage (S3) and served through a global CDN (CloudFront, Cloudflare). Cache headers are tuned per asset type (24-hour TTLs for images, 5 minutes for frequently updated data). 99% of static requests are served from edge locations within 50ms of users.
Edge Caching for Read APIs: Event search and seat map retrieval are cacheable. API Gateway implements edge caching with short TTLs (30-60 seconds). Most requests hit cache at edge locations without reaching origin services. Cache invalidation happens via event-driven purges when data changes.
Write Path Remains Centralized: Creating orders, holding seats, and processing payments remain centralized in the event’s authoritative region. This is acceptable because these are infrequent operations (once per user per purchase) compared to browsing (many events viewed per purchase). Optimizing read latency has a larger impact on user experience than write latency.
Step 4: Wrap Up
In this design, we proposed a comprehensive architecture for an event ticketing platform like Ticketmaster capable of handling extreme scale, preventing double-booking through multi-layered consistency mechanisms, managing massive traffic spikes with virtual queuing, and providing robust fraud prevention.
Key Architectural Decisions:
Strong Consistency for Seat Inventory: We chose PostgreSQL with row-level pessimistic locking over eventually consistent NoSQL databases. While this limits throughput compared to optimistic concurrency, it provides absolute guarantee against double-booking. For high-value, low-frequency transactions like ticket purchases, correctness trumps maximum throughput. Partitioning by event and strategic caching mitigate the performance trade-off.
Redis-Based Virtual Queue: Sorted sets provide efficient FIFO queuing with O(log N) operations, handling millions of queue positions. Token-based admission with controlled rate prevents system overload during traffic spikes. If Redis fails, falling back to probabilistic admission maintains system protection while degrading fairness guarantees.
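The sorted-set queue summarized above can be sketched with a dict standing in for Redis (ZADD/ZRANK/ZPOPMIN). Scores are arrival order, which gives FIFO semantics; note the real Redis operations are O(log N), while this in-memory stand-in is O(N) and single-process:

```python
# In-memory sketch of a sorted-set virtual queue. In Redis: ZADD NX to join,
# ZRANK for position, ZPOPMIN to admit the longest-waiting users.
import itertools

_counter = itertools.count()  # monotonic stand-in for an arrival timestamp
queue = {}                    # member -> score (the sorted set)

def enqueue(user_id: str) -> int:
    if user_id not in queue:               # ZADD NX: rejoining keeps your spot
        queue[user_id] = next(_counter)
    return position(user_id)

def position(user_id: str) -> int:
    """ZRANK: number of users ahead of you."""
    return sum(1 for s in queue.values() if s < queue[user_id])

def admit(n: int) -> list:
    """ZPOPMIN n: admit the n longest-waiting users at the controlled rate."""
    admitted = sorted(queue, key=queue.get)[:n]
    for u in admitted:
        del queue[u]
    return admitted

enqueue("alice"); enqueue("bob"); enqueue("carol")
assert position("bob") == 1
assert admit(2) == ["alice", "bob"]
```

Admitting in fixed-size batches at a fixed interval is what throttles load on the booking path during an on-sale spike.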
Multi-Layer Fraud Prevention: No single signal reliably detects fraud. Device fingerprinting catches bot farms, behavioral analysis catches automated tools, velocity limits catch brute force attempts, and payment analysis catches stolen cards. Machine learning combines all signals for holistic risk scoring. Continuous learning adapts to evolving fraud patterns.
Durable Workflow Orchestration: The saga pattern with compensating transactions handles the complex purchase workflow reliably. Message queues provide durability across service restarts. Temporal provides durable, effectively-once workflow execution: activities may run more than once, but idempotency keys make retries safe and prevent duplicate processing. Together, these ensure zero dropped purchases even during failures.
Dynamic Pricing for Revenue Optimization: Real-time demand signals (velocity, scarcity, search volume) adjust prices within configured bounds. Price locking during checkout maintains user trust while capturing pricing power across the user base. A/B testing framework continuously improves pricing strategies.
Geographic Distribution: Multi-region deployment reduces latency for global users. Regional read replicas serve event data locally. Seat inventory remains centralized per event for consistency. CDN and edge caching optimize the read-heavy browsing experience.
Additional Discussion Topics:
Scaling Considerations:
- Horizontal scaling: All application services are stateless, enabling auto-scaling based on traffic. During major on-sales, we can rapidly scale from 100 to 10,000 instances.
- Database scaling: Read replicas handle read-heavy operations. Partitioning by event isolates high-traffic events. Connection pooling (PgBouncer) maximizes database connection efficiency.
- Cache scaling: Redis Cluster provides horizontal scaling for cache and queue data across 100+ nodes.
- Message queue scaling: Kafka partitioning enables parallel processing. Consumer group rebalancing handles instance failures and scaling events.
Monitoring and Observability:
- Critical metrics: Queue length and wait time per event, seat lock success rate (target >99%), payment success rate (target >98%), P99 latency for seat locks (<100ms), end-to-end purchase time P95 (<5s), double-booking incidents (target: zero).
- Distributed tracing: Track individual purchase requests across all services to identify bottlenecks. OpenTelemetry or AWS X-Ray provides request-level visibility.
- Real-time dashboards: Operations team monitors live metrics during major on-sales. Automated alerts trigger for anomalies.
- SLIs/SLOs: 99.99% availability during on-sales, 99.9% otherwise. Seat lock latency P99 <100ms. Zero double-bookings (hard requirement).
Security Considerations:
- Encryption: TLS 1.3 for all data in transit. AES-256 encryption for sensitive data at rest (payment tokens, personal information).
- Authentication: Short-lived JWTs (1-hour expiration) with refresh token rotation. OAuth 2.0 for third-party integrations.
- Payment security: PCI-DSS Level 1 compliance via third-party processors. Never store raw card data. Payment tokens are encrypted and scoped to single use.
- Rate limiting: Per-user and per-IP limits prevent abuse. Distributed rate limiting with Redis maintains global limits across all API Gateway instances.
- DDoS protection: Cloudflare or AWS Shield provide Layer 3/4/7 DDoS protection. API Gateway implements application-layer rate limiting and request validation.
Future Enhancements:
- Blockchain-based NFT tickets: Immutable ownership records, programmable resale royalties, and cryptographic proof of authenticity. Could eliminate ticket counterfeiting entirely.
- AI-powered personalization: Recommend events based on purchase history, location preferences, and music taste. Predict which price points convert best for each user segment.
- Social features: Group purchases with shared payment, seat recommendations near friends, social event discovery based on friends’ attendance.
- Virtual venue tours: VR/AR seat previews allowing users to see the view from their seats before purchasing. Increases confidence in higher-priced seats.
- Carbon-neutral ticketing: Calculate and offset carbon emissions from events, integrate with venue sustainability programs, badge for eco-friendly events.
- Advanced analytics: Real-time demand forecasting predicts sellout times, optimizes on-sale strategies, and adjusts marketing spend dynamically.
This architecture successfully addresses Ticketmaster’s core challenges: maintaining absolute consistency to prevent double-booking, handling extreme traffic spikes through controlled queuing, detecting sophisticated fraud across multiple signals, and providing excellent user experience through real-time updates and fast response times. The system is proven at scale, handling events from small local shows to global phenomena like Taylor Swift tours where ticket demand exceeds supply by 100x.
Summary
This comprehensive guide covered the design of an event ticketing platform like Ticketmaster, including:
- Core Functionality: Event discovery, real-time seat selection with visual maps, temporary seat holds, secure payment processing, ticket delivery, virtual queuing for high-demand events, and resale marketplace.
- Key Challenges: Preventing double-booking under extreme concurrency, handling massive traffic spikes (50M concurrent users), detecting sophisticated bot and fraud activity, maintaining strong consistency for inventory, and ensuring durable execution of complex purchase workflows.
- Solutions: PostgreSQL pessimistic locking with partitioning, Redis-based virtual queue with controlled admission, multi-layer fraud detection with ML, saga pattern with compensating transactions, dynamic pricing based on demand signals, and multi-region deployment for global scale.
- Scalability: Horizontal scaling for stateless services, database partitioning and read replicas, Redis clustering for cache and queue, Kafka partitioning for message processing, and CDN/edge caching for read-heavy operations.
The design demonstrates how to build a high-consistency, high-availability system for low-frequency but high-value transactions, with sophisticated protection against concurrency issues, fraud, and system failures while maintaining excellent user experience under extreme load conditions.