Design TaskRabbit
TaskRabbit is a marketplace platform connecting people who need help with everyday tasks to skilled and vetted Taskers in their local area. At scale, this involves sophisticated matching algorithms, real-time availability management, secure payment processing, and comprehensive trust and safety systems.
Designing TaskRabbit presents unique challenges including bidirectional marketplace matching, calendar synchronization, escrow payment systems, background verification workflows, and maintaining service quality across diverse task categories.
Step 1: Understand the Problem and Establish Design Scope
Before diving into the design, it’s crucial to define the functional and non-functional requirements. For user-facing applications like this, functional requirements are the “Users should be able to…” statements, whereas non-functional requirements define system qualities via “The system should…” statements.
Functional Requirements
Core Requirements:
- Clients should be able to post tasks with description, category, location, budget, and preferred time.
- Clients should be able to browse and search for Taskers by skill, rating, availability, and location.
- The system should match tasks to qualified Taskers based on skills, location, availability, and other factors.
- Clients should be able to book Taskers either instantly or through a bidding system where Taskers submit proposals.
- The system should handle payment processing with escrow, holding funds until task completion.
- Both clients and Taskers should be able to leave reviews and ratings after task completion.
Below the Line (Out of Scope):
- Clients should be able to schedule recurring tasks.
- Clients should be able to book multiple Taskers for large jobs.
- Clients should be able to save favorite Taskers.
- The system should support photo uploads before and after task completion.
- The system should provide in-app messaging between clients and Taskers.
- The system should track location during active tasks.
Non-Functional Requirements
Core Requirements:
- The system should prioritize low latency for search and matching operations (< 200ms p95).
- The system should ensure strong consistency for payments and bookings to prevent double-booking.
- The system should be able to handle high throughput during peak times (100K tasks posted per day, 5M geospatial queries per day).
- The system should maintain 99.99% uptime with multi-region deployment.
Below the Line (Out of Scope):
- The system should comply with PCI DSS for payment data security.
- The system should provide end-to-end encryption for messages.
- The system should implement zero-downtime deployments.
- The system should maintain comprehensive audit logs for all financial transactions.
Clarification Questions & Assumptions:
- Platform: Mobile apps (iOS/Android) and web application.
- Scale: 10M+ registered users (7M clients, 3M Taskers), 500K daily active users, 50K concurrent active tasks.
- Categories: Support 100+ task categories including handyman, cleaning, moving, delivery, assembly, and more.
- Geographic Coverage: Focus on major metropolitan areas across multiple countries.
- Payment Processing: Integration with third-party payment processors (Stripe).
- Background Checks: Integration with third-party verification services (Checkr).
Step 2: Propose High-Level Design and Get Buy-in
Planning the Approach
Before designing the system, it’s important to plan your strategy. Marketplace platforms have two sides to serve: supply (Taskers) and demand (Clients). We’ll build our design sequentially, working through each functional requirement while ensuring both sides are properly served.
Defining the Core Entities
To satisfy our key functional requirements, we’ll need the following entities:
User: Base entity for all users, containing authentication credentials, personal information, and preferences. Both clients and Taskers are users with different roles and capabilities.
Task: A job posting created by a client. Includes description, category, location (with geohash for efficient querying), budget, scheduling preferences (flexible or specific time), status tracking, and references to the client and assigned Tasker.
Tasker Profile: Extended profile for users who provide services. Contains business information, skills and certifications, service areas with radius, hourly rates, aggregate ratings and statistics, verification status (background check, identity, insurance), availability calendar, and performance metrics (completion rate, response time, repeat client rate).
Booking: Links a task to a Tasker, tracking the entire lifecycle. Records scheduled time, duration, pricing (hourly or fixed), payment status, timeline events (created, confirmed, started, completed), and cancellation information if applicable.
Payment: Handles the financial transaction for a booking. Tracks authorization, capture, and release of funds, integrates with Stripe payment processing, manages escrow holding, calculates platform fees, and stores transaction records for accounting compliance.
Review: Bidirectional feedback system between clients and Taskers. Contains rating (1-5 stars), detailed comments, attribute-specific ratings (quality, punctuality, communication), timestamps, and fraud detection flags.
Proposal: When using the bidding system, Taskers submit proposals. Contains custom pricing, estimated hours, proposed datetime, cover letter, expiration time, and current status.
Verification: Tracks various verification checks for Taskers. Records background check results, identity verification, business licenses, insurance coverage, and expiration dates for periodic renewal.
API Design
Task Management Endpoints:
POST /tasks -> Task
GET /tasks/:id -> Task
PUT /tasks/:id -> Task
DELETE /tasks/:id -> Success
GET /tasks -> Task[]
GET /tasks/search -> Task[]
POST /tasks/:id/assign -> Task
POST /tasks/:id/complete -> Task
POST /tasks/:id/dispute -> Dispute
Tasker Discovery Endpoints:
GET /taskers/search -> Tasker[]
GET /taskers/:id -> Tasker
PUT /taskers/:id/availability -> Success
GET /taskers/:id/reviews -> Review[]
POST /taskers/:id/favorite -> Success
Booking Management Endpoints:
POST /bookings -> Booking
GET /bookings/:id -> Booking
PUT /bookings/:id/confirm -> Booking
PUT /bookings/:id/cancel -> Booking
GET /bookings/availability -> Availability
Payment Processing Endpoints:
POST /payments/methods -> PaymentMethod
POST /payments/authorize -> Authorization
POST /payments/capture -> Payment
POST /payments/refund -> Refund
GET /payments/history -> Payment[]
Bidding System Endpoints:
POST /proposals -> Proposal
GET /proposals/:id -> Proposal
PUT /proposals/:id/accept -> Booking
GET /tasks/:id/proposals -> Proposal[]
Review Endpoints:
POST /reviews -> Review
GET /reviews/:id -> Review
GET /users/:id/reviews -> Review[]
PUT /reviews/:id/respond -> Review
High-Level Architecture
Let’s build up the system sequentially, addressing each functional requirement:
1. Clients should be able to post tasks with description, category, location, budget, and preferred time
The core components necessary for task posting are:
- Client Application: Available on iOS, Android, and web. Provides interfaces for creating task posts with all required details. Uses device GPS for location services and integrates with maps for address selection.
- API Gateway: Entry point for all client requests, managing authentication via JWT tokens, rate limiting to prevent abuse, request routing to appropriate microservices, and SSL/TLS termination for secure communication.
- Task Service: Manages the lifecycle of tasks from creation to completion. Validates task data, determines appropriate booking type (instant or bidding), stores task information in the database, and publishes events to notify other services.
- PostgreSQL Database: Primary data store for tasks, users, bookings, and payments. Provides ACID guarantees for critical operations, uses PostGIS extension for geospatial queries, and implements proper indexing for performance.
- Elasticsearch: Secondary search index optimized for full-text search and geospatial queries. Enables fast category-based filtering, location radius searches, and multi-criteria ranking. Data is eventually consistent with the primary database.
Task Posting Flow:
- The client fills out task details in the application and submits the form, triggering a POST request to the Task Service.
- The API Gateway authenticates the request using the JWT token, applies rate limiting rules, and forwards to the Task Service.
- The Task Service validates the task data, determines if it’s suitable for instant booking or should go to the bidding system based on complexity and budget.
- The service creates a Task entity in PostgreSQL and indexes it in Elasticsearch for searchability.
- The Task Service publishes a task created event to the message queue for downstream processing.
- The service returns the created Task entity to the client application.
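The routing decision in step 3 can be sketched as a simple rule. The category list and budget threshold below are illustrative assumptions, not actual TaskRabbit policy:

```python
# Sketch of the Task Service's booking-type routing. Categories and the
# budget threshold are illustrative assumptions for this example.
STANDARDIZED_CATEGORIES = {"cleaning", "furniture_assembly", "tv_mounting", "delivery"}
BIDDING_BUDGET_THRESHOLD = 500  # USD; above this, route to the bidding system

def choose_booking_type(category: str, budget: float) -> str:
    """Return 'instant' for standardized, predictably priced tasks,
    'bidding' for complex or high-budget ones."""
    if category in STANDARDIZED_CATEGORIES and budget <= BIDDING_BUDGET_THRESHOLD:
        return "instant"
    return "bidding"
```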
2. Clients should be able to browse and search for Taskers by skill, rating, availability, and location
We extend our design to support Tasker discovery:
- Tasker Service: Manages Tasker profiles, skills, and certifications. Handles onboarding workflows for new Taskers, maintains service area definitions with geographic boundaries, and synchronizes profile data to Elasticsearch for efficient searching.
- Elasticsearch Tasker Index: Specialized index containing Tasker profiles with geospatial fields for location-based queries, skill arrays for category matching, rating scores for quality filtering, and availability flags for real-time filtering.
Tasker Search Flow:
- The client enters search criteria including category, location, date/time, and optional filters for rating or price range.
- The API Gateway forwards the search request to the Tasker Service.
- The Tasker Service constructs an Elasticsearch query combining geospatial filters (geo_distance for radius), skill matching (terms query on skills array), availability checks, and rating thresholds.
- Elasticsearch returns a ranked list of matching Taskers sorted by relevance score, distance, and rating.
- The Tasker Service enriches the results with real-time availability data from the cache and returns the list to the client.
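The query built in step 3 might look like this minimal sketch; the index field names (`skills`, `is_available`, `rating`, `location`) are assumptions for illustration, not a real index mapping:

```python
def build_tasker_query(category, lat, lon, radius_km=15, min_rating=4.0):
    """Construct the Elasticsearch bool query combining skill, availability,
    rating, and geo_distance filters, sorted by distance then rating."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"skills": category}},
                    {"term": {"is_available": True}},
                    {"range": {"rating": {"gte": min_rating}}},
                    {"geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }},
                ]
            }
        },
        "sort": [
            {"_geo_distance": {"location": {"lat": lat, "lon": lon},
                               "order": "asc", "unit": "km"}},
            {"rating": "desc"},
        ],
    }
```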
3. The system should match tasks to qualified Taskers based on skills, location, availability, and other factors
We introduce new components for intelligent matching:
- Matching Service: Core algorithm engine that combines rule-based scoring with machine learning models. Queries Elasticsearch for candidate Taskers, applies multi-factor scoring algorithms considering skills match, location proximity, availability alignment, reputation metrics, price competitiveness, and response patterns.
- ML Ranking Model: Gradient boosted decision tree model trained on historical booking data. Predicts probability of successful booking completion given task-Tasker pairs. Features include Tasker performance metrics, historical interaction patterns, task complexity indicators, and contextual factors like time and location.
- Redis Cache: In-memory data store for high-speed access to frequently used data. Caches Tasker availability bitmaps, recent search results with short TTL, user session data, and rate limiting counters.
Matching Flow:
- When a task is posted, the Matching Service receives a notification via the message queue.
- The service performs an Elasticsearch query to find candidate Taskers within the service radius who have the required skills.
- For each candidate, the service calculates a multi-factor match score considering skills match (25 points for direct skill match), location proximity (20 points, decreasing with distance), availability alignment (15 points for matching schedule), rating and reputation (15 points for high-rated Taskers), price competitiveness (10 points for market-rate pricing), response rate and speed (10 points for responsive Taskers), and success metrics (5 points for completion rate).
- The ML model further ranks candidates based on predicted booking success probability.
- The service returns a ranked list of recommended Taskers to the client or notifies top candidates about the task opportunity.
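The weighted scoring in step 3 can be sketched as follows. Only the point weights come from the scheme above; the linear decay curves for distance and price are illustrative choices:

```python
def match_score(skills_match, distance_km, availability_overlap,
                rating, price_ratio, response_rate, completion_rate,
                max_radius_km=15.0):
    """Multi-factor match score out of 100, using the point weights above.
    Fractional inputs (overlap, response_rate, completion_rate) are in [0, 1];
    price_ratio of 1.0 means market-rate pricing."""
    score = 0.0
    score += 25.0 if skills_match else 0.0                   # direct skill match
    score += 20.0 * max(0.0, 1.0 - distance_km / max_radius_km)  # proximity decay
    score += 15.0 * availability_overlap                     # schedule alignment
    score += 15.0 * (rating / 5.0)                           # reputation
    score += 10.0 * max(0.0, 1.0 - abs(price_ratio - 1.0))   # price competitiveness
    score += 10.0 * response_rate                            # responsiveness
    score += 5.0 * completion_rate                           # success metrics
    return score
```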
4. Clients should be able to book Taskers either instantly or through a bidding system
The system supports two distinct booking workflows:
- Instant Booking: For standardized tasks with predictable pricing (furniture assembly, TV mounting, cleaning). Clients see fixed prices from pre-qualified Taskers and can book immediately with instant confirmation. Requires real-time availability checking and optimistic locking to prevent double-booking.
- Bidding System: For complex or custom tasks with variable pricing (home renovation, specialized repairs). Task is posted to the marketplace, multiple Taskers submit proposals with custom pricing and timelines, client reviews proposals and selects preferred Tasker, and booking is created upon acceptance.
Instant Booking Flow:
- After searching, the client selects a Tasker with instant booking enabled and chooses a time slot.
- The Booking Service checks real-time availability in Redis using bitmap data structures where each bit represents a 15-minute time slot.
- If available, the service acquires a distributed lock on the time slot to prevent race conditions.
- The service creates a Booking entity, blocks the time slots in the Tasker’s calendar, and triggers payment authorization.
- Upon successful payment authorization, the booking is confirmed and both parties are notified.
Bidding Flow:
- The client posts a task, and the system identifies it as suitable for bidding based on complexity or budget thresholds.
- The Task Service publishes the task to the marketplace and notifies matching Taskers via push notifications.
- Interested Taskers submit proposals with custom pricing, estimated hours, and proposed schedules.
- The Proposal Service stores proposals and notifies the client of each new submission.
- The client reviews proposals, considering price, Tasker ratings, proposal details, and response time.
- Upon selecting a proposal, the Booking Service creates a booking and triggers payment authorization.
5. The system should handle payment processing with escrow, holding funds until task completion
We add payment infrastructure:
- Payment Service: Orchestrates the entire payment lifecycle. Integrates with Stripe for payment processing, manages escrow holding and release, handles refund workflows, calculates platform fees (typically 15%), and stores transaction records for compliance and accounting.
- Stripe Integration: Third-party payment processor handling sensitive card data. Uses Payment Intents API with manual capture for escrow functionality, Stripe Connect for Tasker payouts, webhook processing for asynchronous events, and PCI DSS compliance.
Payment Flow:
- When a booking is confirmed, the Payment Service creates a Stripe Payment Intent with manual capture, authorizing the estimated amount on the client’s payment method.
- The authorized funds are held but not captured, allowing for cancellation without charges.
- When the task starts, the system captures the authorized payment, moving funds into escrow.
- Upon task completion, there’s a 24-hour dispute window during which either party can raise issues.
- After the dispute window passes, the Payment Service transfers funds to the Tasker’s Stripe Connect account minus the platform fee.
- Transaction records are stored in PostgreSQL for accounting, auditing, and tax reporting purposes.
Refund Handling: If a task is cancelled or disputed, the system processes refunds by:
- Checking payment status (authorized vs. captured).
- Issuing a Stripe refund for captured payments.
- Canceling the Payment Intent for authorized-only payments.
- Reversing the transfer if funds were already paid to the Tasker.
- Updating booking status and payment records.
6. Both clients and Taskers should be able to leave reviews and ratings after task completion
We complete the trust ecosystem:
- Review Service: Manages bidirectional review collection and aggregation. Enforces review policies (only after task completion, one review per party per booking), detects fraudulent review patterns, calculates time-weighted aggregate ratings giving more weight to recent reviews, and updates Tasker profiles and search indices.
- Fraud Detection System: Analyzes reviews for suspicious patterns including same-day review volume spikes, text similarity detection for copy-paste reviews, new account reviews flagged for verification, consistent rating patterns (all 5-stars), and IP-based anomaly detection.
Review Flow:
- After task completion, both the client and Tasker receive prompts to submit reviews.
- The reviewer submits a rating (1-5 stars), optional detailed comment, and attribute-specific ratings (quality, punctuality, communication).
- The Review Service validates that the reviewer was part of the booking and hasn’t already reviewed it.
- The fraud detection system analyzes the review for suspicious patterns and may flag for manual moderation.
- The service stores the review and asynchronously updates aggregate ratings on the Tasker profile.
- The updated rating is synchronized to Elasticsearch to affect future search rankings.
- The reviewee is notified and can respond to the review within a time window.
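The time-weighted aggregation mentioned above could look like this sketch, where a review's weight halves with age; the 180-day half-life is an assumed parameter, not a documented TaskRabbit value:

```python
def aggregate_rating(reviews, half_life_days=180.0):
    """Time-weighted average rating: a review's weight halves every
    half_life_days, so recent reviews count more than old ones.
    `reviews` is a list of (stars, age_days) tuples."""
    if not reviews:
        return None
    weights = [0.5 ** (age / half_life_days) for _, age in reviews]
    total = sum(w * stars for (stars, _), w in zip(reviews, weights))
    return round(total / sum(weights), 2)
```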
Step 3: Design Deep Dive
With the core functional requirements met, it’s time to dig into the non-functional requirements via deep dives. These are the critical areas that separate good designs from great ones.
Deep Dive 1: How do we implement efficient geospatial matching at scale?
TaskRabbit needs to perform millions of geospatial queries per day to match clients with nearby Taskers. Traditional latitude/longitude queries are inefficient at scale.
Problem: Inefficient Geographic Queries
Without optimization, finding Taskers within a radius requires calculating distances for all Taskers in the database, resulting in O(N) complexity with expensive trigonometric calculations. Even with indexes on latitude and longitude columns, standard B-tree indexes don’t work well for multi-dimensional geographic data.
Solution: Geohash and Specialized Indexes
We use multiple strategies for efficient geospatial operations:
Geohash Encoding: A hierarchical spatial data structure that encodes latitude and longitude into a single string. Nearby locations share common prefixes, enabling prefix-based searches. For example, geohash “9q8yy” represents a specific area in San Francisco, and all locations starting with “9q8yy” are nearby. Different precision levels allow trading accuracy for performance.
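A minimal geohash encoder illustrating the prefix property, interleaving longitude and latitude bits into 5-bit base-32 characters:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode lat/lon by binary-subdividing each range, interleaving
    longitude and latitude bits, 5 bits per output character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, chars = 0, 0, []
    even = True  # even-positioned bits encode longitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Note how truncating the hash widens the cell: the precision-4 hash is a prefix of the precision-5 hash, which is what makes prefix-based index scans work.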
PostGIS for Primary Database: PostgreSQL with PostGIS extension provides native geographic data types and functions. Store locations as GEOMETRY or GEOGRAPHY types, use R-tree spatial indexes (GiST) for fast proximity queries, and support complex queries like radius searches, polygon containment, and distance calculations. The R-tree automatically balances and optimizes as data grows.
Elasticsearch for Search: The geo_point data type combined with geo_distance and geo_bounding_box queries provides fast geospatial search. Results can combine geographic proximity with other filters like skills, availability, and rating. The distributed nature allows horizontal scaling of search capacity.
Implementation Approach: When a task is posted with location coordinates, we compute the geohash and store it alongside lat/long in both PostgreSQL and Elasticsearch. For matching queries, we first perform a bounding box filter to narrow candidates, then use precise distance calculations on the reduced set. Redis caches recent proximity search results with 60-second TTL. Service areas are pre-computed polygons checked against task locations.
Performance Optimization: By combining geohash prefix filtering with R-tree indexes, we reduce the candidate set by 95% before precise calculations. Most queries complete in under 50ms at p95. Geographic sharding by region further distributes load across database clusters.
Deep Dive 2: How do we prevent double-booking and calendar conflicts?
Real-time availability management is crucial for instant booking. We must prevent race conditions where multiple clients try to book the same Tasker for overlapping times.
Problem: Calendar Conflicts
With 50K concurrent active bookings and thousands of booking requests per minute, race conditions are inevitable without proper synchronization. Traditional database row locking is too slow for real-time booking flows. We need sub-second response times while guaranteeing no double-bookings.
Solution: Redis Bitmap Calendar with Distributed Locking
We use Redis bitmaps to represent availability with exceptional space efficiency and speed:
Bitmap Representation: Each Tasker’s availability for a given date is stored as a 96-bit bitmap (24 hours × four 15-minute slots per hour). Bit value 1 indicates available, 0 indicates blocked. For example, bit 36 represents 09:00-09:15. A 2-hour booking starting at 09:00 requires bits 36-43 (8 consecutive 15-minute slots).
Distributed Locking: Before modifying availability, we acquire a lock on the Tasker’s calendar using Redis SET with NX and EX flags. The NX flag ensures atomic creation only if the key doesn’t exist. The EX flag sets expiration (typically 5 seconds) to prevent deadlocks. If lock acquisition fails, we return a conflict error immediately.
Availability Check Flow:
- Client requests to book a Tasker for a specific time and duration.
- The Booking Service acquires a distributed lock for that Tasker’s calendar.
- It reads the bitmap for the requested date and checks if all required time slots are available (all bits are 1).
- If available, it flips those bits to 0 (marking as blocked), creates the booking record, and releases the lock.
- If any slot is blocked, it returns a conflict error and releases the lock without changes.
- The lock automatically expires after 5 seconds even if the service crashes, preventing permanent blocking.
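The slot arithmetic behind this flow can be modeled in a few lines. This is a pure-Python sketch; in production the bitmap would live in Redis (via its bitmap commands) and the check-and-block step would run only while holding the SET NX EX lock described above:

```python
SLOTS_PER_DAY = 96  # 24 hours x four 15-minute slots

class DayCalendar:
    """In-memory model of one Tasker's availability bitmap for one date."""
    def __init__(self):
        # Bit i == 1 means slot i is available; all slots start open.
        self.bits = (1 << SLOTS_PER_DAY) - 1

    @staticmethod
    def slot_index(hour: int, minute: int) -> int:
        return hour * 4 + minute // 15   # e.g. 09:00 -> bit 36

    def book(self, start_slot: int, num_slots: int) -> bool:
        """Check-and-block a run of consecutive slots; False on conflict."""
        mask = ((1 << num_slots) - 1) << start_slot
        if self.bits & mask != mask:     # at least one slot already blocked
            return False
        self.bits &= ~mask               # flip the booked slots to 0
        return True
```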
Optimistic Locking Alternative: For less critical operations, we use versioning. Each calendar update increments a version number. When updating, we check that the version matches expectations. If it changed, someone else modified the calendar and we retry.
Calendar Synchronization: For Taskers who integrate external calendars (Google Calendar, Outlook), we run periodic sync jobs every 15 minutes. External events are fetched and corresponding time slots are blocked in Redis. We use exponential backoff for sync failures and notify Taskers of sync issues.
Deep Dive 3: How do we implement secure escrow payment processing?
Payment handling requires careful orchestration to protect both clients and Taskers while maintaining compliance.
Problem: Trust and Security in Payments
Clients need assurance that Taskers will complete the work before payment. Taskers need assurance they’ll be paid for completed work. The platform must hold funds securely, handle disputes, and comply with financial regulations.
Solution: Multi-Stage Payment with Stripe
We implement a carefully orchestrated payment flow:
Authorization Phase: When a booking is confirmed, we create a Stripe Payment Intent with manual capture mode. This authorizes the payment method and holds funds with the card issuer but doesn’t charge the client yet. The authorization is valid for 7 days (Stripe’s limit). We store the Payment Intent ID for later capture.
Capture Phase: When the task starts (or at booking confirmation for advance scheduling), we capture the authorized payment. This charges the client’s payment method and moves funds to the platform’s Stripe account (escrow). The timing depends on the booking model and cancellation policy.
Holding Period: After task completion, there’s a mandatory 24-hour dispute window. During this time, funds remain in the platform’s Stripe account. Either party can file a dispute that triggers a manual review workflow. The dispute system freezes further payment processing until resolution.
Release Phase: After the dispute window expires with no disputes, we initiate a transfer to the Tasker’s Stripe Connect account. We calculate the platform fee (typically 15% of the transaction), subtract it from the total, and transfer the remainder. The transfer includes metadata for reconciliation and tax reporting.
Refund Scenarios: For cancellations before capture, we simply cancel the Payment Intent with no charge to the client. For cancellations after capture but before completion, we issue a Stripe refund with the appropriate amount based on cancellation policy. For disputes, we may issue partial refunds after investigation. If funds were already transferred to a Tasker, we create a reversal on the Stripe transfer to reclaim funds before refunding the client.
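The fee math and refund routing above can be summarized in a small sketch. The payment-state labels are illustrative, not Stripe's actual status strings:

```python
PLATFORM_FEE_RATE = 0.15  # 15% platform fee

def tasker_payout(amount_cents: int) -> int:
    """Amount transferred to the Tasker's Connect account after the fee."""
    return amount_cents - int(amount_cents * PLATFORM_FEE_RATE)

def refund_action(payment_state: str) -> list:
    """Map a payment state to the refund steps described above."""
    if payment_state == "authorized":
        return ["cancel_payment_intent"]              # never charged
    if payment_state == "captured":
        return ["refund_charge"]                      # funds still in escrow
    if payment_state == "transferred":
        return ["reverse_transfer", "refund_charge"]  # reclaim from Tasker first
    raise ValueError(f"unknown payment state: {payment_state}")
```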
Compliance Considerations: All payment data is stored in Stripe’s PCI-compliant vault, never touching our servers. Transaction records in our database contain only references (Payment Intent IDs) and metadata. We maintain detailed audit logs of all payment state transitions for regulatory compliance. Records are retained for 7 years to meet financial reporting requirements.
Deep Dive 4: How do we handle background checks and verification at scale?
Trust and safety is paramount for a marketplace connecting strangers in physical spaces.
Problem: Verification Scalability and Speed
With thousands of new Tasker applications daily, manual verification is impossible. Background checks take time but users expect fast onboarding. Different jurisdictions have different legal requirements. Verification status must be tracked and renewed periodically.
Solution: Automated Verification Pipeline with Third-Party Integration
We implement a multi-stage verification system:
Identity Verification: Integration with Stripe Identity for document verification. Taskers upload government-issued ID and complete a liveness check (selfie matching). Stripe’s API returns verification results including name match, date of birth verification, address verification, and document authenticity. This completes in minutes with automated fraud detection.
Background Check: Integration with Checkr for comprehensive background screening. When a Tasker completes identity verification, we automatically initiate a background check. The check includes criminal record search (federal and county), sex offender registry check, identity verification (SSN trace), and motor vehicle report (for delivery tasks). Checkr sends webhook notifications as results arrive (usually 1-3 business days).
Automated Decision Engine: When background check results arrive via webhook, our system automatically evaluates them against predefined criteria. Auto-reject conditions include sex offender registry records, violent felonies within 7 years, identity verification failure, and theft or fraud convictions within 5 years. Auto-approve conditions include clean record or only minor infractions over 10 years old. Manual review is required for intermediate cases like non-violent misdemeanors, sealed or expunged records, or foreign records requiring interpretation.
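A sketch of the decision engine's rule evaluation; the report dict shape is an illustrative stand-in for the actual Checkr webhook payload:

```python
def evaluate_background_check(report: dict) -> str:
    """Apply the auto-reject / auto-approve / manual-review rules above
    to a simplified background-check report."""
    # Auto-reject conditions
    if (report.get("sex_offender_registry")
            or report.get("violent_felony_years_ago", 99) < 7
            or not report.get("identity_verified", False)
            or report.get("theft_fraud_years_ago", 99) < 5):
        return "rejected"
    # Auto-approve: clean record, or only minor infractions over 10 years old
    records = report.get("records", [])
    if not records or all(r["severity"] == "minor" and r["years_ago"] > 10
                          for r in records):
        return "approved"
    # Everything in between goes to a human
    return "manual_review"
```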
Verification Status Tracking: Each verification has status tracking (pending, approved, rejected, expired) and expiration dates (background checks expire after 1 year). Automated renewal notifications are sent 30 days before expiration. Expired verifications result in automatic deactivation of the Tasker profile until renewal is complete. Webhook handling ensures eventually consistent status updates.
Insurance Verification: For certain high-risk categories (electrical work, plumbing, roofing), we require proof of general liability insurance. Taskers upload policy documents which are verified manually or through insurance provider API integrations. We track coverage amounts, policy numbers, and expiration dates. Automated checks prevent booking for tasks requiring insurance when coverage is missing or expired.
Compliance and Privacy: Background check data is stored in isolated database with encrypted fields for sensitive information. Access is strictly controlled with audit logging. Data retention follows legal requirements (typically 7 years). Taskers can request copies of their verification reports. We comply with Fair Credit Reporting Act (FCRA) requirements including pre-adverse action notices and dispute resolution processes.
Deep Dive 5: How do we implement intelligent matching that improves over time?
Simple rule-based matching misses important patterns. Machine learning can optimize for booking success.
Problem: Suboptimal Matching
Rule-based scoring is rigid and doesn’t adapt to changing patterns. Different task categories may value different factors. Seasonal trends affect availability and pricing. User preferences evolve over time. A match that looks good on paper may fail in practice.
Solution: Hybrid Scoring with Machine Learning
We combine rule-based scoring with learned models:
Rule-Based Foundation: The base scoring algorithm provides explainable, consistent results considering skills match (25 points), location proximity (20 points), availability alignment (15 points), rating and reputation (15 points), price competitiveness (10 points), response rate and speed (10 points), and success metrics (5 points). This ensures minimum standards are met and provides fallback when ML models fail.
Feature Engineering: For each task-Tasker pair, we extract features including Tasker attributes (rating, completion rate, response time, task count, years of experience), match characteristics (distance, skill match level, availability overlap, price ratio to market rate), historical interactions (previous bookings between this client and Tasker, client’s booking history, Tasker’s client retention rate), contextual factors (time of day, day of week, seasonality, local events), and task complexity indicators (budget size, description length, task duration).
Model Training: We train gradient boosted decision trees (using XGBoost) on historical booking data. The training dataset includes 10M+ historical booking attempts labeled as successful (booking completed with 4+ star rating) or unsuccessful (cancelled, disputed, or low rating). We perform weekly retraining with fresh data to capture evolving patterns. Cross-validation prevents overfitting. Feature importance analysis guides feature engineering.
Prediction and Ranking: For a new task, we generate the candidate set using Elasticsearch geo-queries and skill filtering. For each candidate, we calculate both the rule-based score and the ML model’s predicted success probability. The final ranking combines both: final_score = 0.6 × ml_probability + 0.4 × (rule_score / 100), normalizing the 0-100 rule score so both terms share the same scale. This hybrid approach provides the learning capabilities of ML while maintaining explainability and stability.
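The hybrid ranking can be sketched directly, normalizing the 0-100 rule score so both terms sit on a 0-1 scale:

```python
def final_score(ml_probability: float, rule_score: float) -> float:
    """Blend the ML prediction (0-1) with the rule-based score (0-100)."""
    return 0.6 * ml_probability + 0.4 * (rule_score / 100.0)

def rank_candidates(candidates):
    """candidates: list of (tasker_id, ml_probability, rule_score) tuples,
    returned best-first by blended score."""
    return sorted(candidates,
                  key=lambda c: final_score(c[1], c[2]),
                  reverse=True)
```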
A/B Testing and Evaluation: We continuously A/B test matching algorithms with metrics including booking completion rate, time to first booking, average rating, client repeat rate, and Tasker utilization. Statistical significance testing ensures changes improve outcomes. Failed experiments are quickly rolled back. Successful experiments are gradually rolled out to all users.
Model Monitoring: We monitor prediction quality metrics including calibration (predicted probabilities match actual rates), feature drift (input distributions change over time), and prediction latency (models respond within 50ms). Automated alerts trigger investigation when metrics degrade. Shadow mode testing evaluates new models without affecting production.
Deep Dive 6: How do we detect and prevent review fraud?
Review systems are vulnerable to manipulation that erodes trust.
Problem: Review Manipulation
Malicious actors may attempt to boost ratings through fake reviews, attack competitors with negative reviews, or game the system with review-for-review schemes. Automated fraud detection is essential at scale.
Solution: Multi-Layer Fraud Detection
We implement several detection strategies:
Pattern Analysis: Statistical analysis identifies anomalous patterns. Same-day review volume spikes (more than 5 reviews for one Tasker in a day) trigger investigation. Reviewer behavior analysis flags accounts that only give 5-star or 1-star reviews. Geographic clustering detects review fraud rings (multiple reviews from same IP block). Timing analysis identifies coordinated review campaigns.
Text Analysis: Natural language processing detects suspicious text. Duplicate detection uses fuzzy matching to find copy-paste reviews (90%+ similarity). Template detection identifies reviews following suspicious patterns. Sentiment analysis flags inconsistent rating-text pairs (5-star rating with negative language). Gibberish detection catches randomly generated text.
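The duplicate-detection step can be sketched with stdlib fuzzy matching; `difflib.SequenceMatcher` stands in here for whatever similarity engine production uses, with the 90% threshold from the text:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # the 90% cut-off described above

def is_near_duplicate(a: str, b: str) -> bool:
    """Fuzzy-match two review texts; ratios at or above the
    threshold flag likely copy-paste reviews."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY_THRESHOLD

def find_duplicates(reviews):
    """Return index pairs of near-duplicate reviews.
    O(n^2) pairwise comparison is fine for one Tasker's recent reviews;
    at corpus scale you would shingle + MinHash instead."""
    pairs = []
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if is_near_duplicate(reviews[i], reviews[j]):
                pairs.append((i, j))
    return pairs
```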
Account Analysis: New account reviews are scrutinized as throwaway accounts are common in fraud. Account age threshold requires accounts to be at least 24 hours old before reviewing. Verified booking requirement ensures reviewer actually used the service. Payment verification confirms financial transaction occurred. Phone verification adds friction for bulk account creation.
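The account-analysis gates compose naturally into a single eligibility check run before a review is accepted. This is a hedged sketch with illustrative field names; the real system would query these facts from the account and booking services:

```python
from datetime import datetime, timedelta

MIN_ACCOUNT_AGE = timedelta(hours=24)  # threshold from the text

def can_submit_review(account, booking, now=None):
    """Apply the account-analysis gates in order.
    Returns (allowed, reason); the first failing gate short-circuits."""
    now = now or datetime.utcnow()
    if now - account["created_at"] < MIN_ACCOUNT_AGE:
        return False, "account younger than 24 hours"
    if booking is None or booking["reviewer_id"] != account["id"]:
        return False, "no verified booking for this reviewer"
    if not booking["payment_captured"]:
        return False, "no completed payment on the booking"
    if not account["phone_verified"]:
        return False, "phone number not verified"
    return True, "ok"
```

Returning a reason code rather than a bare boolean lets the moderation queue and user-facing error messages share one source of truth.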
Graph Analysis: Relationship mapping identifies fraud networks. Review exchange detection finds reciprocal reviewing between accounts. Tasker relationships identify coordinated accounts reviewing same Taskers. Client-Tasker relationship analysis detects personal connections.
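The simplest graph signal above, reciprocal reviewing, reduces to a set lookup over directed review edges. A minimal sketch:

```python
def reciprocal_pairs(reviews):
    """reviews: iterable of (reviewer_id, reviewee_id) edges.
    Returns the set of unordered account pairs that reviewed each
    other -- the most basic signal of a review-exchange ring."""
    seen = set(reviews)
    return {
        tuple(sorted((a, b)))
        for a, b in seen
        if (b, a) in seen and a != b
    }
```

Richer signals (coordinated accounts reviewing the same Taskers, client-Tasker personal connections) require community detection over the full graph, but reciprocal pairs are cheap enough to compute on every new review.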
Manual Review Queue: High-risk reviews are flagged for human moderation based on fraud score from automated systems. Moderators review flagged content, supporting evidence, account histories, and communication logs. Decisions include approve and publish, reject and hide, request additional information, or suspend account for investigation. Feedback from manual review retrains automated models.
Consequences: Confirmed fraudulent reviews are removed and don’t affect ratings. Offending accounts face graduated penalties from warning notices, temporary suspension, permanent ban, to legal action for systematic fraud. Victims of fraudulent negative reviews can request expedited review and removal.
Deep Dive 7: How do we handle high traffic during peak demand periods?
Weekend and holiday demand can be 3x normal load. The system must scale gracefully.
Problem: Traffic Spikes
Peak times see dramatic increases in task posting, searches, and bookings. Database hotspots emerge for popular Taskers. Payment processing slows under load. Real-time features degrade. Users abandon the platform if performance suffers.
Solution: Comprehensive Scaling Strategy
We implement multiple scaling techniques:
Horizontal Service Scaling: All services are stateless and deployed in auto-scaling groups. CPU and memory metrics trigger scale-up within minutes. Pre-emptive scaling uses historical patterns to scale before peak times. Database connection pooling prevents connection exhaustion. Load balancers distribute traffic with health checks.
Database Scaling: Read replicas handle the majority of traffic (search queries, profile views, task browsing). Write operations go to the primary database with replication to followers. Geographic sharding distributes data by region since most interactions are local. Hot Tasker data is cached aggressively in Redis. Elasticsearch handles search load independently with its own cluster scaling.
Caching Strategy: Multi-layer caching reduces database load. Application-level cache (in-memory) stores frequently accessed objects with 5-minute TTL. Redis cache layer stores search results (60s TTL), Tasker profiles (5min TTL), availability data (30s TTL), and session data (30min TTL). CDN caches static assets, profile photos, and public task listings.
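The application-level cache layer can be illustrated with a minimal in-process TTL cache. This sketch uses lazy eviction on read and an injectable clock for testability; production would use Redis or a library like cachetools rather than hand-rolling this:

```python
import time

class TTLCache:
    """Minimal in-process cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazy eviction on read
            return default
        return value
```

Usage mirrors the TTLs above: `cache.set(profile_key, profile, 300)` for a Tasker profile, `cache.set(avail_key, slots, 30)` for availability data.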
Message Queue Buffering: Kafka message queues absorb traffic spikes. Task creation events, matching requests, notification delivery, and payment processing are all asynchronous through queues. Consumer groups scale independently of producers. Message persistence ensures no data loss during failures. Backpressure mechanisms prevent overwhelming downstream services.
Rate Limiting: Aggressive rate limiting prevents abuse and ensures fair resource allocation. Per-user limits prevent individual users from overwhelming the system (100 requests per minute). Per-IP limits catch distributed attacks (1000 requests per minute). Endpoint-specific limits protect expensive operations (10 searches per minute). Authenticated users get higher limits than anonymous users.
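One common way to implement the per-user and per-IP limits above is a token bucket, which allows short bursts while enforcing a steady average rate. A minimal single-key sketch with an injectable clock (production would keep bucket state in Redis so all API nodes share it):

```python
import time

class TokenBucket:
    """Token-bucket limiter for one key (user, IP, or endpoint).
    E.g. capacity=100, refill_per_second=100/60 approximates the
    100-requests-per-minute per-user limit from the text."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill = refill_per_second
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```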
Graceful Degradation: Under extreme load, we prioritize core functionality. Non-critical features degrade first (analytics updates, email notifications, photo processing). Search results may return slightly stale data from cache. Payment authorization has highest priority and never degrades. Clear error messages inform users of temporary limitations.
Capacity Planning: Continuous monitoring tracks resource utilization. Forecasting models predict capacity needs based on growth trends and seasonality. Load testing simulates peak traffic to identify bottlenecks. Quarterly capacity reviews plan infrastructure expansion. Cost optimization balances performance with expenditure.
Step 4: Wrap Up
In this chapter, we proposed a system design for a marketplace platform like TaskRabbit. If there is extra time at the end of the interview, here are additional points to discuss:
Additional Features:
- Recurring Tasks: Subscription model for regular services like weekly cleaning or monthly maintenance with automatic scheduling and payments.
- Team Dispatch: Multi-Tasker coordination for large jobs requiring multiple people or skills with team formation algorithms and load balancing.
- Smart Pricing: Dynamic pricing based on demand, time of day, Tasker experience, and task complexity using supply-demand algorithms.
- Quality Guarantees: Money-back guarantee program with automated claim processing and quality assurance workflows.
- Skills Certification: Platform-verified skill testing with partnerships with certification bodies and skill assessments.
- Real-time Tracking: WebSocket connections for live location tracking during active tasks with privacy controls.
- AI Chatbot: Automated customer support for common questions using natural language processing and intent classification.
Scaling Considerations:
- Geographic Sharding: Partition data by region since most operations are geographically local with cross-region replication for disaster recovery.
- Database Optimization: Implement read replicas for scaling reads, connection pooling to manage connections efficiently, and query optimization with proper indexing.
- Microservice Architecture: Decompose services further for independent scaling with service mesh for inter-service communication.
- Event-Driven Architecture: Increase asynchronicity using event sourcing for audit trails and CQRS pattern for read-write separation.
Error Handling:
- Payment Failures: Retry logic with exponential backoff for transient failures, fallback to alternative payment methods, clear error messages to users, and automatic refund processing for failures after capture.
- Service Failures: Circuit breakers to prevent cascading failures, graceful degradation of non-critical features, health checks and automatic service recovery, and failover to backup services.
- Network Failures: Idempotency keys prevent duplicate operations, request timeout handling with appropriate user feedback, retry policies with jitter to prevent thundering herd, and message queue persistence ensures no data loss.
Security Considerations:
- Data Protection: PII encryption at rest using AES-256, TLS 1.3 for data in transit, tokenization of sensitive payment data, and secure key management with rotation policies.
- Access Control: Role-based access control (RBAC) for internal tools, multi-factor authentication for sensitive operations, API key management for third-party integrations, and audit logging of all access to sensitive data.
- Fraud Prevention: IP-based anomaly detection for suspicious patterns, device fingerprinting to track malicious users, payment velocity checks to prevent fraud, and account takeover protection with behavioral analysis.
Monitoring and Analytics:
- Business Metrics: Track task posting rate, booking completion rate, average booking value, Tasker utilization rate, client retention rate, and revenue metrics by category and region.
- Performance Metrics: Monitor API latency (p50, p95, p99), database query performance, cache hit rates, error rates by endpoint, and system resource utilization.
- Operational Metrics: Track background check processing time, payment success rate, notification delivery rate, matching success rate, and verification backlog.
- Alerting: Configure alerts for payment failure rate exceeding 2%, booking conflict rate above 0.5%, search latency p95 over 300ms, verification backlog exceeding 500, and unusual fraud detection patterns.
Future Improvements:
- Machine Learning Enhancements: Demand prediction for optimal Tasker positioning, churn prediction to retain high-value users, price optimization using reinforcement learning, and task duration estimation from descriptions.
- Platform Expansion: B2B services for enterprise clients, white-label solutions for other marketplaces, API for third-party integrations, and international expansion with localization.
- Sustainability: Route optimization for eco-friendly service delivery, carbon footprint tracking and offsetting, promote Taskers using electric vehicles, and paperless documentation.
Congratulations on getting this far! Designing TaskRabbit is a complex system design challenge involving marketplace dynamics, real-time coordination, financial transactions, and trust & safety. The key is building a foundation that serves both sides of the marketplace while maintaining quality, security, and scalability.
Summary
This comprehensive guide covered the design of a marketplace platform like TaskRabbit, including:
- Core Functionality: Task posting, Tasker discovery, intelligent matching, instant booking and bidding systems, escrow payments, and bidirectional reviews.
- Key Challenges: Geospatial queries at scale, calendar conflict prevention, secure payment processing, automated verification pipelines, review fraud detection, and peak load handling.
- Solutions: PostGIS and Elasticsearch for geospatial operations, Redis bitmaps with distributed locking for calendars, Stripe integration with escrow workflows, automated verification with Checkr integration, ML-enhanced matching algorithms, and comprehensive fraud detection systems.
- Scalability: Horizontal scaling with stateless services, database read replicas and sharding, multi-layer caching strategies, message queues for load buffering, and graceful degradation under extreme load.
The design demonstrates how to build a two-sided marketplace with strong consistency for critical operations, eventual consistency for search and discovery, sophisticated matching algorithms, and comprehensive trust and safety systems that scale to millions of users.