Design Instacart

Instacart is a comprehensive grocery delivery platform that connects customers with personal shoppers who pick and deliver groceries from local stores. The system must handle hundreds of thousands of products across tens of thousands of stores, intelligent shopper-order matching, real-time inventory synchronization, substitution recommendations, and precise delivery time slot management.

This design focuses on building a production-grade system that handles peak demand (weekends, holidays), maintains real-time consistency between store inventory and platform data, optimizes batch order fulfillment, and provides seamless communication throughout the shopping and delivery process.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For user-facing applications like this, functional requirements are the “Users should be able to…” statements, whereas non-functional requirements define system qualities via “The system should…” statements.

Functional Requirements

Core Requirements (Priority 1-3):

Customers should be able to browse and search products from multiple grocery stores with real-time availability.
Customers should be able to add items to a cart, apply promotions, and place orders with selected delivery time slots.
Upon order placement, the system should assign a nearby available shopper, potentially batching multiple orders for efficiency.
Shoppers should be able to shop at stores, scan items, handle out-of-stock situations with substitutions, and deliver to customers.
The system should maintain real-time inventory synchronization from partner store POS systems.
The system should provide real-time order tracking with GPS location updates and ETA calculations.

Below the Line (Out of Scope):

Customers should be able to rate shoppers and provide feedback.
Shoppers should be able to track earnings and performance metrics.
The system should support scheduled orders for future delivery.
The system should handle multi-store orders (pharmacy + grocery).
The system should provide loyalty programs and personalized recommendations.

Non-Functional Requirements

Core Requirements:

The system should provide low latency for product search (< 200ms) and order placement (< 2 seconds).
The system should ensure strong consistency for inventory reservations to prevent overselling.
The system should maintain eventual consistency for catalog updates and analytics.
The system should handle high throughput during peak periods (300+ orders per second).
The system should achieve 99.99% availability with graceful degradation during partial outages.

Below the Line (Out of Scope):

The system should comply with data privacy regulations (GDPR, CCPA).
The system should provide comprehensive monitoring, logging, and alerting.
The system should support CI/CD pipelines for continuous deployment.
The system should handle disaster recovery with RPO < 5 minutes.

Clarification Questions & Assumptions:

Platform: Mobile apps (iOS/Android) for customers and shoppers, plus web browser for customers.
Scale: 50 million monthly active users, 500K active shoppers, 10 million orders per day during peak.
Catalog Size: 500K unique products across 50K stores globally.
Location Updates: Shoppers update location every 10-15 seconds during delivery.
Inventory Sync: Near-real-time updates from partner POS systems, typically with 30-second to 5-minute latency; batch-only integrations can lag longer.
Order Size: Average 40 items per order.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

For this grocery delivery platform, we’ll build up the design sequentially, going through each functional requirement one by one. This systematic approach ensures we address all core functionality before diving into optimizations.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Customer: Users who browse products and place orders. Contains personal information, delivery addresses, payment methods, dietary preferences, and order history.

Shopper: Independent contractors who fulfill orders. Includes personal details, vehicle information, current location, availability status, active order capacity (up to 3 concurrent orders), and performance metrics (rating, completion rate, speed).

Product: Individual grocery items in the catalog. Contains name, description, images, barcode, price, nutritional information, category, and brand. Products are store-specific since pricing and availability vary by location.

Store: Partner grocery stores where shopping occurs. Includes location, operating hours, store layout (aisle information), supported delivery zones, and POS system integration details.

Order: A customer’s grocery order from placement to delivery. Contains items, quantities, store location, delivery address, time slot, assigned shopper, current status, substitutions made, and final price.

Cart: Temporary storage for items before checkout. Tracks selected products, quantities, special instructions, and substitution preferences. Persists for authenticated users, session-based for guests.

Inventory: Real-time stock levels for products at specific stores. Tracks available quantity, reserved quantity (for in-progress checkouts), last sync timestamp, and predicted stockout time.

Delivery Slot: Time windows available for delivery (e.g., 2-hour windows). Contains start/end times, capacity (maximum orders), current bookings, and surge pricing multiplier.

Substitution: Replacement for an out-of-stock item. Links original product to substitute, includes price difference, customer approval status, and learning data for future recommendations.

API Design

Product Search Endpoint: Used by customers to search for products with filters and autocomplete.

GET /catalog/search -> ProductList
Query params: {
  q: string,
  storeId: string,
  filters: string[],
  page: number
}

Get Product Details Endpoint: Retrieves detailed information about a specific product including real-time availability.

GET /catalog/products/:productId -> Product
Query params: {
  storeId: string
}

Add to Cart Endpoint: Adds items to the customer’s shopping cart with real-time validation.

POST /cart/items -> Cart
Body: {
  productId: string,
  quantity: number,
  storeId: string
}

Place Order Endpoint: Confirms order placement after cart validation and initiates shopper matching.

POST /orders -> Order
Body: {
  cartId: string,
  deliverySlotId: string,
  deliveryAddress: Address,
  paymentMethodId: string
}

Update Shopper Location Endpoint: Used by shoppers to send real-time GPS coordinates during shopping and delivery.

POST /shoppers/location -> Success/Error
Body: {
  lat: number,
  long: number
}

Note: The shopperId is present in the authentication token and not in the request body.

Accept Order Assignment Endpoint: Allows shoppers to accept or decline assigned orders.

PATCH /orders/:orderId/assignment -> Order
Body: {
  action: "accept" | "decline"
}

Scan Item Endpoint: Used by shoppers to verify they’re picking the correct product during shopping.

POST /orders/:orderId/scan -> ScanResult
Body: {
  barcode: string,
  quantity: number
}

Request Substitution Approval Endpoint: Sends substitution suggestions to customers when items are unavailable.

POST /orders/:orderId/substitutions -> SubstitutionRequest
Body: {
  originalItemId: string,
  suggestions: Product[]
}

Approve Substitution Endpoint: Customers approve or reject suggested substitutions in real-time.

PATCH /orders/:orderId/substitutions/:requestId -> Order
Body: {
  action: "approve" | "reject" | "refund",
  selectedProductId: string?
}

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Customers should be able to browse and search products from multiple stores

The core components necessary for product discovery are:

Customer Client: Mobile and web applications for browsing and ordering. Provides rich product search, filtering, and cart management interfaces.
API Gateway: Entry point for all client requests. Handles authentication, rate limiting, request routing, and SSL termination. Routes requests to appropriate microservices.
Catalog Service: Manages the product catalog with search, filtering, and browsing capabilities. Integrates with Elasticsearch for full-text search with autocomplete, faceted filtering (organic, price ranges, dietary restrictions), and category navigation. Caches frequently accessed products in Redis for sub-100ms response times.
Inventory Service: Maintains real-time stock levels across all stores. Receives inventory updates from partner POS systems, provides availability checks during browsing, and manages inventory reservations during checkout.
Store Integration Hub: Adapters for various POS systems (Square, NCR, SAP, custom systems). Normalizes inventory data from different formats into a unified schema and publishes updates to Kafka topics.
Database: PostgreSQL for structured product data, store information, and transactional consistency. Elasticsearch for search indices with near-real-time product availability.
CDN: Content Delivery Network for serving product images, optimizing load times globally.

Product Search Flow:

The customer enters a search query or browses categories in the client app, which sends a GET /catalog/search request.
The API Gateway authenticates the request and forwards it to the Catalog Service.
The Catalog Service queries Elasticsearch for matching products, applying filters and ranking based on relevance and popularity.
For each product in results, the service checks inventory availability from the Inventory Service (cached in Redis).
Results are enriched with real-time stock status (“In Stock”, “Low Stock”, “Out of Stock”) and returned to the client.
Product images are served directly from the CDN for fast loading.

2. Customers should be able to add items to cart and place orders

We extend the design to support cart management and order placement:

Cart Service: Manages shopping cart state. For authenticated users, carts persist in PostgreSQL. For guest users, carts are stored in Redis with 24-hour TTL. Provides real-time price and availability validation before checkout.
Order Service: Handles order lifecycle from placement through delivery. Creates orders, manages state transitions, coordinates with inventory for reservations, and triggers matching workflows.
Payment Service: Integrates with payment processors (Stripe, PayPal) for authorization and capture. Authorizes payment during order placement but only captures after successful delivery.

Order Placement Flow:

Customer reviews their cart and selects a delivery time slot, then clicks “Place Order”, sending a POST request with cart details.
The Cart Service validates all items are still available and prices haven’t changed significantly.
The Inventory Service attempts to reserve all items atomically. If any item is unavailable, the entire reservation fails and the customer is notified.
The Payment Service authorizes (but doesn’t capture) the payment amount.
The Order Service creates a new order record with status “PLACED” and publishes an “order.placed” event to Kafka.
The order is queued for shopper assignment while the customer receives confirmation.

3. Orders should be matched with nearby available shoppers

We introduce components for intelligent shopper matching:

Shopper Client: Mobile app for shoppers to receive assignments, navigate stores, communicate with customers, and complete deliveries.
Shopper Service: Manages shopper profiles, availability status, location tracking, and capacity (maximum 3 concurrent orders). Maintains performance metrics like ratings and completion rates.
Matching Engine: Core algorithm for assigning orders to shoppers. Considers proximity (shoppers within 5 miles of store), availability, current batch opportunities (grouping nearby deliveries), ratings, and fairness (preventing cherry-picking). Uses geospatial queries against Redis for efficient proximity searches.
Notification Service: Sends push notifications to shoppers via APNs (iOS) and FCM (Android) when new orders are assigned.

Shopper Assignment Flow:

When an order is placed, the Order Service publishes to a Kafka topic consumed by the Matching Engine.
The Matching Engine queries for eligible shoppers near the order’s store using Redis geospatial commands.
For each candidate shopper, it calculates a compatibility score based on distance, rating, current batch potential (can this order be grouped with their active orders?), and time since last assignment (fairness).
The engine attempts to acquire a distributed lock on the top-ranked shopper using Redis (prevents double-assignment).
If successful, it sends a push notification to the shopper with order details and a 30-second timeout to accept or decline.
If accepted, the order status updates to “ASSIGNED”. If the shopper declines or the timeout expires, the lock is released and the next shopper is tried.
The process continues until a match is found or the maximum number of retries is exceeded.

4. Shoppers fulfill orders with real-time inventory feedback

The shopping and delivery experience requires:

Shopping Session Manager: Coordinates the active shopping process. Generates optimized shopping lists sorted by aisle for efficiency, tracks item scanning progress, and manages real-time updates to customers.
Substitution Service: Machine learning-powered recommendations for out-of-stock items. Considers product attributes (category, brand, size), past customer substitution preferences, price similarity, and what other customers accepted. Manages approval workflows with 2-minute timeouts.
Delivery Service: Route optimization for shoppers handling multiple orders. Integrates with Google Maps API for real-time traffic data, calculates ETAs, and verifies delivery completion (photo proof, geofencing).
Communication Service: WebSocket connections for real-time bidirectional communication between customers and shoppers. Enables instant substitution approvals, delivery updates, and chat functionality.

Shopping Flow:

Shopper accepts the assignment and navigates to the store. They mark “Start Shopping” in the app.
The Shopping Session Manager generates an optimized list sorted by store aisle layout for efficient picking.
As the shopper picks items, they scan barcodes. The system verifies against the order and updates progress in real-time to the customer.
When an item is out of stock, the shopper marks it unavailable. The Substitution Service immediately generates 2-3 alternatives based on similarity and customer history.
A push notification is sent to the customer with substitution options. They have 2 minutes to approve, reject, or request a refund.
If the customer doesn’t respond, the system applies their default preference (learned from past orders).
After all items are collected, the shopper proceeds to checkout and marks “Shopping Complete”.
The Delivery Service calculates the optimal route if handling multiple orders and provides turn-by-turn navigation.
Real-time GPS updates flow to customers via WebSocket connections, showing shopper location and ETA.
Upon arrival, the shopper verifies delivery with a photo and geolocation stamp. The Payment Service captures the final payment amount.

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements via deep dives. These are the critical areas that separate good designs from great ones.

Deep Dive 1: How do we handle real-time inventory synchronization across thousands of stores with different POS systems?

Managing accurate inventory across 50K stores using various POS systems (Square, NCR, SAP, proprietary systems) while preventing overselling is one of the most challenging aspects of the system.

Challenge 1: POS System Integration Complexity

Different stores use different POS systems with varying capabilities:

Some provide real-time webhooks for inventory changes.
Some only support periodic batch exports (FTP, API polling every 5-30 minutes).
Some require database replication or CDC (Change Data Capture).
Data formats vary widely (JSON, XML, CSV, proprietary formats).

Challenge 2: Preventing Overselling

Inventory can change between when a customer sees “In Stock” and when they checkout. Multiple customers might try to buy the last unit simultaneously.

Challenge 3: Handling Update Volume

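Update volume scales with the number of store-product inventory records. For every 25 million records, 1% changing per minute means 250K updates/minute, or roughly 4,000 updates/second. With 500K products across 50K stores, the upper bound is 25 billion records if every store carried the full catalog, so the actual volume can be far higher than that baseline.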

Solution: Integration Hub with Event-Driven Architecture

Store Integration Hub Design:

The Integration Hub sits between partner stores and our system, providing:

Adapter Pattern: For each POS system type, we implement a specific adapter that translates their data format into our normalized schema. Each adapter handles authentication, data fetching/receiving, format transformation, and error handling specific to that system.

Normalized Events: All adapters publish to a Kafka topic called “inventory.updates” with a standard schema containing storeId, productId, quantity, timestamp, and source. Kafka partitions by storeId ensuring all updates for a store are ordered.

Deduplication: Use Redis to track processed update IDs with 1-hour TTL, preventing duplicate processing if a POS system sends the same update multiple times.
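The adapter idea above can be sketched in a few lines. This is a minimal illustration, not the actual integration code: the two adapter classes, their input field names ("store", "sku", "qty", "on_hand", and so on), and the epoch-second timestamp unit are all hypothetical. Only the normalized fields (storeId, productId, quantity, timestamp, source) come from the schema described above.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InventoryUpdate:
    """Normalized event; fields mirror the schema described in the text."""
    store_id: str
    product_id: str
    quantity: int
    timestamp: int  # epoch seconds (assumed unit)
    source: str


class JsonWebhookAdapter:
    """Hypothetical adapter for a POS that pushes one JSON object per change."""
    source = "json-webhook"

    def normalize(self, payload: str) -> List[InventoryUpdate]:
        body = json.loads(payload)
        return [InventoryUpdate(body["store"], body["sku"], int(body["qty"]),
                                int(body["ts"]), self.source)]


class CsvBatchAdapter:
    """Hypothetical adapter for a POS that exports periodic CSV files."""
    source = "csv-batch"

    def normalize(self, payload: str) -> List[InventoryUpdate]:
        rows = csv.DictReader(io.StringIO(payload))
        return [InventoryUpdate(r["store_id"], r["product_id"], int(r["on_hand"]),
                                int(r["exported_at"]), self.source) for r in rows]


def partition_key(update: InventoryUpdate) -> str:
    # Keying by store keeps every update for one store in one ordered partition.
    return update.store_id
```

Downstream consumers only ever see InventoryUpdate, so supporting a new POS type means adding one adapter class rather than touching the Inventory Service.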

Integration Flow:

POS systems send updates via their native mechanism (webhook, file export, DB replication).
The appropriate adapter normalizes the data and publishes to Kafka.
Multiple Inventory Service instances consume from Kafka partitions in parallel.
Each update is deduplicated using Redis before processing.
Updates are written to PostgreSQL with optimistic locking (checking timestamp to ensure we don’t apply older updates).
Changed inventory is pushed to Redis cache with 5-minute TTL for fast reads.
Updates are indexed in Elasticsearch asynchronously for search availability.
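The deduplication and timestamp-guard steps of this flow can be sketched with an in-memory stand-in. The class and method names here are hypothetical; in the design above the dedup set would live in Redis with a TTL and the rows in PostgreSQL, where the guard would typically be a conditional UPDATE on the last sync timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class StockRow:
    quantity: int
    last_sync_ts: int


class InventoryStore:
    """In-memory stand-in for the PostgreSQL table and the Redis dedup keys."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, str], StockRow] = {}
        self.seen_update_ids: Set[str] = set()  # Redis keys with a TTL in practice

    def apply(self, update_id: str, store_id: str, product_id: str,
              quantity: int, ts: int) -> str:
        # Step 1: drop updates that were already processed.
        if update_id in self.seen_update_ids:
            return "duplicate"
        self.seen_update_ids.add(update_id)

        # Step 2: timestamp guard, so an older update never overwrites a newer one.
        key = (store_id, product_id)
        row = self.rows.get(key)
        if row is not None and ts <= row.last_sync_ts:
            return "stale"

        self.rows[key] = StockRow(quantity, ts)
        return "applied"
```

For example, an update stamped 90 that arrives after one stamped 100 is reported as "stale" and leaves the stored quantity untouched.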

Inventory Reservation System:

To prevent overselling, we implement a reservation mechanism:

When a customer starts checkout, the Inventory Service attempts to reserve items:

For each item, it uses SELECT ... FOR UPDATE in PostgreSQL to lock the inventory row.
It checks if available quantity (total - reserved) is sufficient.
If yes, it increments the reserved_quantity and creates a Reservation record with a 10-minute expiration.
If no, checkout fails and the customer is notified.

The reservation is temporary to handle abandoned carts. A background job runs every minute to release expired reservations (checking expiration timestamp) and decrement reserved_quantity.

When an order is confirmed and assigned to a shopper, the reservation converts to a “committed” state. When shopping completes, the actual quantity is decremented based on what was purchased.
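The reserve, expire, and commit accounting described above can be modeled in memory as follows. This is a sketch under stated assumptions: a per-row threading.Lock stands in for the PostgreSQL row lock, time is passed in explicitly so the expiry logic is easy to follow, and all class and method names are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Reservation:
    product_id: str
    quantity: int
    expires_at: float
    state: str = "held"  # held -> committed, or released on expiry


class InventoryRow:
    def __init__(self, total: int) -> None:
        self.total = total
        self.reserved = 0
        self.lock = threading.Lock()  # stands in for SELECT ... FOR UPDATE


class ReservationBook:
    HOLD_SECONDS = 600  # 10-minute expiration from the text

    def __init__(self, stock: Dict[str, int]) -> None:
        self.rows = {pid: InventoryRow(qty) for pid, qty in stock.items()}
        self.reservations: List[Reservation] = []

    def reserve(self, product_id: str, quantity: int, now: float) -> Optional[Reservation]:
        row = self.rows[product_id]
        with row.lock:
            # Available quantity is total minus what is already reserved.
            if row.total - row.reserved < quantity:
                return None
            row.reserved += quantity
        res = Reservation(product_id, quantity, now + self.HOLD_SECONDS)
        self.reservations.append(res)
        return res

    def commit(self, res: Reservation) -> None:
        # A committed reservation is no longer eligible for expiry.
        res.state = "committed"

    def release_expired(self, now: float) -> int:
        """Background-job step: free holds whose expiration has passed."""
        released = 0
        for res in self.reservations:
            if res.state == "held" and res.expires_at <= now:
                row = self.rows[res.product_id]
                with row.lock:
                    row.reserved -= res.quantity
                res.state = "released"
                released += 1
        return released
```

With one unit in stock, a second reserve call fails until the first hold either expires and is released by the background job, or is committed and later decremented from the real stock.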

Predictive Inventory:

To improve customer experience, we use machine learning to predict stockouts before they happen:

A model analyzes historical data (current stock level, average daily sales, day of week, upcoming holidays, recent trend) to predict stockout probability and estimated hours until stockout. If probability exceeds 70% and time is under 4 hours, the product is marked “Low Stock” in the catalog to set customer expectations. This reduces frustration from out-of-stock situations during shopping.

Deep Dive 2: How do we optimize shopper-order matching with batch picking for maximum efficiency?

Efficient matching is critical for unit economics. Batch picking (one shopper handling 2-3 orders simultaneously) significantly improves efficiency but adds complexity.

Challenge: Batch Compatibility

Not all orders can be batched together. We need to consider:

Store proximity: Ideally same store, acceptable if stores are close.
Delivery location clustering: Deliveries should be in similar areas.
Time window overlap: Delivery windows must overlap by at least 60 minutes.
Cart size compatibility: Total items shouldn’t exceed 80 for manageability.
Route efficiency: Actual delivery route must be reasonable.

Solution: Multi-Factor Scoring Algorithm

Batch Compatibility Scoring:

When a new order comes in, the Matching Engine evaluates whether it can be added to active shoppers’ existing batches:

Store Score: If all orders (existing + new) are from the same store, score 1.0. If from different stores within 1 mile, score 0.7. Otherwise 0.3.

Location Score: Calculate the maximum distance between any two delivery addresses in the batch. Use a scoring function: max(0, 1 - distance/10 miles). Batches with deliveries 10 or more miles apart score zero.

Time Window Score: Calculate the overlap between all orders’ delivery windows. If the overlap is at least 60 minutes, score 1.0. Less overlap gets lower scores. No overlap scores 0.0.

Size Score: Sum total items across all orders. If under 60 items, score 1.0. Between 60-80 items, score 0.7. Over 80 items, score 0.3 (getting unwieldy).

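Route Score: Run a quick TSP (Traveling Salesman Problem) approximation to estimate the optimized delivery route, then compare the total batch route time to the sum of the individual routes. An efficiency ratio of 0.7 means the batch takes 70% of the time of separate trips, which is excellent. Lower ratios are better, so the ratio must be inverted or rescaled before it enters the weighted average, where higher scores are better.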

Overall Compatibility: Weighted average of these scores: 0.2×store + 0.25×location + 0.25×time + 0.15×size + 0.15×route. An overall score above 0.7 indicates good batch compatibility.
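The scoring rules above translate almost directly into code. In this sketch the function names are hypothetical, the linear ramp for partial time-window overlap is an assumption (the text only says less overlap scores lower), and the route score is taken as an already-normalized input in [0, 1] because the text does not specify how the efficiency ratio maps to a score.

```python
def store_score(same_store: bool, max_store_distance_miles: float) -> float:
    if same_store:
        return 1.0
    return 0.7 if max_store_distance_miles <= 1.0 else 0.3


def location_score(max_delivery_distance_miles: float) -> float:
    return max(0.0, 1.0 - max_delivery_distance_miles / 10.0)


def time_window_score(overlap_minutes: float) -> float:
    # Assumption: score ramps linearly from 0 at no overlap to 1.0 at 60 minutes.
    if overlap_minutes <= 0:
        return 0.0
    return min(1.0, overlap_minutes / 60.0)


def size_score(total_items: int) -> float:
    if total_items < 60:
        return 1.0
    if total_items <= 80:
        return 0.7
    return 0.3


def compatibility(store: float, location: float, time: float,
                  size: float, route: float) -> float:
    # Weights from the text; they sum to 1.0.
    return 0.2 * store + 0.25 * location + 0.25 * time + 0.15 * size + 0.15 * route


def is_batchable(score: float, threshold: float = 0.7) -> bool:
    return score > threshold


# Example: same store, deliveries 2 miles apart, 90-minute overlap, 55 items,
# and a route score of 0.8 supplied by the caller.
example = compatibility(store_score(True, 0.0), location_score(2.0),
                        time_window_score(90), size_score(55), 0.8)
```

Keeping each factor in its own function makes it easy to tune one rule, such as the item-count bands, without touching the weights.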

Shopper Selection Algorithm:

For each new order, the Matching Engine:

Queries Redis GEO commands to find all shoppers within 5 miles of the store.
Filters by availability (online, not on break) and capacity (fewer than 3 active orders).
For each candidate, calculates a multi-factor score:

Distance Score: Closer shoppers score higher. Use max(0, 1 - distance/5 miles) formula.

Performance Score: Weighted combination of shopper rating (out of 5), acceptance rate, and completion rate. This ensures reliable shoppers are preferred.

Batch Opportunity Score: If the order can be batched with the shopper’s active orders (compatibility > 0.7), calculate the potential earnings per hour for the batch. Higher earnings mean better utilization. This score can be substantial (up to 0.3 bonus) to strongly prefer batching.

Time Score: Compare estimated completion time to the customer’s desired delivery window. Orders delivered within the window score higher.

Fairness Score: Track time since the shopper’s last order assignment. Shoppers who haven’t received orders recently get a boost that grows with idle time and is capped at 2 hours. This prevents high-rated shoppers in optimal locations from monopolizing orders while others sit idle.

Overall Score: Weighted sum with batch opportunities getting significant weight.

Rank shoppers by total score descending.
Attempt to acquire a distributed lock on the top shopper.
If successful, send notification and wait for response.
If declined or timeout, release lock and try the next shopper.
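The filter-and-rank portion of this algorithm can be sketched as below. Several things are assumptions made for illustration: the haversine function stands in for the Redis GEO radius query, the 0.5/0.3/0.2 weights are hypothetical (the text does not give the weights), the performance score uses only the rating, and the batch and time scores are left out to keep the example short.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

EARTH_RADIUS_MILES = 3958.8
SEARCH_RADIUS_MILES = 5.0
MAX_ACTIVE_ORDERS = 3
MAX_FAIRNESS_MINUTES = 120.0  # fairness boost capped at 2 hours


@dataclass
class Shopper:
    shopper_id: str
    lat: float
    lon: float
    online: bool
    active_orders: int
    rating: float        # out of 5
    idle_minutes: float  # time since last assignment


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def score(shopper: Shopper, distance_miles: float) -> float:
    distance_score = max(0.0, 1.0 - distance_miles / SEARCH_RADIUS_MILES)
    performance_score = shopper.rating / 5.0
    fairness_score = min(shopper.idle_minutes, MAX_FAIRNESS_MINUTES) / MAX_FAIRNESS_MINUTES
    # Hypothetical weights; the text does not specify them.
    return 0.5 * distance_score + 0.3 * performance_score + 0.2 * fairness_score


def rank_candidates(store_lat: float, store_lon: float,
                    shoppers: List[Shopper]) -> List[Tuple[str, float]]:
    ranked = []
    for s in shoppers:
        # Availability and capacity filters.
        if not s.online or s.active_orders >= MAX_ACTIVE_ORDERS:
            continue
        d = haversine_miles(store_lat, store_lon, s.lat, s.lon)
        if d > SEARCH_RADIUS_MILES:
            continue
        ranked.append((s.shopper_id, score(s, d)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

With these illustrative weights, a shopper about 1.4 miles away who has been idle for two hours can outrank a closer shopper who was just assigned an order, which is the fairness effect described above.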

Distributed Locking for Atomic Assignment:

Use Redis SET with NX and EX flags: SET lock:shopper:{shopperId} {orderId} EX 30 NX

This atomically sets the lock only if it doesn’t exist and sets a 30-second expiration. This prevents race conditions where two orders try to assign the same shopper simultaneously.
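The lock semantics can be illustrated without a running Redis server. The FakeRedis class below is a hypothetical in-memory stand-in that mimics SET with NX and EX; with the redis-py client the acquire step would be a call such as r.set(key, value, nx=True, ex=30). The helper names are assumptions for this sketch.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FakeRedis:
    """Tiny in-memory stand-in that mimics SET key value NX EX ttl semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set_nx_ex(self, key: str, value: str, ttl_seconds: float) -> bool:
        now = self._clock()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # key exists and has not expired: NX fails
        self._data[key] = (value, now + ttl_seconds)
        return True

    def get(self, key: str) -> Optional[str]:
        current = self._data.get(key)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]

    def delete_if_equal(self, key: str, value: str) -> bool:
        # Only the holder may release. Against real Redis this compare-and-delete
        # must be atomic, which is usually done with a small Lua script.
        if self.get(key) == value:
            del self._data[key]
            return True
        return False


def try_lock_shopper(store: FakeRedis, shopper_id: str, order_id: str,
                     ttl_seconds: float = 30) -> bool:
    return store.set_nx_ex(f"lock:shopper:{shopper_id}", order_id, ttl_seconds)


def release_shopper(store: FakeRedis, shopper_id: str, order_id: str) -> bool:
    return store.delete_if_equal(f"lock:shopper:{shopper_id}", order_id)
```

Storing the orderId as the lock value lets the release step confirm that the caller still owns the lock. Since the lock TTL equals the 30-second acceptance timeout, an accept that arrives right at the deadline may find the lock already expired, so it seems prudent to check that the lock still holds the same orderId before confirming the assignment.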

Deep Dive 3: How do we manage real-time communication for substitutions and maintain shopping session state?

Real-time substitution approval is critical for customer satisfaction but introduces complexity with human-in-the-loop latency.

Challenge: Low-Latency Bidirectional Communication

Traditional request-response HTTP is insufficient for real-time updates. We need instant delivery of:

Shopper location updates to customers.
Substitution requests to customers.
Customer approvals to shoppers.
Item scanning progress to customers.

Challenge: Handling Timeouts and Fallbacks

Customers may not respond to substitution requests within reasonable timeframes (network issues, didn’t see notification, away from phone). We can’t have shoppers waiting indefinitely.

Solution: WebSocket Connections with Timeout Workflows

WebSocket Architecture:

When an order enters “SHOPPING” status, establish WebSocket connections:

Customer Client connects to Communication Service with orderId in connection metadata.
Shopper Client connects with shopperId in connection metadata.
Communication Service maintains an in-memory map of active connections.

Substitution Workflow:

Shopper marks an item out-of-stock in the app.
The Inventory Service updates stock status based on this real-world feedback.
The Substitution Service generates candidates:
- Fetch similar products based on category, brand, and attributes.
- Apply collaborative filtering (what did other customers accept?).
- Filter by price (substitutes should be similar or lower price, or clearly marked if higher).
- Rank by similarity score.
Top 2-3 suggestions are packaged into a SubstitutionRequest and sent to customer via WebSocket.
Simultaneously, a push notification is sent via APNs/FCM in case the customer doesn’t have the app open.
A timeout timer starts (2 minutes).
Three possible outcomes:

Customer Approves: The selected substitute is added to the order, original item is marked replaced, and the shopper receives immediate notification via WebSocket with the substitute’s aisle location.

Customer Rejects/Requests Refund: The item is removed from the order, price is adjusted, and the shopper is notified to skip it.

Timeout Expires: The system looks up the customer’s default substitution preference from their profile (learned from past orders). Common patterns include “auto-approve similar items within 10% price”, “always refund”, or “prefer specific brands”. The appropriate action is taken automatically.
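The three outcomes can be expressed as one resolution function. This is a sketch: the preference identifiers ("auto_approve_within_10_percent", "always_refund"), the result labels, and the Suggestion type are hypothetical names, and the brand-preference pattern is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Suggestion:
    product_id: str
    price: float


def resolve_substitution(
    original_price: float,
    suggestions: List[Suggestion],          # already ranked by similarity
    response: Optional[Tuple[str, Optional[str]]],  # None means the timeout expired
    default_preference: str,
) -> Tuple[str, Optional[str]]:
    """Return ("substitute", product_id) or ("remove_item", None)."""
    if response is not None:
        action, selected_id = response
        if action == "approve":
            return ("substitute", selected_id)
        # "reject" and "refund" both remove the item and adjust the price.
        return ("remove_item", None)

    # Timeout path: fall back to the customer's stored default.
    if default_preference == "auto_approve_within_10_percent":
        for s in suggestions:
            if abs(s.price - original_price) / original_price <= 0.10:
                return ("substitute", s.product_id)
    return ("remove_item", None)
```

For a $4.00 original with suggestions priced $5.00 and $4.30, a timeout under the auto-approve preference picks the $4.30 item, because the $5.00 one is 25% more expensive and falls outside the 10% band.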

Session State Management:

Shopping sessions are stateful (items scanned, substitutions made, current progress) but must be durable in case of service restarts:

Active session state is kept in Redis with frequent updates (sub-second latency for reads/writes).
Every 30 seconds, session checkpoints are written to PostgreSQL for durability.
If a Communication Service instance crashes, the shopper reconnects to another instance which loads state from Redis.
WebSocket reconnection uses exponential backoff and session resumption tokens to avoid dropping updates.

Event Sourcing for Audit Trail:

All shopping events (item scanned, item marked unavailable, substitution requested, substitution approved) are published to Kafka. This provides:

Complete audit trail for dispute resolution.
Analytics on substitution patterns.
Ability to replay sessions for debugging.
Training data for ML models.

Deep Dive 4: How do we optimize delivery routes for batch orders and provide accurate ETAs?

When a shopper handles multiple orders, route optimization becomes critical. Delivering to three customers 5 miles apart in the wrong order could add 20+ minutes of unnecessary driving.

Challenge: Traveling Salesman Problem with Constraints

Finding the optimal delivery sequence is a variant of the Traveling Salesman Problem (TSP), which is NP-hard. Additionally, we have constraints:

Time windows: Each delivery must arrive within the promised 2-hour window.
Shopping time: The shopper has not started shopping when the initial ETA is calculated.
Real-time traffic: Conditions change throughout the journey.

Solution: Constrained Optimization with Real-Time Updates

Route Optimization Algorithm:

When a batch assignment is confirmed (shopper accepts multiple orders), the Delivery Service:

Builds a graph with the store as starting point and all delivery addresses as nodes.
Queries Google Maps Distance Matrix API to get drive times between all pairs of locations, accounting for current traffic conditions.
Applies a constraint solver (using OR-Tools or similar) to solve the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW).

The solver is configured to:

Minimize total drive time.
Respect delivery time windows (soft constraint with penalties).
Account for service time at each stop (5 minutes for parking and delivery).
Ensure the route is feasible within shopper’s available time.

Returns an optimized sequence like: Store -> Delivery A -> Delivery C -> Delivery B
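Because a batch holds at most three deliveries, the sequencing step can be illustrated with exhaustive search instead of a constraint solver: three stops give only six possible orders. The sketch below is a swapped-in simplification, not the OR-Tools approach named above. The function name, the drive-time dictionary format, and the lateness penalty weight are assumptions; time windows are treated as soft constraints with a penalty, and each stop takes 5 minutes of service time.

```python
from itertools import permutations
from typing import Dict, List, Tuple

SERVICE_MINUTES = 5.0  # parking and delivery at each stop


def best_route(
    drive_minutes: Dict[str, Dict[str, float]],   # drive_minutes[a][b]: a -> b
    windows: Dict[str, Tuple[float, float]],      # stop -> (earliest, latest)
    depart_minute: float,
    late_penalty: float = 10.0,                   # assumed cost per late minute
) -> List[str]:
    best_cost = None
    best_order: Tuple[str, ...] = ()
    for order in permutations(windows):
        clock, here = depart_minute, "store"
        drive_total, lateness = 0.0, 0.0
        for stop in order:
            leg = drive_minutes[here][stop]
            drive_total += leg
            clock += leg
            earliest, latest = windows[stop]
            clock = max(clock, earliest)            # wait if early
            lateness += max(0.0, clock - latest)    # soft time-window violation
            clock += SERVICE_MINUTES
            here = stop
        cost = drive_total + late_penalty * lateness
        if best_cost is None or cost < best_cost:
            best_cost, best_order = cost, order
    return list(best_order)


# Illustrative drive times in minutes (made-up numbers).
DRIVE = {
    "store": {"A": 10, "B": 20, "C": 15},
    "A": {"B": 15, "C": 5},
    "B": {"A": 15, "C": 8},
    "C": {"A": 5, "B": 8},
}
wide = {"A": (0, 120), "B": (0, 120), "C": (0, 120)}
tight_b = {"A": (0, 120), "B": (0, 25), "C": (0, 120)}
```

With wide windows the cheapest sequence is A, C, B. Tightening B's window so it must be reached by minute 25 moves B to the front, even though that route involves more driving.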

ETA Calculation:

Accurate ETAs require combining multiple time components:

Shopping Time: Estimated based on number of remaining items and shopper’s historical speed. A machine learning model trained on past shopping sessions predicts time based on item count, store layout, time of day, and shopper experience level. Updated in real-time as items are scanned.

Checkout Time: Average 5-7 minutes based on store and time of day. Longer during peak hours with register queues.

Drive Time: Fetched from Google Maps API with current traffic. Updated every 2-3 minutes during delivery with fresh traffic data.

Delivery Time: 5 minutes per stop for parking, walking, and handoff. More for apartment buildings with complex access.

Batch Delay: If order is second or third in the batch, add time for previous deliveries.

The total ETA is: shopping_time + checkout_time + drive_time_to_customer + sum(delivery_times_before_this_customer) + this_delivery_time.
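As a worked example of this formula, the sketch below computes the ETA for each stop in a batch. The function and parameter names are hypothetical, the drive time to a customer is taken as the cumulative drive time along the route up to that stop, and every stop is assumed to take the same 5-minute delivery time.

```python
from typing import List


def eta_minutes(shopping: float, checkout: float, drive_legs: List[float],
                stop_index: int, delivery_per_stop: float = 5.0) -> float:
    """Minutes from now until delivery at stop_index is complete.

    drive_legs[i] is the drive time to stop i from the previous location
    (the store for i == 0).
    """
    drive_to_customer = sum(drive_legs[: stop_index + 1])
    earlier_deliveries = stop_index * delivery_per_stop
    return shopping + checkout + drive_to_customer + earlier_deliveries + delivery_per_stop


# 30 minutes of shopping, 6 minutes at checkout, legs of 10, 5, and 8 minutes.
first_stop = eta_minutes(30, 6, [10, 5, 8], stop_index=0)
third_stop = eta_minutes(30, 6, [10, 5, 8], stop_index=2)
```

The first customer's ETA is 30 + 6 + 10 + 0 + 5 = 51 minutes, while the third customer's is 30 + 6 + 23 + 10 + 5 = 74 minutes, which shows how batch position alone adds more than 20 minutes.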

Real-Time ETA Updates:

As the shopper progresses:

Every 30 seconds during shopping, recalculate shopping_time based on items remaining.
Every 2 minutes during delivery, refetch drive times with current traffic.
If ETA changes by more than 5 minutes, push update to customer via WebSocket.
Display confidence intervals to customers: “Arrives between 2:15 PM - 2:30 PM” rather than false precision.

Dynamic Rerouting:

If traffic conditions change dramatically (accident, road closure), or if a delivery is unexpectedly fast or slow:

Recalculate the optimal route for remaining deliveries.
If reordering deliveries significantly improves time, suggest the new sequence to the shopper.
Always respect time window constraints - never reorder in a way that causes late deliveries.

Deep Dive 5: How do we dynamically manage delivery slot capacity to maximize throughput while meeting delivery promises?

Delivery slots are a constrained resource. Overcommitting leads to late deliveries and poor customer experience. Undercommitting leaves capacity unused and reduces revenue.

Challenge: Predicting Shopper Availability

Shopper availability varies by:

Time of day (more shoppers available during evenings and weekends).
Day of week (weekends see 2-3x more shoppers).
Geographic area (urban areas have denser shopper pools).
Weather conditions (severe weather reduces availability).
Special events (concerts, sports games affect both demand and supply).

Challenge: Balancing Batch Efficiency with Capacity

Batch picking improves efficiency but reduces capacity if we wait too long to form batches. We need to balance:

Waiting for good batch matches (higher efficiency).
Quickly assigning orders to meet delivery windows (higher capacity).

Solution: Dynamic Capacity Management with Predictive Modeling

Slot Capacity Calculation:

When a customer requests available delivery slots, the Delivery Service:

Queries historical data to predict available shoppers for each future time slot:
- Use a time series model trained on past shopper availability.
- Factor in day of week, time of day, season, weather forecast, and local events.
- Output: Expected number of active shoppers per hour per delivery zone.
Estimate batch efficiency:
- Historical data shows average batch size for this zone and time (e.g., 1.8 orders per shopper on average).
- Better batching during high-density times in urban areas (2.5+ orders per shopper).
- Lower batching in suburban areas with spread-out deliveries (1.2 orders per shopper).
Calculate base capacity: predicted_shoppers × avg_batch_size
Apply utilization factor (typically 0.85) to leave buffer for:
- Shoppers declining orders.
- Longer-than-expected shopping times.
- Traffic delays.
- System resilience.
Check current bookings for each slot and calculate remaining capacity.
Apply surge pricing to high-demand slots:
- If slot is more than 85% full, apply surge multiplier (1.2-2.0x).
- Display to customers as “High demand - $X additional fee”.
- Surge pricing serves two purposes: it adds revenue, and the higher earnings it funds signal more shoppers to come online.
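The capacity steps above can be sketched as a single function. This is a minimal illustration, not the Delivery Service's actual code: the names `quote_slot` and `SlotQuote` are hypothetical, and the linear ramp from 1.2x to 2.0x between the 85% threshold and a full slot is an assumption, since the text only gives the multiplier range.

```python
import math
from dataclasses import dataclass


@dataclass
class SlotQuote:
    capacity: int            # total orders the slot can accept
    remaining: int           # orders still bookable
    surge_multiplier: float  # 1.0 means no surge


def quote_slot(predicted_shoppers: float, avg_batch_size: float, booked: int,
               utilization: float = 0.85, surge_threshold: float = 0.85,
               max_surge: float = 2.0) -> SlotQuote:
    """Compute remaining capacity and a surge multiplier for one slot."""
    # Base capacity scaled by the utilization buffer, rounded down to stay safe.
    capacity = math.floor(predicted_shoppers * avg_batch_size * utilization)
    remaining = max(capacity - booked, 0)
    # A slot with no capacity is treated as completely full.
    fill = booked / capacity if capacity > 0 else 1.0

    if fill <= surge_threshold:
        surge = 1.0
    else:
        # Assumed ramp: 1.2x just above the threshold, rising linearly to
        # max_surge when the slot is full.
        span = max(1.0 - surge_threshold, 1e-9)
        progress = min((fill - surge_threshold) / span, 1.0)
        surge = round(1.2 + progress * (max_surge - 1.2), 2)
    return SlotQuote(capacity, remaining, surge)
```

For example, with 40 predicted shoppers and 1.8 orders per shopper, the slot accepts 61 orders after the 0.85 buffer; at 50 bookings it is about 82% full, so no surge applies yet.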

Dynamic Capacity Adjustments:

Throughout the day, a background job monitors:

Actual shopper availability vs predictions (calibration).
Order completion rates and times.
Batch formation success rates.

If actual availability exceeds predictions, open more slots for upcoming time windows. If below predictions, close slots or increase surge pricing to throttle demand.
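One way to express this calibration loop is to rescale each upcoming slot's capacity by the ratio of observed to predicted shoppers. The sketch below is hedged: the function name, the 25% cap on any single adjustment, and the action labels are assumptions made for illustration.

```python
import math
from typing import Tuple


def adjust_capacity(planned: int, predicted_shoppers: float, actual_shoppers: float,
                    booked: int, max_step: float = 0.25) -> Tuple[int, str]:
    """Rescale a slot's capacity by the observed/predicted shopper ratio."""
    if predicted_shoppers <= 0:
        return max(planned, booked), "hold"  # no baseline to calibrate against
    ratio = actual_shoppers / predicted_shoppers
    # Clamp the correction so one noisy reading cannot swing capacity wildly.
    ratio = min(max(ratio, 1.0 - max_step), 1.0 + max_step)
    # Existing bookings are honored, so capacity never drops below them.
    new_capacity = max(math.floor(planned * ratio), booked)
    if new_capacity > planned:
        return new_capacity, "open_slots"
    if new_capacity < planned:
        return new_capacity, "throttle"  # close slots or raise surge pricing
    return new_capacity, "hold"
```

Clamping the ratio trades responsiveness for stability; a real system would likely tune that bound per zone.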

Slot Release Strategy:

Slots are released on a rolling basis:

Standard slots available 2-3 days in advance.
Express 1-hour slots released only 2 hours ahead when shopper density is confirmed.
During high-confidence periods (historical patterns), release slots further in advance.
During uncertainty (new market, unusual events), be more conservative.

Deep Dive 6: How do we ensure order consistency and prevent partial failures during high-scale operations?

At 10M+ orders per day, even a 0.1% failure rate means 10,000 customers with broken experiences daily. We need robust failure handling.

Challenge: Distributed Transaction Coordination

Order placement involves multiple services:

Inventory Service reserving items.
Payment Service authorizing payment.
Order Service creating the order record.
Matching Engine queueing for assignment.

If any step fails, we need to roll back the previous steps. Distributed transactions (2PC) could coordinate this, but they are slow and can deadlock at scale.

Challenge: Idempotency

Network issues can cause retries. If a customer clicks “Place Order” and it times out, they might click again. We need to ensure this doesn’t create duplicate orders.

Solution: Saga Pattern with Idempotency Keys

Saga Pattern Implementation:

Instead of a distributed transaction, we use the Saga pattern with compensating transactions:

Obtain a unique order ID and an idempotency key for the request. The key is either supplied by the client or generated by the server and returned to the client for use on retries.
Order Service attempts to create the order record with status “PENDING”.
- If duplicate idempotency key detected, return existing order (idempotent).
Inventory Service attempts reservation.
- Success: Continue to next step.
- Failure: Mark order as “FAILED_INVENTORY” and return error to customer immediately.
Payment Service attempts authorization.
- Success: Continue to next step.
- Failure: Trigger compensation - release inventory reservations, mark order “FAILED_PAYMENT”, return error.
Update order status to “PLACED” and publish to Kafka for matching.
If any step times out, retry with exponential backoff up to 3 attempts.
If retries exhausted, trigger full compensation chain.
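The happy path and the two failure branches above can be sketched in memory as follows. The service calls are plain callables standing in for the Inventory Service, Payment Service, and Kafka producer; `OrderSaga`, `StepFailed`, and the in-process dictionary of idempotency keys are hypothetical names chosen for this example. A production version would presumably persist the key under a unique constraint rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class StepFailed(Exception):
    """Raised by a service stub when a saga step is rejected."""


@dataclass
class Order:
    order_id: str
    status: str = "PENDING"


@dataclass
class OrderSaga:
    reserve: Callable[[str], None]    # Inventory Service call (stub)
    release: Callable[[str], None]    # compensation for reserve
    authorize: Callable[[str], None]  # Payment Service call (stub)
    publish: Callable[[str], None]    # e.g. produce "order.placed"
    orders_by_key: Dict[str, Order] = field(default_factory=dict)

    def place(self, idempotency_key: str, order_id: str) -> Order:
        # A repeated key returns the original order instead of a new one.
        if idempotency_key in self.orders_by_key:
            return self.orders_by_key[idempotency_key]
        order = Order(order_id)
        self.orders_by_key[idempotency_key] = order

        try:
            self.reserve(order_id)
        except StepFailed:
            order.status = "FAILED_INVENTORY"  # nothing to compensate yet
            return order

        try:
            self.authorize(order_id)
        except StepFailed:
            self.release(order_id)             # compensate the reservation
            order.status = "FAILED_PAYMENT"
            return order

        order.status = "PLACED"
        self.publish(order_id)
        return order
```

Note that a second "Place Order" click with the same key returns the first order, including a failed one, so the client sees a consistent result for that request.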

Compensation Logic:

Each service implements compensation endpoints:

Inventory Service: POST /inventory/reservations/:id/release
Payment Service: POST /payments/authorizations/:id/void

If order placement fails at any stage after inventory reservation, the Order Service calls compensation endpoints to clean up.
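The timeout handling and the compensation chain can be combined in a small generic runner. This is a sketch under stated assumptions: steps signal a timeout by raising Python's built-in `TimeoutError`, the 0.2 second base delay is illustrative, and `run_saga` and `call_with_retry` are hypothetical helper names.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def call_with_retry(step: Callable[[], None], attempts: int = 3,
                    base_delay: float = 0.2,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run step; on TimeoutError retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            step()
            return True
        except TimeoutError:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.2s, then 0.4s
    return False


def run_saga(steps: List[Step],
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run steps in order; on failure undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensation in steps:
        if call_with_retry(action, sleep=sleep):
            completed.append(compensation)
        else:
            for undo in reversed(completed):
                undo()
            return False
    return True
```

One caveat: a step that timed out may still have taken effect on the remote side, so compensation endpoints should be safe to call when there is nothing to undo. This sketch only compensates steps known to have completed.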

Kafka for Reliable Event Processing:

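After an order is successfully placed, an “order.placed” event is published to Kafka:

At-least-once delivery ensures events aren’t lost, at the cost of occasional redelivery.
Consumer groups with multiple Matching Engine instances process partitions in parallel.
Each consumer commits its offset; if it crashes, another consumer in the group resumes from the last committed offset.
Kafka transactions give exactly-once semantics for processing that stays within Kafka; side effects outside Kafka (such as shopper assignment) rely on idempotent handling to absorb redelivered events.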
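At-least-once delivery means a Matching Engine instance can see the same “order.placed” event more than once, for example when a consumer crashes after processing but before committing its offset. A common safeguard is a duplicate check in the consumer. The sketch below uses no Kafka client at all; it assumes each event carries a unique `event_id` field (a hypothetical name) and keeps seen IDs in memory, where a real deployment would need a durable store.

```python
from typing import Callable, Dict, Set


class IdempotentHandler:
    """Wraps a handler so a redelivered event is applied only once."""

    def __init__(self, handler: Callable[[Dict], None]):
        self.handler = handler
        self.seen: Set[str] = set()  # assumption: durable store in production

    def handle(self, event: Dict) -> bool:
        event_id = event["event_id"]
        if event_id in self.seen:
            return False             # duplicate delivery, skip side effects
        self.handler(event)
        self.seen.add(event_id)      # record only after the handler succeeds
        return True
```

Recording the ID after the handler runs means a crash in between leads to a retry rather than a lost event, so the handler itself should also tolerate being run twice.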

Dead Letter Queue for Failure Handling:

If matching repeatedly fails (all shoppers decline, no shoppers available):

After 3 attempts, publish to a Dead Letter Queue (DLQ).
Operations team monitors DLQ and can manually intervene.
Automated retry with surge pricing to attract shoppers.
Eventually (after 30 minutes), cancel order and refund customer with apology and discount coupon.
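The escalation ladder above can be read as a small decision function. The ordering below is one interpretation of the list, not a confirmed policy: plain retries up to the third failure, a DLQ entry on the third, surge-boosted retries afterwards, and cancellation once 30 minutes have passed. `Action` and `next_action` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    SEND_TO_DLQ = "send_to_dlq"
    RETRY_WITH_SURGE = "retry_with_surge"
    CANCEL_AND_REFUND = "cancel_and_refund"


def next_action(failed_attempts: int, minutes_since_placed: float,
                max_attempts: int = 3,
                cancel_after_minutes: float = 30.0) -> Action:
    """Pick the next step for an order the Matching Engine could not assign."""
    # The time limit wins over everything else: stop trying and refund.
    if minutes_since_placed >= cancel_after_minutes:
        return Action.CANCEL_AND_REFUND
    if failed_attempts < max_attempts:
        return Action.RETRY
    if failed_attempts == max_attempts:
        return Action.SEND_TO_DLQ
    # Already in the DLQ: keep retrying with a surge incentive for shoppers.
    return Action.RETRY_WITH_SURGE
```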

Circuit Breaker Pattern:

To prevent cascading failures:

If Payment Service is experiencing high error rates, open a circuit breaker.
Subsequent order placement attempts fail fast with clear error messages.
Allow the Payment Service to recover rather than overwhelming it with retries.
After a timeout period, move to a half-open state (let some requests through to test recovery).
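A minimal circuit breaker along these lines might look like the sketch below. It simplifies in two labeled ways: it counts consecutive failures instead of measuring an error rate, and in the half-open state it lets every request through as a probe instead of only a sample. The class and parameter names are illustrative.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("service unavailable, failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed half-open probe, or too many failures, (re)opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

The Order Service would wrap each Payment Service call in `breaker.call(...)` and translate `CircuitOpenError` into the clear, fast error message described above.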

Step 4: Wrap Up

In this design, we proposed a comprehensive system for a grocery delivery platform like Instacart. If there is extra time at the end of the interview, here are additional points to discuss:

Summary of Key Components:

We designed a production-grade system with the following core services:

Catalog Service: Product search and browsing with Elasticsearch, Redis caching, and CDN for images.
Inventory Service: Real-time synchronization from diverse POS systems, predictive stockouts, and reservation management preventing overselling.
Cart Service: Session-based and persistent carts with real-time validation.
Order Service: Order lifecycle management with saga pattern for consistency.
Shopper Service: Profile management, location tracking, and performance metrics.
Matching Engine: Intelligent batch-aware order-shopper assignment with geospatial queries and multi-factor scoring.
Shopping Session Manager: Real-time shopping coordination with barcode scanning and progress tracking.
Substitution Service: ML-powered recommendations with customer approval workflows.
Delivery Service: Route optimization and real-time ETA calculation with traffic data.
Communication Service: WebSocket-based real-time bidirectional messaging.

Key Technical Decisions:

Data Store Selection:

PostgreSQL for transactional data requiring ACID properties (orders, inventory, payments).
Redis for caching hot data, session state, geospatial queries, and distributed locking.
Elasticsearch for full-text product search with filtering and autocomplete.
Kafka for event streaming and reliable async communication between services.
S3/CDN for product images and delivery proof photos.

Consistency Models:

Strong consistency for inventory reservations (SELECT FOR UPDATE in PostgreSQL).
Eventual consistency acceptable for catalog updates and analytics.
Saga pattern for distributed transactions without 2PC overhead.
Idempotency keys for handling retries and network failures.

Scalability Approach:

Horizontal scaling of all stateless services.
Database partitioning by storeId for inventory tables.
Redis clustering for distributed caching and geospatial data.
Kafka partitioning for parallel event processing.
CDN for global image delivery.

Real-Time Communication:

WebSocket connections for low-latency bidirectional messaging.
Push notifications (APNs/FCM) for reaching customers when the app is backgrounded.
Redis Pub/Sub for broadcasting updates to multiple WebSocket servers.

Additional Features to Consider:

Personalization and Recommendations:

Machine learning models for personalized product recommendations based on purchase history.
Automatic reorder suggestions for frequently purchased items.
Dynamic pricing based on customer segments and price elasticity.

Advanced Substitution Intelligence:

Computer vision to verify product quality (produce freshness, packaging damage).
Dietary restriction enforcement (allergies, religious requirements).
Brand preference learning from implicit feedback.

Operational Excellence:

Comprehensive monitoring with Prometheus/Grafana for real-time metrics.
Distributed tracing with Jaeger to track requests across services.
Centralized logging with ELK stack for debugging and analysis.
Automated alerts for SLA violations, high error rates, and capacity issues.
A/B testing framework for matching algorithms, pricing, and UI changes.

Future Enhancements:

Autonomous Delivery Integration:

Robotics for in-store picking (automated carts following shoppers).
Self-driving vehicles for delivery in supported areas.
Drone delivery for lightweight orders in permissible zones.

Vertical Expansion:

Pharmacy delivery with prescription management and insurance coordination.
Alcohol delivery with age verification (ID scanning).
Prepared meal delivery from partner restaurants.
Same-day general merchandise delivery.

International Expansion:

Multi-currency support with real-time exchange rates.
Localized catalogs with regional products and cultural preferences.
Compliance with regional regulations (GDPR in EU, CCPA in California).
Language localization for global markets.

Shopper Tools Enhancement:

Augmented reality for product location in unfamiliar stores.
Predictive batching that suggests shoppers position themselves near high-demand stores.
Dynamic incentives for shoppers to come online during capacity shortages.

Bottlenecks and Mitigation:

Inventory Synchronization Lag:

Problem: POS updates can lag by 1-5 minutes, leading to customers seeing inaccurate stock.
Mitigation: Predictive inventory models, real-time shopper feedback, conservative stock thresholds, and clear communication of stock uncertainty to customers.

Matching Engine Latency:

Problem: Complex scoring across thousands of shoppers can take seconds.
Mitigation: Geospatial pre-filtering, pre-computed shopper scores cached in Redis, async batch compatibility checks, and parallel processing of candidate shoppers.

Substitution Approval Timeout:

Problem: Customers who do not respond within 2 minutes block shopping progress.
Mitigation: Learned default preferences, auto-approval for close matches within price range, and proactive notifications (push, SMS) to increase response rate.

Database Write Contention:

Problem: High-traffic stores cause hotspots in inventory updates.
Mitigation: Partitioning by storeId, optimistic locking instead of pessimistic, async replication to read replicas, and caching to reduce read load on primary.

WebSocket Connection Scaling:

Problem: Maintaining millions of concurrent WebSocket connections consumes server resources.
Mitigation: Dedicated WebSocket servers separate from API servers, connection pooling, Redis pub/sub for broadcasting, and automatic connection cleanup for inactive sessions.

This architecture provides a solid foundation for a grocery delivery platform that can scale to millions of users while maintaining high availability, low latency, and excellent customer experience. The design emphasizes real-time operations, intelligent automation, and robust failure handling - all critical for the complex logistics of on-demand grocery delivery.

Summary

This comprehensive guide covered the design of a grocery delivery platform like Instacart, including:

Core Functionality: Product search and browsing, cart management, order placement, intelligent shopper matching with batch optimization, real-time shopping with substitutions, and delivery tracking.
Key Challenges: Real-time inventory synchronization across diverse POS systems, preventing overselling with strong consistency, efficient shopper-order matching considering batch opportunities, real-time substitution approval workflows, route optimization for multi-stop deliveries, and dynamic delivery slot capacity management.
Solutions: Event-driven architecture with Kafka for inventory updates, adapter pattern for POS integration, reservation system with optimistic locking, geospatial queries with Redis for proximity searches, multi-factor scoring algorithm for batch-aware matching, WebSocket connections for real-time communication, constraint solving for route optimization, and predictive modeling for capacity planning.
Scalability: Microservices architecture with horizontal scaling, database partitioning, Redis clustering, Elasticsearch for distributed search, Kafka for event streaming, CDN for static assets, and saga pattern for distributed consistency without 2PC overhead.

The design demonstrates how to handle complex logistics operations at scale with real-time requirements, human-in-the-loop workflows, and strong consistency needs balanced with high availability.
