
Multi-step Processes

A comprehensive guide to managing complex multi-step processes using sagas, workflow systems, and durable execution engines...

Building reliable multi-step processes in distributed systems represents one of the most challenging aspects of modern software engineering. While idealized architectural diagrams show clean flows of sequential steps, real-world systems must contend with network failures, service timeouts, partial completions, server crashes, and the messy complexity of coordinating dozens of services to fulfill a single user request. This blog post explores the patterns, tools, and architectural approaches that enable building robust, long-running processes that can survive failures and continue where they left off.

The Challenge of Multi-step Coordination: Consider an e-commerce order fulfillment workflow. A customer places an order, triggering a sequence of operations: charge the payment method, reserve inventory from the warehouse, generate a shipping label, wait for a warehouse worker to pick and pack the item, send a confirmation email to the customer, and finally wait for the shipping carrier to pick up the package. Each step involves calling different services—a payment gateway, an inventory management system, a shipping provider’s API. Any of these services might fail, time out, or return ambiguous results. The payment gateway might take thirty seconds to respond. The inventory system might discover the item is out of stock only after the payment succeeds. A warehouse worker might take hours to pick the item.

During this orchestration, your application servers will certainly be deployed, possibly multiple times. They might crash due to bugs, infrastructure failures, or scaling events. Network partitions might cause requests to time out even though they succeeded. You might discover halfway through that you need to compensate for earlier steps—refunding a payment because inventory wasn’t available, or releasing an inventory reservation because shipping label generation failed. The supposedly simple flowchart of “do A, then B, then C” quickly becomes a tangled web of error handling, retry logic, timeout management, and state tracking scattered across your codebase.

The naive approach—orchestrating everything from a single server in a single request handler—fails immediately. What happens when the server crashes after charging the payment but before reserving inventory? When it restarts, it has no memory of the partial progress. The customer has been charged but will never receive their order. You might try adding database checkpoints after each step, persisting state so restarted servers can continue where crashed servers left off. But now you’re manually building a state machine with careful checkpoint logic, and you still haven’t solved critical problems like compensation for failures, routing callbacks from external services to the right server instance, or scaling beyond a single server.
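
To make the failure mode concrete, here is a minimal sketch of that naive single-handler orchestration in TypeScript. The service clients and types are hypothetical placeholders, not a real API; the point is only that nothing outside this one function remembers partial progress.

```ts
// Hypothetical service clients and types: placeholders, not a real API.
interface Order {
  id: string;
  customerId: string;
  items: string[];
  total: number;
  address: string;
}

declare const paymentGateway: { charge(customerId: string, amount: number): Promise<{ chargeId: string }> };
declare const inventory: { reserve(items: string[]): Promise<{ reservationId: string }> };
declare const shipping: { createLabel(address: string): Promise<{ trackingNumber: string }> };
declare const email: { sendConfirmation(customerId: string, trackingNumber: string): Promise<void> };

async function fulfillOrder(order: Order): Promise<void> {
  await paymentGateway.charge(order.customerId, order.total);
  // If the process crashes here, the customer has been charged, but nothing
  // in this function remembers that when the server restarts.
  await inventory.reserve(order.items);
  const label = await shipping.createLabel(order.address);
  await email.sendConfirmation(order.customerId, label.trackingNumber);
  // Long waits (pick and pack, carrier pickup) do not fit this model at all:
  // the handler would have to block for hours and survive every deploy.
}
```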

The Organic Evolution Toward Complexity: Teams often patch these problems incrementally, creating increasingly complex systems. They add retry logic around each service call with exponential backoff. They implement compensating transactions for every possible failure scenario—refund the payment if inventory fails, release the inventory if shipping fails, cancel the shipping label if the warehouse worker can’t find the item. They use message queues to handle callbacks from external services. They build custom database tables to track the state of in-flight orders with status columns like “payment_pending,” “inventory_reserved,” “awaiting_pickup.” They write cron jobs to find and clean up orphaned transactions that got stuck in intermediate states.

Each patch solves an immediate problem but increases overall system complexity. Business logic becomes intertwined with infrastructure concerns. The code that knows “charge payment then reserve inventory” is scattered across multiple services, database tables, queue consumers, and background jobs. Adding a new step to the workflow requires changes in multiple places. Debugging why an order got stuck requires tracing through logs across different services, examining database state at various checkpoints, and understanding subtle timing interactions. The system works, barely, but it’s brittle, hard to modify, and a constant source of production incidents.

Event Sourcing as a Foundation: A more principled approach to multi-step processes is event sourcing, which provides a foundation for reliable orchestration. Instead of storing current state—“this order is in the ‘inventory_reserved’ state”—event sourcing stores the complete sequence of events that occurred: “OrderPlaced,” “PaymentCharged,” “InventoryReserved,” “ShippingLabelCreated.” The current state can always be reconstructed by replaying these events. More importantly, events become the coordination mechanism. Each service consumes events relevant to its domain, performs its work, and emits new events that trigger subsequent steps.

The architecture centers on a durable event log, often implemented with Kafka or Redis Streams. When a customer places an order, the API service writes an “OrderPlaced” event to the log and immediately returns a response to the customer. The order is now being processed asynchronously. A payment service worker constantly monitors the log for “OrderPlaced” events. When it sees one, it calls the payment gateway. When the payment gateway responds (which might be seconds or minutes later, possibly via a webhook callback), the payment worker writes either “PaymentCharged” or “PaymentFailed” to the log. The inventory service worker monitors for “PaymentCharged” events and responds by attempting to reserve inventory, writing “InventoryReserved” or “InventoryFailed” to the log. Each step triggers the next through events.
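
As a rough illustration, here is what one such worker might look like in TypeScript using the kafkajs client. The topic name, event shapes, and the chargeCard call are hypothetical; the shape of the loop (consume an event, do the work, emit the next event) is what matters.

```ts
import { Kafka } from 'kafkajs';

// Hypothetical payment gateway call; stands in for a real client.
declare function chargeCard(customerId: string, amountCents: number): Promise<{ chargeId: string }>;

const kafka = new Kafka({ clientId: 'payment-worker', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'payment-service' });
const producer = kafka.producer();

async function emit(orderId: string, event: object): Promise<void> {
  await producer.send({
    topic: 'order-events',
    messages: [{ key: orderId, value: JSON.stringify(event) }],
  });
}

async function run(): Promise<void> {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'order-events' });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value!.toString());
      if (event.type !== 'OrderPlaced') return; // this worker only reacts to OrderPlaced

      try {
        const { chargeId } = await chargeCard(event.customerId, event.totalCents);
        await emit(event.orderId, { type: 'PaymentCharged', orderId: event.orderId, chargeId });
      } catch {
        await emit(event.orderId, { type: 'PaymentFailed', orderId: event.orderId });
      }
    },
  });
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```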

This architecture provides several powerful properties. Fault tolerance emerges naturally—if a worker crashes, another worker picks up the events it was processing. The complete event log provides perfect observability into what happened and when. Adding new functionality is straightforward—deploy a new worker that consumes relevant events and emits its own. Scaling is simple—add more workers to handle increased load. The event log serves as both the system of record and the coordination mechanism.

However, event sourcing also introduces challenges. You’re now building and operating significant infrastructure: the event log itself, worker orchestration and deployment, monitoring to ensure events are being processed, and tooling to trace event lineages and debug issues. Determining why a particular order failed requires understanding the chain of events and which workers processed them. The business logic that was once in a single request handler is now distributed across multiple workers consuming different event types. Compensating for failures requires careful design—when “InventoryFailed” occurs, you need a worker that consumes it and emits “PaymentRefundInitiated.”

Workflow Systems and Durable Execution: What we really want is a way to write code that looks like the simple sequential orchestration we started with—do A, then B, then C—but with automatic handling of failures, retries, state persistence, and long-running waits. Workflow systems and durable execution engines provide exactly this. They allow developers to describe multi-step processes at a high level while the underlying engine handles the messy details of distributed execution.

These systems fundamentally provide durable execution: the ability to write code that can move between machines and survive system failures. When a server running a workflow crashes, another server can pick up the workflow and continue from the last successful step. The workflow might wait hours or days for external events—a customer signing a document, a human approving a request, a third-party API callback—without consuming resources. The entire execution history is preserved, providing perfect audit trails.

Durable Execution Engines: Durable execution engines, exemplified by Temporal (originally built at Uber as Cadence), allow developers to write workflows as code that looks remarkably like traditional sequential programming. You write a function representing the workflow, and the engine handles all the orchestration. The key insight is that workflow code runs in a special environment that guarantees deterministic execution. Given the same inputs and history of previous activity executions, the workflow always makes identical decisions.

The workflow function orchestrates activities, which are the individual steps that perform actual work. Activities can be retried, can fail, and can run on any worker machine. Workflows are deterministic decision-makers that coordinate activities. When an activity executes, its result is recorded in a durable history database. If a workflow worker crashes, another worker can replay the workflow from the beginning, but instead of re-executing activities, it uses the recorded results from history. This replay-based recovery is what enables workflows to survive any failure.

The programming model feels natural. You await activity results, make decisions based on those results, and handle errors with standard try-catch blocks. Behind the scenes, the engine is checkpointing state, managing retries, routing work to available workers, and ensuring the workflow can resume after any failure. Workflows can also wait for external signals—when a webhook arrives or a human completes a task, the signal resumes the workflow at the exact point it was waiting, even if that was days ago.
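
A sketch of what the order workflow might look like with Temporal’s TypeScript SDK is shown below. The activity names and the activities module are hypothetical, and the retry policy and timeout are illustrative values rather than recommendations.

```ts
import { proxyActivities } from '@temporalio/workflow';
// Hypothetical activities module; each activity wraps one external call.
import type * as activities from './activities';

const { chargePayment, refundPayment, reserveInventory, createShippingLabel, sendConfirmationEmail } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '1 minute',
    retry: { maximumAttempts: 5 }, // illustrative, not a recommendation
  });

export async function orderFulfillment(orderId: string): Promise<string> {
  // Each awaited activity result is recorded in history; on replay after a
  // crash, these lines return the recorded results instead of re-executing.
  const chargeId = await chargePayment(orderId);
  try {
    await reserveInventory(orderId);
  } catch (err) {
    // Compensate for the earlier step before failing the workflow.
    await refundPayment(chargeId);
    throw err;
  }
  const trackingNumber = await createShippingLabel(orderId);
  await sendConfirmationEmail(orderId, trackingNumber);
  return trackingNumber;
}
```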

A typical deployment includes the workflow server for centralized orchestration and state tracking, a history database containing the append-only log of all workflow decisions and activity results, and worker pools where workflow orchestration code and activity execution code run. Workers can scale independently based on load. The history database becomes the source of truth for workflow state, and the deterministic replay property ensures workflows can always be resumed correctly.

Declarative Workflow Systems: Managed workflow systems like AWS Step Functions and Google Cloud Workflows take a declarative approach. Rather than writing workflows as imperative code, you define them as state machines or directed acyclic graphs using JSON, YAML, or a domain-specific language; orchestrators such as Apache Airflow take a related approach, declaring their DAGs of tasks in Python. You specify the steps, transitions between steps, error handling rules, and retry policies in a structured format. The workflow engine interprets this definition and orchestrates execution.

The declarative approach brings different tradeoffs. Workflow definitions can be visualized as diagrams, making them easier to understand at a glance. The constrained expressiveness can be an advantage—it’s harder to write overly complex workflows. Integration with cloud services is typically seamless, with native support for invoking Lambda functions, making API calls, or running container tasks. The operational burden is minimal since these are managed services.

However, declarative workflows are less expressive than code. Complex conditional logic, loops, and error handling can become verbose and difficult to express in JSON or YAML. You often find yourself writing Lambda functions or container tasks to hold business logic that doesn’t fit the declarative model, which fragments the workflow definition. For simple orchestration of a few cloud service calls, declarative workflows are excellent. For complex business processes with intricate decision-making, code-based approaches often prove more maintainable.

Choosing the Right Approach: The decision of whether to use workflows, and which type, depends on the specific characteristics of your multi-step process. For simple sequential operations where each step completes quickly and reliably, traditional synchronous orchestration with proper error handling might suffice. A single service that calls other services, handles errors, and returns a response works fine when failures are exceptional and the entire process completes in seconds.

When you start seeing patterns like needing to undo earlier steps if later steps fail, coordinating work across multiple services with different failure modes, or handling processes that span minutes to hours with human-in-the-loop steps, workflows become compelling. The threshold isn’t a specific number of steps or duration—it’s the complexity of failure handling and state management. If you find yourself manually building state machines with database status columns and background jobs to advance stuck processes, a workflow system will likely simplify your architecture.

For teams already invested in a cloud ecosystem, managed workflow services like Step Functions or Cloud Workflows offer the easiest path. They eliminate operational overhead and integrate naturally with existing cloud services. They work well for orchestrating cloud resources and handling standard patterns. For more complex business processes requiring sophisticated logic, or when you need maximum control and portability across infrastructure, Temporal provides a powerful foundation despite higher operational complexity.

Handling Workflow Evolution: One of the most challenging aspects of workflow systems is handling changes to workflow definitions while existing workflows are running. A loan approval workflow might be running for a customer, and halfway through its execution you need to deploy a new version that adds a compliance check. How do you update the workflow without breaking the in-flight execution?

Workflow versioning is the simplest approach. Deploy the new workflow definition as a separate version. Existing workflows continue running the old version to completion, while newly started workflows use the new version. This works well when you can tolerate both versions running simultaneously and don’t need the new behavior to apply retroactively. The downside is operational complexity of maintaining multiple versions and potentially long periods before old workflows complete.

Workflow migration provides a way to update running workflows in place. With declarative systems, you can sometimes update the definition directly as long as changes don’t break assumptions about past execution. With code-based systems like Temporal, you use conditional logic to patch behavior—workflows that have already passed the patched point use old behavior, while workflows that haven’t yet reached it use new behavior. This requires careful design to ensure deterministic replay still works correctly.
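
For example, Temporal’s TypeScript SDK exposes a patched() helper for exactly this. The sketch below assumes a hypothetical loan approval workflow and activities module; the point is the conditional branch that keeps replay deterministic for old executions while new executions pick up the compliance check.

```ts
import { patched, proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities'; // hypothetical activities module

const { runComplianceCheck, issueDecision } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5 minutes',
});

export async function loanApproval(applicationId: string): Promise<void> {
  // ...earlier steps of the hypothetical loan workflow...

  if (patched('add-compliance-check')) {
    // Executions running the new code at this point take the new branch and
    // record a marker in history, so later replays stay deterministic.
    await runComplianceCheck(applicationId);
  }
  // Executions whose history was written before the deploy replay without the
  // marker, so patched() returns false and they keep the old behavior.

  await issueDecision(applicationId);
}
```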

State Size Management: Durable execution engines persist the complete history of workflow execution. Every activity execution, its inputs, its outputs, and the decisions made based on those outputs are recorded. This enables the replay-based recovery that makes workflows resilient, but it also means history can grow large for long-running workflows with many steps. Some workflow systems impose size limits on history, and even without hard limits, large histories impact performance.

The primary mitigation is keeping activity inputs and outputs small. Pass identifiers that can be looked up in databases or external systems rather than full objects. If an activity needs customer data, pass the customer ID rather than the complete customer record. The activity can fetch full data when it executes, and only the ID is recorded in history. For workflows that run indefinitely or accumulate large histories, periodic recreation can help—create a new workflow instance with just the current state needed to continue, allowing the old workflow with its large history to complete.
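
A sketch of an activity written this way, assuming a hypothetical customers table and mail helper and using the node-postgres client:

```ts
import { Client } from 'pg';

// Hypothetical mail helper; stands in for a real email client.
declare function deliverEmail(to: string, body: string): Promise<void>;

// Only the IDs cross the workflow/activity boundary, so only the IDs end up
// in workflow history; the full customer record is loaded here at execution time.
export async function sendConfirmationEmail(customerId: string, orderId: string): Promise<void> {
  const db = new Client(); // connection settings come from the environment
  await db.connect();
  try {
    const { rows } = await db.query('SELECT email, name FROM customers WHERE id = $1', [customerId]);
    await deliverEmail(rows[0].email, `Hi ${rows[0].name}, your order ${orderId} is confirmed.`);
  } finally {
    await db.end();
  }
}
```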

Ensuring Idempotency: A critical property of activities in workflow systems is idempotency—the ability to be safely retried with the same inputs. Workflow engines guarantee at-least-once execution of activities: if an activity succeeds but the acknowledgment back to the engine is lost due to a network failure, the engine will retry the activity. For operations like sending emails or processing payments, executing twice is problematic.

The solution is designing activities to detect and handle duplicate executions. Before performing irreversible actions, check whether the action has already been performed using an idempotency key stored in a database. For a payment activity, store a record with the idempotency key when initiating the payment. On retries, check if a payment with that key already exists before initiating a new one. This transforms at-least-once execution into effectively-once semantics from the business perspective.
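
One possible shape for such an activity, assuming a hypothetical payments table and a gateway client that accepts a caller-supplied idempotency key:

```ts
import { Client } from 'pg';

// Hypothetical gateway client that accepts a caller-supplied idempotency key.
declare function gatewayCharge(
  customerId: string,
  amountCents: number,
  opts: { idempotencyKey: string },
): Promise<string>;

export async function chargePayment(orderId: string, customerId: string, amountCents: number): Promise<string> {
  const db = new Client();
  await db.connect();
  try {
    // If a prior attempt already completed, return its result instead of charging again.
    const existing = await db.query('SELECT charge_id FROM payments WHERE idempotency_key = $1', [orderId]);
    if (existing.rows.length > 0) {
      return existing.rows[0].charge_id;
    }
    // Pass the same key to the gateway, so a crash between the charge and the
    // INSERT below still cannot turn a retry into a second charge.
    const chargeId = await gatewayCharge(customerId, amountCents, { idempotencyKey: orderId });
    await db.query('INSERT INTO payments (idempotency_key, charge_id) VALUES ($1, $2)', [orderId, chargeId]);
    return chargeId;
  } finally {
    await db.end();
  }
}
```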

External Events and Long Waits: Many workflows need to wait for external events that might take seconds, hours, or days. A document signing workflow might wait up to thirty days for a customer to sign. A manufacturing workflow might wait for a human operator to confirm completion of a physical process. Naive implementations that poll for status waste resources and add latency.

Workflow systems handle this efficiently through signals or external events. A workflow can suspend execution while waiting for a signal, consuming no resources. When the external event occurs—a webhook callback, a human clicking “approve” in a UI—the signal is delivered to the workflow engine, which resumes the workflow exactly where it was waiting. This pattern elegantly handles integration with external systems, human-in-the-loop processes, and any scenario where the workflow must wait for something outside its control.
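
With Temporal’s TypeScript SDK, this might look like the sketch below. The signal name, activities, and thirty-day deadline are assumptions for illustration.

```ts
import { condition, defineSignal, proxyActivities, setHandler } from '@temporalio/workflow';
import type * as activities from './activities'; // hypothetical activities module

const { sendSigningRequest, archiveSignedDocument, cancelSigningRequest } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export const documentSigned = defineSignal<[string]>('documentSigned');

export async function documentSigning(documentId: string): Promise<boolean> {
  let signatureId: string | undefined;
  setHandler(documentSigned, (id) => {
    signatureId = id;
  });

  await sendSigningRequest(documentId);

  // Suspend until the signal arrives or thirty days pass; no worker resources
  // are held while the workflow waits.
  const signed = await condition(() => signatureId !== undefined, '30 days');
  if (!signed) {
    await cancelSigningRequest(documentId);
    return false;
  }
  await archiveSignedDocument(documentId, signatureId!);
  return true;
}
```

On the other side, the webhook handler that receives the signing provider’s callback would look up the running workflow by ID with a Temporal client and send it the documentSigned signal, which wakes the waiting condition exactly where the workflow left off.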

When Workflows Aren’t the Answer: Despite their power, workflow systems aren’t appropriate for every scenario. Simple request-response APIs where clients wait synchronously for results don’t need workflows. Single-step asynchronous operations like resizing an image or sending an email are better handled with simple message queues. High-frequency, low-value operations where the overhead of workflow orchestration exceeds the value of reliability guarantees should use lighter-weight approaches.

The key is recognizing when the problems workflows solve—managing partial failures, coordinating multiple services, handling long-running processes, maintaining audit trails, implementing compensating actions—are actually problems your system faces. Don’t introduce workflow complexity for its own sake. Start with simpler approaches and adopt workflows when you find yourself manually building the state management and orchestration they provide.

Practical Application Patterns: Workflow systems shine in several common scenarios. Payment processing systems benefit enormously from workflows because they must coordinate multiple services, handle complex failure scenarios with compensating transactions, and maintain perfect audit trails for regulatory compliance. The workflow clearly expresses the business logic—attempt payment, on failure notify the customer, on success reserve inventory—while the engine handles retries, state persistence, and recovery.

Human-in-the-loop processes where systems must wait for people to complete tasks map naturally to workflows. Approval workflows, document signing, manual review processes all involve waiting for unpredictable durations while maintaining context about what needs to happen next. The workflow waits efficiently for signals indicating human action completion, then proceeds with subsequent automated steps.

Order fulfillment and similar multi-stage business processes that coordinate across organizational boundaries—inventory, shipping, customer service—benefit from the clear orchestration and observability workflows provide. The workflow becomes the source of truth for order state, and the event history provides complete visibility into what happened and why.

Multi-step processes in distributed systems present some of the most challenging engineering problems, but workflow systems and durable execution engines provide powerful tools for managing this complexity. They allow expressing business logic naturally while automating the difficult aspects of distributed coordination: state persistence, failure recovery, retry management, long-running operations, and compensating transactions. Success comes from recognizing when the problems you’re solving manually are exactly the problems these systems are designed to handle, choosing the right level of workflow sophistication for your needs, and resisting the temptation to overcomplicate solutions that don’t require workflow-level orchestration. Understanding these patterns enables building reliable systems that gracefully handle the messy reality of distributed computing.