distributed-transactions saga

In a Saga pattern implemented with an execution coordinator, how to handle a service that completes its transaction but does not notify the next service


Consider order processing that consists of multiple microservices, one for each of the following steps:

  1. Create order.
  2. Make payment.
  3. Update inventory.
  4. Deliver order.

Let's say a saga execution coordinator is used to implement the saga, where each of the steps above runs in its own microservice with its own persistence. Suppose Step 2, 'Make payment', was successful and the start and end of its transaction were reported to the coordinator, but the payment process instance went down before it could notify the update-inventory service. How can this be handled?

I am brainstorming best practices for a distributed-transaction use case in the design of a greenfield system. Should the check be implemented on the coordinator side, or is it better to run multiple instances of each service behind a centralized queue, assuming each step is idempotent and retryable?


Solution

  • The problem arises from separating out SAGA and service communication. Use an orchestrator and this problem disappears as SAGA becomes part of the orchestration itself.

    For your example, here is what the SAGA looks like using the Java SDK of the open source temporal.io project:

        Saga saga = new Saga(sagaOptions);
        try {
          // A compensation is registered only after its step has succeeded,
          // so compensate() undoes only the steps that actually completed.
          OrderOutput order = activities.createOrder(orderInput);
          saga.addCompensation(activities::cancelOrder, order);

          PaymentOutput payment = activities.makePayment(order);
          saga.addCompensation(activities::cancelPayment, payment);

          // The orchestrator, not the payment service, invokes the next step.
          UpdateInventoryOutput inventoryUpdate = activities.updateInventory(order);
          saga.addCompensation(activities::cancelUpdateInventory, inventoryUpdate);

          DeliverOrderOutput orderDelivery = activities.deliverOrder(order);
        } catch (ActivityFailure e) {
          // Run the registered compensations, then surface the failure.
          saga.compensate();
          throw e;
        }
    
    

    Note that while the above code looks like normal Java code, when executed through Temporal it is ensured to eventually complete in the presence of various types of failures, such as process crashes and network outages. Temporal keeps the whole program state, including local variables and the stacks of long-running API calls, in durable storage. That's why you don't need an application-level queue or DB to maintain the orchestration state.
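
To make the idea concrete without any Temporal dependency, here is a toy, in-memory sketch of the same control flow in plain Java. It is not how Temporal is implemented; `SagaSketch`, `Step`, and the `journal` list are hypothetical names invented for this illustration, and the list only stands in for the durable storage a real orchestrator would use. The sketch assumes, as the question does, that every step is idempotent and retryable. The point it tries to show is that the orchestrator calls each step itself, so a payment service that crashes after finishing its work has nobody left to notify: on restart the orchestrator reads the journal and continues with the next step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy orchestrator (illustration only): the journal stands in for durable storage. */
final class SagaSketch {

    /** Hypothetical step: a forward action plus the compensation that undoes it. */
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    /**
     * Runs the steps in order and records each completed step in the journal.
     * Steps already present in the journal (finished before a crash) are skipped.
     * If a step throws, completed steps are compensated in reverse order.
     * Returns true if the whole saga completed, false if it was rolled back.
     */
    static boolean run(List<Step> steps, List<String> journal) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (journal.contains(step.name)) {
                completed.push(step); // done in an earlier run; may still need compensation
                continue;
            }
            try {
                step.action.run();
                // A crash between the action and this write re-runs the step
                // on restart, which is why steps are assumed to be idempotent.
                journal.add(step.name);
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    Step undone = completed.pop();
                    undone.compensation.run();
                    journal.remove(undone.name);
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated restart: payment finished before the crash, inventory did not start.
        List<String> journal = new ArrayList<>(List.of("createOrder", "makePayment"));
        List<Step> steps = List.of(
            new Step("createOrder", () -> System.out.println("running createOrder"), () -> {}),
            new Step("makePayment", () -> System.out.println("running makePayment"), () -> {}),
            new Step("updateInventory", () -> System.out.println("running updateInventory"), () -> {}),
            new Step("deliverOrder", () -> System.out.println("running deliverOrder"), () -> {}));
        boolean ok = run(steps, journal);
        // Prints "running updateInventory", "running deliverOrder", then:
        // true [createOrder, makePayment, updateInventory, deliverOrder]
        System.out.println(ok + " " + journal);
    }
}
```

In this sketch the check lives on the coordinator side: the services stay unaware of each other, and running several instances of a service only affects who executes a step, not who decides what runs next.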