02 — Multi-agent orchestration with Bedrock Agents + Step Functions

Long-running, multi-step workflows where multiple specialized agents coordinate through a durable state machine.

Problem statement

A single LLM call is not the right shape for a workflow that takes minutes to hours, calls multiple external systems, may need human approval in the middle, and must survive a Lambda restart. You need durable orchestration with specialized agents as the workers.

Concrete example: a "research report" workflow that (1) gathers sources via a search agent, (2) summarizes each source via a summary agent, (3) drafts a report via a writer agent, (4) gates on human approval, (5) publishes.

Components

  • AWS Step Functions Standard. The control plane — durable, visible, supports human-approval tasks via waitForTaskToken.
  • Amazon Bedrock Agents. Each step is an agent invocation. Each agent has its own action groups and prompt template.
  • AWS Lambda. Adapters between Step Functions tasks and Bedrock agent invocations; also implements action groups.
  • Amazon DynamoDB. Workflow state and intermediate artifacts (URLs, summaries, draft sections).
  • Amazon S3. Final artifacts and any large intermediate payloads (Step Functions payload limit is 256 KB).
  • Amazon EventBridge. Cross-workflow events (completion, failure, approval requested).
  • Amazon SNS / SES. Human approval notifications.
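
As a rough illustration of the Lambda adapter role, the sketch below calls one Bedrock agent and joins its streamed reply into a string that a Step Functions task can return. The event fields (`agent_id`, `agent_alias_id`, `workflow_id`, `prompt`) and the choice to reuse the workflow ID as the session ID are assumptions made for this example, not something the architecture prescribes.

```python
from typing import Any


def invoke_agent_text(client: Any, agent_id: str, alias_id: str,
                      session_id: str, prompt: str) -> str:
    """Call one Bedrock agent and join its streamed reply into a string."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,
        inputText=prompt,
    )
    parts = []
    # The reply is an event stream; the text is carried in "chunk" events.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def handler(event: dict, context: Any = None, client: Any = None) -> dict:
    """Lambda entry point for a Step Functions task (event shape is assumed)."""
    if client is None:
        import boto3  # imported lazily so the module loads without AWS access
        client = boto3.client("bedrock-agent-runtime")
    text = invoke_agent_text(
        client,
        agent_id=event["agent_id"],
        alias_id=event["agent_alias_id"],
        session_id=event["workflow_id"],  # assumption: one session per workflow
        prompt=event["prompt"],
    )
    return {"workflow_id": event["workflow_id"], "output": text}
```

Passing the client as an argument keeps the adapter testable with a stub. If the reply can be large, the adapter would return an S3 pointer instead of the text itself.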

Diagram

flowchart TB
    Start([Trigger]) --> SF[Step Functions Standard]
    SF --> A1[Task: Search Agent]
    A1 --> A2[Task: Summary Agent fan-out]
    A2 --> A3[Task: Writer Agent]
    A3 --> Approval{Human approval?}
    Approval -->|Wait for token| Notify[SNS/SES → reviewer]
    Notify --> Approval
    Approval -->|Approved| Publish[Task: Publish]
    Approval -->|Rejected| Revise[Task: Revise Agent]
    Revise --> A3
    Publish --> Done([End])
    SF -.workflow state.-> DDB[(DynamoDB)]
    A1 & A2 & A3 & Revise -.invoke.-> Agents[Bedrock Agents]
    Agents --> KB[(Knowledge Bases)]
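
The diagram can be sketched as an Amazon States Language definition built from a Python dict. This is a minimal skeleton under stated assumptions: the Lambda function names and the SNS topic ARN are placeholders, the summary stage is shown as a single task rather than a fan-out, and a rejection is modelled as a task failure named `Rejected` that a `Catch` routes to the revise step.

```python
import json
from typing import Optional

LAMBDA_INVOKE = "arn:aws:states:::lambda:invoke"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:report-approvals"  # placeholder


def lambda_task(function_name: str, next_state: Optional[str]) -> dict:
    """Task state that calls an adapter Lambda; ends the workflow if next_state is None."""
    state = {
        "Type": "Task",
        "Resource": LAMBDA_INVOKE,
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "OutputPath": "$.Payload",
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition() -> dict:
    """Search -> Summarize -> Write -> AwaitApproval -> Publish, with a revise loop."""
    return {
        "StartAt": "Search",
        "States": {
            "Search": lambda_task("search-agent-adapter", "Summarize"),
            "Summarize": lambda_task("summary-agent-adapter", "Write"),
            "Write": lambda_task("writer-agent-adapter", "AwaitApproval"),
            "AwaitApproval": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
                "Parameters": {
                    "TopicArn": TOPIC_ARN,
                    "Message": {"draft.$": "$", "task_token.$": "$$.Task.Token"},
                },
                "ResultPath": "$.approval",
                "Catch": [{"ErrorEquals": ["Rejected"],
                           "ResultPath": "$.rejection", "Next": "Revise"}],
                "Next": "Publish",
            },
            "Revise": lambda_task("revise-agent-adapter", "Write"),
            "Publish": lambda_task("publish-report", None),
        },
    }


def dangling_targets(definition: dict) -> set:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)
```

`json.dumps(build_definition())` yields the JSON string that a state machine definition expects. `dangling_targets` is a cheap local check that every transition points at a real state before deploying.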

Decisions

D1 — Step Functions Standard, not Express

Context. The workflow can run for minutes to hours. Express workflows have a hard 5-minute execution limit.

Decision. Standard. Pay-per-state-transition is fine for low-volume workflows; Express savings only matter at high event rate.

Alternatives. Express + recursion via EventBridge — possible but adds complexity. Step Functions Standard wins for readability.

Consequences. At $0.025 per 1,000 state transitions, costs can add up once retries and Map iterations are counted; budget early.

D2 — One agent per role, not one mega-agent

Context. We could use a single agent that owns every action group, or split agents per role.

Decision. Split per role: Search, Summary, Writer, Reviser. Each one has a tight system prompt, only the tools it needs, and its own evals.

Alternatives. One generalist agent: simpler to configure, but it mixes responsibilities and is harder to evaluate.

Consequences. More IaC, but each agent is independently testable. Easier to swap or A/B test one role.

D3 — Human-in-the-loop via waitForTaskToken

Context. Editorial / compliance workflows need approval before publication.

Decision. Use the Step Functions waitForTaskToken pattern: the workflow emits an SNS notification carrying the task token, and the reviewer clicks an "approve" link that invokes a small Lambda, which calls SendTaskSuccess or SendTaskFailure.

Alternatives. Polling DynamoDB — wasteful. EventBridge ↔ Step Functions integration — viable but more moving parts.

Consequences. A Standard execution, and therefore a pending token, can wait up to 1 year, which is plenty. Make sure the approval Lambda authenticates the reviewer.
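
The approval Lambda described in this decision might look like the sketch below. It assumes the approval link carries the task token, the reviewer name, the decision, and an HMAC signature created when the notification was sent; the parameter names and the `Rejected` error name are hypothetical. A signed link only shows that the link was issued by the workflow, so stronger reviewer authentication would sit in front of this handler.

```python
import hashlib
import hmac
import json
from typing import Any


def sign(token: str, reviewer: str, secret: bytes) -> str:
    """HMAC over reviewer and token, embedded in the link when it is sent."""
    msg = f"{reviewer}:{token}".encode("utf-8")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def handle_decision(params: dict, sfn: Any, secret: bytes) -> int:
    """Resume or fail the paused execution; returns an HTTP-style status code."""
    expected = sign(params["token"], params["reviewer"], secret)
    supplied = params.get("sig", "")
    # Constant-time comparison to avoid leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8")):
        return 403  # link was not issued by the workflow, or was altered
    if params["decision"] == "approve":
        sfn.send_task_success(
            taskToken=params["token"],
            output=json.dumps({"approved_by": params["reviewer"]}),
        )
    else:
        # Surfaces in the state machine as an error that a Catch can route on.
        sfn.send_task_failure(
            taskToken=params["token"],
            error="Rejected",
            cause=f"Rejected by {params['reviewer']}",
        )
    return 200
```

In a deployed Lambda, `sfn` would be `boto3.client("stepfunctions")` and `secret` would come from a secrets store rather than from code.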

D4 — Intermediate artifacts in S3 with pointers in payload

Context. Step Functions payload limit is 256 KB. Sources, summaries and drafts blow past that easily.

Decision. Put artifacts in S3 (s3://bucket/workflow-id/...), pass S3 keys in the payload, dereference inside each task.

Alternatives. Inline the artifacts just in time via DynamoDB. DynamoDB items are capped at 400 KB, so this runs into the same problem with only slightly more headroom.

Consequences. Slight extra latency per task (S3 round trip). Worth it.
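
The pointer pattern can be sketched as two small helpers plus a task that uses them. The key layout follows the `workflow-id/...` convention above; the helper names, the JSON encoding, and the event fields are assumptions for illustration.

```python
import json
from typing import Any, Callable


def put_artifact(s3: Any, bucket: str, workflow_id: str, name: str, data: Any) -> dict:
    """Write one artifact to S3 and return the small pointer that travels in the payload."""
    key = f"{workflow_id}/{name}.json"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(data).encode("utf-8"),
        ContentType="application/json",
    )
    return {"bucket": bucket, "key": key}


def get_artifact(s3: Any, pointer: dict) -> Any:
    """Dereference a pointer back into the stored object."""
    obj = s3.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
    return json.loads(obj["Body"].read())


def summarize_task(event: dict, s3: Any, summarize: Callable[[Any], Any]) -> dict:
    """Task body: read the source by pointer, write the summary, return a pointer."""
    source = get_artifact(s3, event["source"])
    summary = summarize(source)
    pointer = put_artifact(s3, event["source"]["bucket"], event["workflow_id"],
                           f"summaries/{event['index']}", summary)
    return {"workflow_id": event["workflow_id"], "summary": pointer}
```

The state payload then stays at a few hundred bytes per artifact regardless of how large the sources or drafts grow.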

Cost analysis

Sizing          Workflows / mo   Tasks / workflow   Approx. monthly USD
S (pilot)       100              8                  ~$120
M (team)        2,000            12                 ~$1,050
L (biz unit)    20,000           16                 ~$8,200

Inputs (M sizing):

  • Step Functions: 2k × 12 = 24k transitions → first 4k covered by the free tier, remaining 20k ≈ $0.50
  • Bedrock Claude Sonnet across all agents: ~$600
  • Lambda: ~$30
  • DynamoDB on-demand: ~$20
  • S3 storage + requests: ~$10
  • Knowledge Base retrieval: ~$200
  • EventBridge / SNS: ~$5
  • Logs & misc: ~$185
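
The arithmetic behind the M sizing can be checked with a few lines. This sketch counts one state transition per task, as the inputs above do, and assumes the Standard free tier of 4,000 transitions per month; the other line items are copied from the list rather than derived.

```python
FREE_TRANSITIONS = 4_000        # assumed monthly free tier for Standard workflows
USD_PER_1K_TRANSITIONS = 0.025  # rate quoted in D1


def step_functions_usd(workflows: int, tasks_per_workflow: int) -> float:
    """Monthly Step Functions charge, counting one transition per task."""
    transitions = workflows * tasks_per_workflow
    billable = max(0, transitions - FREE_TRANSITIONS)
    return billable / 1_000 * USD_PER_1K_TRANSITIONS


# M sizing: every figure except Step Functions is taken from the inputs list.
M_LINE_ITEMS = {
    "step_functions": step_functions_usd(2_000, 12),
    "bedrock": 600,
    "lambda": 30,
    "dynamodb": 20,
    "s3": 10,
    "knowledge_base": 200,
    "eventbridge_sns": 5,
    "logs_misc": 185,
}
M_TOTAL_USD = sum(M_LINE_ITEMS.values())
```

The orchestration layer is a rounding error next to model inference, which is why the model choice per agent dominates the bill.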

Well-Architected review

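Operational excellence. Step Functions execution history is gold for debugging: it is effectively a free distributed trace. Resource tags apply to state machines rather than to individual executions, so carry workflow_id, tenant, and version in the execution name and input to make executions filterable.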

Security. Each agent's action-group Lambda has its own IAM role. The orchestrator role is allowed states:StartExecution only on the specific workflow ARN. Approval Lambdas validate the reviewer's identity via Cognito or signed URLs.

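Reliability. Standard workflows survive worker failures, and Retry and Catch are first-class in the state language. Do not assume Bedrock agent invocations are idempotent: reusing a sessionId continues the same conversation rather than deduplicating the call, so keep action-group side effects idempotent (for example, key writes on workflow ID and step name).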

Performance efficiency. Parallel summary stage uses Map state with MaxConcurrency set to a sane cap (e.g. 10) to stay under Bedrock TPS limits.
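
A capped fan-out of this kind might be declared as follows. The sketch assumes the search step leaves a list at `$.sources` and that a Lambda named `summary-agent-adapter` exists; both names are placeholders. The retry rule matches any task failure, which is broader than throttling alone.

```python
import json


def summary_map_state(max_concurrency: int = 10) -> dict:
    """Map state that summarizes each source with at most max_concurrency in flight."""
    return {
        "Type": "Map",
        "ItemsPath": "$.sources",
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "SummarizeOne",
            "States": {
                "SummarizeOne": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": "summary-agent-adapter",  # placeholder
                        "Payload.$": "$",
                    },
                    "OutputPath": "$.Payload",
                    # Exponential backoff: 2 s, 4 s, 8 s, ... up to five attempts.
                    "Retry": [{
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 2,
                        "BackoffRate": 2.0,
                        "MaxAttempts": 5,
                    }],
                    "End": True,
                }
            },
        },
        "ResultPath": "$.summaries",
        "Next": "Write",
    }
```

Lowering `max_concurrency` trades wall-clock time for fewer throttled calls, and each retry is billed as an extra state transition.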

Cost optimization. For agents that don't need Claude Sonnet, drop to Haiku — savings stack across the workflow. Cache the search-agent results in DynamoDB with a TTL.
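
The search-result cache could be sketched like this, using the low-level DynamoDB client. The table layout (`pk`, `results`, `expires_at`) and the one-day default are assumptions; `expires_at` would need to be configured as the table's TTL attribute. TTL deletion runs in the background, so the read path also checks the expiry itself.

```python
import hashlib
import json
import time
from typing import Any, Optional


def cache_key(query: str) -> str:
    """Stable key for a query, ignoring case and surrounding whitespace."""
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()


def put_cached(ddb: Any, table: str, query: str, results: list,
               ttl_seconds: int = 86_400, now: Optional[int] = None) -> None:
    """Store results with an expiry timestamp in epoch seconds."""
    now = int(time.time()) if now is None else now
    ddb.put_item(
        TableName=table,
        Item={
            "pk": {"S": cache_key(query)},
            "results": {"S": json.dumps(results)},
            "expires_at": {"N": str(now + ttl_seconds)},
        },
    )


def get_cached(ddb: Any, table: str, query: str,
               now: Optional[int] = None) -> Optional[list]:
    """Return cached results, or None when missing or past their expiry."""
    now = int(time.time()) if now is None else now
    item = ddb.get_item(TableName=table, Key={"pk": {"S": cache_key(query)}}).get("Item")
    # Expired items may linger until the background TTL sweep removes them.
    if not item or int(item["expires_at"]["N"]) <= now:
        return None
    return json.loads(item["results"]["S"])
```

Result sets that approach the 400 KB item limit would be better stored in S3, with only the pointer cached.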

Sustainability. No idle compute — Step Functions and Lambda only run when invoked. Bedrock is multi-tenant on AWS's side.

Trade-offs

Use this when:

  • Workflow exceeds the 15-min Lambda cap or has human-in-the-loop gates.
  • You want each step's reasoning surfaced (Step Functions UI shows inputs/outputs per state).
  • Volume is hundreds to low-thousands of workflows per day.

Do NOT use this when:

  • Workflow is short (< 30 s) and stateless — call Bedrock directly from a single Lambda.
  • You need true sub-second latency on every step — Step Functions adds ~50–100 ms per state.
  • The pattern is fan-out only (no orchestration) — EventBridge + SQS is simpler. See arch 04.

Terraform skeleton

See terraform/ — creates the state machine, IAM, DynamoDB and S3. Agents are referenced by ID (you provision them separately or via the AWS console while iterating).