03 — Streaming AI inference¶
Token-level streaming responses from Bedrock to a browser chat UI, with the lowest possible time-to-first-token.
Problem statement¶
Users expect chat UIs to start streaming tokens within a second. A non-streaming InvokeModel call returns the full response in 3–8 s, which feels unresponsive even when the total latency is acceptable. We need to stream tokens end-to-end: Bedrock → backend → browser.
Components¶
- Amazon CloudFront. TLS termination, edge caching for static assets, low-latency global delivery of the chat UI.
- Amazon S3. Static hosting for the chat UI.
- AWS Lambda with response streaming. Connects to `InvokeModelWithResponseStream`, transforms Bedrock's event stream into Server-Sent Events (SSE), and streams to the browser.
- Function URL (with `AWS_IAM` or Cognito). Direct HTTPS endpoint that supports streaming; API Gateway REST does not support response streaming (yet).
- Amazon Bedrock — `InvokeModelWithResponseStream`. Generates the token stream.
- Amazon Cognito. Authentication; ID token validated by the streaming Lambda.
- Amazon DynamoDB. Per-conversation memory and rate limits.
- Amazon CloudWatch. Custom metrics for time-to-first-token (TTFT) and tokens/s.
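The Lambda's core transform is small: each event on the Bedrock response stream carries a `chunk.bytes` payload that decodes to a JSON document. A minimal sketch of extracting token text from one such payload, assuming the Anthropic Messages API event shape that Bedrock relays (other model families use different shapes):

```typescript
// Shape of one decoded stream event for Anthropic models on Bedrock;
// `content_block_delta` events carry the generated text.
interface BedrockEvent {
  type: string;
  delta?: { type: string; text?: string };
}

function tokenFromEvent(raw: Uint8Array): string | null {
  // `raw` is the `chunk.bytes` payload of one response-stream event.
  const event = JSON.parse(new TextDecoder().decode(raw)) as BedrockEvent;
  if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
    return event.delta.text ?? null;
  }
  return null; // message_start, message_stop, ping, etc. carry no text
}
```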
Diagram¶
```mermaid
flowchart LR
  Browser((Browser)) -->|HTTPS| CF[CloudFront]
  CF --> S3[(S3 - chat UI)]
  Browser -->|SSE| FURL[Function URL]
  FURL --> Lambda["Lambda<br/>response streaming"]
  Lambda -->|InvokeModelWithResponseStream| Bedrock[Amazon Bedrock]
  Lambda --> Cognito[Cognito - validate JWT]
  Lambda --> DDB[("DynamoDB<br/>conversation memory")]
  Lambda --> CW["CloudWatch<br/>TTFT, tokens/s"]
```
Decisions¶
D1 — Function URL with streaming, not API Gateway¶
Context. API Gateway REST APIs do not support response streaming, and API Gateway's WebSocket APIs are a separate bidirectional protocol, not server-sent events.
Decision. Lambda Function URL with RESPONSE_STREAM invoke mode. Auth via Cognito JWT validated inside the function.
Alternatives. API Gateway WebSocket — works, but requires a more complex client and complicates caching. AppSync subscriptions — overkill.
Consequences. No edge throttling — implement rate limits in the Lambda (DynamoDB token bucket).
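A sketch of that token-bucket check; in production the bucket state would live in a DynamoDB item keyed by the Cognito `sub` and updated with a conditional write, but the capacity and refill-rate constants here are illustrative:

```typescript
interface Bucket { tokens: number; lastRefillMs: number }

const CAPACITY = 10;        // max burst size (assumed)
const REFILL_PER_SEC = 0.5; // sustained requests/second (assumed)

function tryConsume(bucket: Bucket, nowMs: number): boolean {
  // Refill proportionally to elapsed time, capped at the bucket capacity.
  const elapsedSec = (nowMs - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSec * REFILL_PER_SEC);
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true; // allow the request
  }
  return false;  // caller should respond 429
}
```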
D2 — Server-Sent Events over WebSocket¶
Context. This is a one-way server-to-client stream. The client doesn't need to push tokens back.
Decision. SSE. Simple text/event-stream; browsers reconnect automatically; one open HTTP connection per session.
Alternatives. WebSocket — bidirectional capability we don't need, more state.
Consequences. SSE has fewer edge-case bugs and is trivial to test with curl.
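The framing the Lambda emits is just the `text/event-stream` wire format: `data: <payload>` lines terminated by a blank line. A minimal sketch, with an assumed `[DONE]` sentinel so the browser can distinguish a clean finish from a dropped connection (EventSource reconnects on drops):

```typescript
// Each SSE event is one or more "data:" lines followed by a blank line;
// multi-line payloads need one "data:" line per line of text.
function sseFrame(payload: string): string {
  return payload
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n") + "\n\n";
}

// Terminal sentinel (a convention, not part of the SSE spec).
const SSE_DONE = "data: [DONE]\n\n";
```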
D3 — CloudFront → S3 for the UI, Function URL separately (not behind CloudFront)¶
Context. Could front Function URL with CloudFront to get one domain. CloudFront has caveats with streaming responses.
Decision. Two endpoints: the UI on CloudFront, the API on the raw Function URL. CORS configured.
Alternatives. CloudFront + Lambda Function URL origin — supported but adds a tier and burns a CF distribution.
Consequences. Two DNS records and explicit CORS handling. Worth it.
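The explicit CORS handling looks like this sketch: because the Function URL origin differs from the CloudFront UI origin, every response (and the OPTIONS preflight) must carry these headers. The UI domain is a placeholder:

```typescript
const UI_ORIGIN = "https://chat.example.com"; // hypothetical CloudFront domain

function corsHeaders(requestOrigin: string | undefined): Record<string, string> {
  // Echo only the single allowed origin; never "*" when credentials
  // (the Cognito JWT) are involved.
  if (requestOrigin !== UI_ORIGIN) return {};
  return {
    "Access-Control-Allow-Origin": UI_ORIGIN,
    "Access-Control-Allow-Headers": "authorization,content-type",
    "Access-Control-Allow-Methods": "POST,OPTIONS",
  };
}
```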
D4 — Conversation memory in DynamoDB, not in the request body¶
Context. Clients could send the whole history every turn. Server could remember.
Decision. Server-side memory in DynamoDB. Client sends only the new message and a conversation_id.
Alternatives. Client-stored history — the full conversation is resent every turn and exposed in the client; request size grows linearly with conversation length.
Consequences. Need a TTL on the table; need a "clear" endpoint.
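A sketch of the conversation item with the TTL attribute the consequences call for. DynamoDB TTL expects an epoch-seconds number; the key schema, attribute names, and retention window here are illustrative:

```typescript
const TTL_DAYS = 30; // assumed retention window

function conversationItem(
  conversationId: string,
  sub: string,          // Cognito subject claim
  turns: object[],      // accumulated message history
  nowMs: number,
) {
  return {
    pk: `CONV#${conversationId}`,
    sk: `USER#${sub}`,
    turns,
    // Epoch seconds; DynamoDB deletes the item some time after this passes.
    expiresAt: Math.floor(nowMs / 1000) + TTL_DAYS * 24 * 3600,
  };
}
```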
Cost analysis¶
| Sizing | Conversations / mo | Avg tokens / turn | Approx. monthly USD |
|---|---|---|---|
| S — demo | 1 000 | 800 | ~ $25 |
| M — product | 50 000 | 1 500 | ~ $640 |
| L — scale | 1 000 000 | 1 800 | ~ $15 200 |
Inputs (M sizing):
- Bedrock Claude Sonnet: 50k × 1.5k = 75M tokens (~70% input, 30% output) → ~$500
- Lambda streaming: 50k × 3 s × 512 MB = ~75k GB-s → ~$1.25
- Function URL: free
- DynamoDB on-demand: ~$15
- CloudFront + S3: ~$10
- Cognito: 50k MAU → first 50k free, then $0.0055/user → varies
- CloudWatch logs + metrics: ~$15
- Misc: ~$100
Well-Architected review¶
Operational excellence. Custom CloudWatch EMF metrics for ttft_ms, tokens_per_second, output_tokens. Alarm on p99 TTFT > 1500 ms.
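A sketch of one EMF record for `ttft_ms`: printed to stdout, the Lambda log pipeline turns it into a CloudWatch metric without a PutMetricData call. The namespace is an assumption:

```typescript
// CloudWatch Embedded Metric Format: the `_aws` block declares the
// metric; the value lives as a top-level field with the same name.
function emfTtft(ttftMs: number, timestampMs: number): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: "ChatStreaming", // assumed namespace
          Dimensions: [[]],
          Metrics: [{ Name: "ttft_ms", Unit: "Milliseconds" }],
        },
      ],
    },
    ttft_ms: ttftMs,
  });
}
```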
Security. JWT validation in the Lambda; tight CORS; rate limits per Cognito sub. Bedrock model invocation profile with content filters.
Reliability. Function URLs are regional and multi-AZ. Bedrock throttling propagates upstream — Lambda must surface 429 cleanly so the client can back off.
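On the client side, backing off from that 429 can be as simple as exponential backoff with full jitter; the base and cap here are illustrative:

```typescript
// Delay before retry attempt N (0-based). Full jitter: pick uniformly
// in [0, min(cap, base * 2^attempt)) so retries don't synchronize.
function backoffMs(attempt: number, rand: () => number = Math.random): number {
  const base = 500;   // ms (assumed)
  const cap = 30_000; // ms (assumed)
  const ceiling = Math.min(cap, base * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}
```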
Performance efficiency. Provisioned concurrency on the streaming Lambda for hot paths. Region-pin the deployment to where Bedrock latency is lowest for your users.
Cost optimization. Drop to Claude Haiku for short replies. Cache the system prompt via Bedrock prompt caching (cached input tokens are billed at a steep discount, up to 90% for supported models).
Sustainability. Serverless throughout; streaming reduces wasted retransmission of unused tokens (clients can abort mid-stream).
Trade-offs¶
Use this when:
- The UI is chat or chat-like and TTFT matters.
- Volume is bursty (Function URL scales linearly with no cold-start tax beyond Lambda's normal cold start).
Do NOT use this when:
- You need server-side fan-out to multiple clients per conversation — SSE is point-to-point.
- You need true bidirectional control (e.g. voice with interruption) — WebSocket + API Gateway WS is the pattern.
- The response is structured JSON the client must validate atomically — partially received JSON can't be parsed or validated until the stream completes.
Terraform skeleton¶
See terraform/ — Lambda with RESPONSE_STREAM invoke mode, Function URL with AWS_IAM, CloudFront + S3 for the UI.