
01 — RAG with Bedrock + OpenSearch Serverless

Retrieval-augmented generation over a private corpus, using Amazon Bedrock Knowledge Bases and OpenSearch Serverless as the vector store.

Problem statement

Your users need to ask natural-language questions over an internal document corpus (policies, runbooks, contracts, product docs) and get answers grounded in that corpus, with citations. You want this without standing up a vector DB cluster, without managing embeddings yourself, and with as little plumbing as possible.

Components

  • Amazon Bedrock — Foundation Models. Claude Sonnet for synthesis. Titan Text Embeddings for vectorising chunks. Both managed; no model hosting.
  • Amazon Bedrock — Knowledge Bases. Owns ingestion, chunking, embedding, retrieval. Eliminates ~70% of the plumbing you'd otherwise write.
  • Amazon OpenSearch Serverless (vector collection). Vector store with hybrid (BM25 + KNN) search. Auto-scales OCUs.
  • Amazon S3. Source-of-truth bucket for the documents — KB watches a prefix and re-ingests on change.
  • Amazon API Gateway (REST) + AWS Lambda. Thin orchestration layer in front of RetrieveAndGenerate.
  • Amazon Cognito (optional). User authentication + per-user audit context.
  • Amazon CloudWatch. Logs, metrics, and embedded-metric-format custom metrics for citations-per-query and latency.
  • AWS IAM. Least-privilege roles per Lambda; KB has a dedicated role with read-only on the S3 prefix and write on the vector index.
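Wiring these components together, the orchestrator Lambda can be a few dozen lines. A minimal sketch, assuming API Gateway's Lambda proxy integration and boto3's `bedrock-agent-runtime` client; `KB_ID` and `MODEL_ARN` are hypothetical environment variables you'd set on the function:

```python
import json
import os

# Hypothetical configuration; set these on the Lambda in a real deployment.
KB_ID = os.environ.get("KB_ID", "EXAMPLEKBID")
MODEL_ARN = os.environ.get(
    "MODEL_ARN",
    "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
)


def build_request(question: str) -> dict:
    """Shape the RetrieveAndGenerate request: the KB handles retrieval,
    prompt assembly, and synthesis in one call."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KB_ID,
                "modelArn": MODEL_ARN,
            },
        },
    }


def handler(event, context):
    """API Gateway (Lambda proxy) -> RetrieveAndGenerate -> answer + citations."""
    import boto3  # imported here so build_request stays usable without the SDK

    question = json.loads(event["body"])["question"]
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**build_request(question))

    # Flatten the citation objects down to their source locations.
    citations = [
        ref["location"]
        for c in resp.get("citations", [])
        for ref in c.get("retrievedReferences", [])
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"answer": resp["output"]["text"], "citations": citations}),
    }
```

Keeping the Lambda this thin is the point of D1/D3 below: the KB owns everything between the question and the cited answer.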

Diagram

flowchart LR
    User((User)) --> APIG[API Gateway]
    APIG --> Cognito[Cognito Authorizer]
    APIG --> Lambda[Lambda - orchestrator]
    Lambda -->|RetrieveAndGenerate| KB[Bedrock Knowledge Base]
    KB --> OS[(OpenSearch Serverless<br/>vector collection)]
    KB -->|read documents| S3[(S3 - corpus)]
    KB -->|embed + synth| Bedrock[Amazon Bedrock<br/>Titan + Claude]
    Lambda --> CW[CloudWatch Logs + Metrics]
    Lambda --> User

Decisions

D1 — Bedrock Knowledge Bases instead of self-managed RAG

Context. Building chunking + embedding + retrieval yourself is at least two weeks of work and ~600 lines of glue code (including incremental re-indexing on S3 changes).

Decision. Use Bedrock Knowledge Bases and accept its constraints: the chunking strategy is fixed-size with sentence-boundary splitting, and each KB is tied to a single embedding model.

Alternatives. LangChain + pgvector on RDS (more flexibility, more ops). LlamaIndex + Pinecone (great DX, third-party data plane).

Consequences. Significantly faster shipping, less code to maintain, AWS-native IAM and observability. You give up some chunking flexibility — if your docs benefit from doc-aware chunking (markdown headings, code blocks), revisit.

D2 — OpenSearch Serverless over OpenSearch Provisioned

Context. Vector workload is bursty (re-ingestion during the day, sparse queries at night).

Decision. Use the Serverless option with a vector collection, minimum 2 OCUs (1 search + 1 indexing).

Alternatives. Provisioned (~30–40% cheaper at steady high load, but ops-heavy). Aurora pgvector (good for joint relational/vector queries — not our case).

Consequences. ~USD 350/mo floor for 2 OCUs even at zero traffic. In return, no cluster sizing or rebalancing decisions.

D3 — RetrieveAndGenerate server-side instead of Retrieve + manual synth

Context. We could call Retrieve ourselves and then InvokeModel separately.

Decision. Use RetrieveAndGenerate — KB owns the retrieve → synth dance, returning citations.

Alternatives. Manual split gives more control over prompt templates and lets you blend results across multiple KBs.

Consequences. Less control but one round trip and a canonical citation shape. Revisit if you need multi-KB or custom prompt engineering.
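For contrast, the manual split named in the alternatives might look like the sketch below: `Retrieve` for chunks, a hand-rolled prompt, then a separate model call via the Converse API. The KB ID, model ID, and prompt template are all hypothetical:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Hand-rolled synthesis prompt -- the template you gain control of
    (and now own) by splitting Retrieve from generation."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the excerpts below. "
        "Cite excerpts by their [number].\n\n"
        f"Excerpts:\n{numbered}\n\nQuestion: {question}"
    )


def answer(question: str, kb_id: str = "EXAMPLEKBID") -> str:
    """Two round trips instead of one: Retrieve, then Converse."""
    import boto3  # kept local so build_prompt is usable without the SDK

    agent = boto3.client("bedrock-agent-runtime")
    resp = agent.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    chunks = [r["content"]["text"] for r in resp["retrievalResults"]]

    runtime = boto3.client("bedrock-runtime")
    out = runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": build_prompt(question, chunks)}]}],
    )
    return out["output"]["message"]["content"][0]["text"]
```

Note what you also take on with this split: citation extraction, prompt-template versioning, and retries across two calls instead of one.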

D4 — API Gateway REST instead of Function URL

Context. Single Lambda fronting the KB.

Decision. API Gateway REST with a Cognito authorizer and a usage plan.

Alternatives. Lambda Function URL (cheaper, simpler — no Cognito authorizer or per-user throttling).

Consequences. Slightly more cost (~USD 3.50/M requests) in exchange for auth, rate limits and a stable contract.

Cost analysis

Assumes us-east-1 on-demand. Embedding & synthesis pricing as of 2026-Q1.

| Sizing | Queries / mo | Corpus | Re-ingestion / mo | Approx. monthly USD |
|--------|--------------|--------|-------------------|---------------------|
| S — pilot | 10 000 | 100 MB | 1× full | ~$430 |
| M — team | 200 000 | 5 GB | 1× full + 10% incrementals | ~$880 |
| L — org | 2 000 000 | 50 GB | weekly incrementals | ~$3 100 |

Inputs (M sizing):

  • OpenSearch Serverless: 2 OCUs × 730h × $0.24 = $350
  • Bedrock Claude Sonnet: 200k queries × (1k in + 0.5k out tokens avg) = 200M in + 100M out → ~$300
  • Bedrock Titan Embeddings: assuming the extractable text is a small fraction of the 5 GB of source files, re-ingestion with ~1.5× chunk overlap lands in the low millions of tokens → ~$2.50
  • API Gateway: 200k × $3.50/M = $0.70
  • Lambda: 200k × 300 ms × 512 MB = ~30k GB-s → free tier
  • S3: 5 GB × $0.023 = $0.12
  • CloudWatch: ~$25 logs/metrics
  • Misc (Cognito, data transfer): ~$200
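The inputs above sum as follows; a small script makes it easy to re-run the model when an assumption changes (all prices are the assumptions listed above):

```python
# Back-of-envelope monthly cost for the M sizing, using the assumptions above.
OCU_HOURLY_USD = 0.24
HOURS_PER_MONTH = 730

costs_usd = {
    "opensearch_serverless": 2 * HOURS_PER_MONTH * OCU_HOURLY_USD,  # 2-OCU floor
    "claude_sonnet": 300.00,          # 200M input + 100M output tokens
    "titan_embeddings": 2.50,         # re-ingestion embeddings
    "api_gateway": 200_000 / 1e6 * 3.50,
    "lambda": 0.00,                   # ~30k GB-s, inside the free tier
    "s3": 5 * 0.023,
    "cloudwatch": 25.00,
    "misc": 200.00,                   # Cognito, data transfer
}

total = sum(costs_usd.values())
print(f"M sizing ~= ${total:,.0f}/mo")  # → M sizing ~= $879/mo
```

Note how the OCU floor is ~40% of the total — which is why D2's consequences and the cost-optimization pillar both single it out.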

Verify with the AWS Pricing Calculator before any commitment.

Well-Architected review

Operational excellence. Structured logs with request_id and kb_invocation_id; alarms on retrieval latency p99 and on citations-per-answer = 0 (sign of a bad query or empty corpus).
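The citations-per-query metric needs no CloudWatch API calls: printing an embedded-metric-format (EMF) record from the Lambda is enough for CloudWatch to extract it. A sketch — the namespace and dimension names are hypothetical:

```python
import json
import time


def emit_query_metrics(citations_per_query: int, latency_ms: float) -> str:
    """Build a CloudWatch EMF record; in Lambda, printing this line to stdout
    publishes both metrics without any PutMetricData call."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "RagService",  # hypothetical namespace
                    "Dimensions": [["Service"]],
                    "Metrics": [
                        {"Name": "CitationsPerQuery", "Unit": "Count"},
                        {"Name": "RetrievalLatency", "Unit": "Milliseconds"},
                    ],
                }
            ],
        },
        "Service": "rag-orchestrator",
        "CitationsPerQuery": citations_per_query,
        "RetrievalLatency": latency_ms,
    }
    return json.dumps(record)


print(emit_query_metrics(3, 1240.0))
```

An alarm on `CitationsPerQuery = 0` is then a one-line metric alarm rather than a log-parsing pipeline.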

Security. KB role scoped to a single S3 prefix; vector collection is private (data access policies, not public). Cognito JWT validated at the edge. Bedrock content filters enabled on the model invocation profile.
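Keeping the collection private means scoping a data access policy to the KB's service role. A sketch, with a hypothetical collection name (`rag-corpus`) and role ARN:

```python
import json

# Hypothetical names: adjust the collection name and the KB service-role ARN.
COLLECTION = "rag-corpus"
KB_ROLE_ARN = "arn:aws:iam::123456789012:role/rag-kb-service-role"

# Grant the KB role index-level read/write on this collection only.
policy_document = [
    {
        "Description": "Knowledge Base access to the vector index",
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": [f"index/{COLLECTION}/*"],
                "Permission": [
                    "aoss:CreateIndex",
                    "aoss:DescribeIndex",
                    "aoss:ReadDocument",
                    "aoss:WriteDocument",
                ],
            }
        ],
        "Principal": [KB_ROLE_ARN],
    }
]


def create_policy():
    """Apply the policy; needs AWS credentials, so kept out of module scope."""
    import boto3

    client = boto3.client("opensearchserverless")
    client.create_access_policy(
        name=f"{COLLECTION}-kb-access",
        type="data",
        policy=json.dumps(policy_document),
    )
```

No collection-level admin permissions are granted — the KB only ever reads and writes documents in its own index.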

Reliability. API Gateway + Lambda are inherently multi-AZ. OpenSearch Serverless auto-rebalances OCUs. The S3 source bucket is versioned so a bad ingestion is recoverable.

Performance efficiency. Cold-start budget on the Lambda is ~600 ms; mitigated by provisioned concurrency on the hot Lambda when p99 latency targets demand it.

Cost optimization. OCU floor dominates at low traffic — consolidate multiple KBs in a single collection if you can. Move from Claude Sonnet to Haiku for routing or summarization sub-prompts.

Sustainability. Serverless throughout; idle capacity is minimised.

Trade-offs

Use this when:

  • Corpus is mostly text (Markdown, PDF, HTML, Word, simple tables).
  • Queries are natural language, retrieval is the main lever, and you want citations.
  • Team is small and ops capacity is limited.

Do NOT use this when:

  • You need joint relational + vector queries → Aurora pgvector or pgvector on RDS.
  • Documents are highly structured (forms, schemas) — a structured-query layer + embeddings hybrid will outperform.
  • Latency budget is < 1 s end-to-end at p99 — RetrieveAndGenerate typically lands at 1–3 s.
  • You need cross-KB blending or custom prompt templates per tenant — use Retrieve + manual synth instead.

Terraform skeleton

See terraform/. The skeleton creates the S3 bucket, the KB, the vector collection, the data access policies, and a starter Lambda. Naming, tags, IAM boundaries and state backend are intentionally omitted.