Experiments in Agentic AI, Part VII: Replacing OpenAI with Bedrock
by Alex Harvey
Introduction
In the previous post, I rebuilt the RAG pipeline on Azure, demonstrating that the core architecture — ingestion, indexing, and retrieval — is portable across cloud providers. The same containers, database, and application code could be reused with only modest changes to the infrastructure layer.
One dependency, however, remained constant across both implementations: OpenAI.
In this post, I replace that dependency with a fully AWS-native model layer using Amazon Bedrock. This reflects a broader industry trend, where enterprises are increasingly adopting Bedrock to integrate foundation models within existing AWS environments, using IAM-based access rather than external API keys.
Starting from the original AWS pipeline, I make the minimal changes required to replace OpenAI with Bedrock — keeping the rest of the system unchanged.
Swapping the Model Layer
The key point is that the architecture remains unchanged — only the model layer is replaced:
RAG Pipeline
───────────────────────────────────────────────────────────
Wikipedia
│
▼
Ingest Job
│
▼
Object Storage (S3 / Blob → S3)
│
▼
Indexer
│
├── Embeddings ───────────────▶ OpenAI
│ ↓
│ Amazon Bedrock (Titan)
│
▼
PostgreSQL + pgvector
│
▼
FastAPI Retrieval API
│
├── Query Embedding ──────────▶ OpenAI
│ ↓
│ Amazon Bedrock (Titan)
│
└── Answer Generation ────────▶ ChatGPT / Azure OpenAI
↓
Amazon Bedrock (Claude)
Setting up Amazon Bedrock
Before models can be invoked via Amazon Bedrock, two pieces of configuration are required: enabling model access (a manual step), and granting the correct IAM permissions.
Model access and use case submission
Access to Bedrock models is not enabled by default. For Anthropic models, AWS requires submission of a short use case form before access is granted. Navigate to Amazon Bedrock → Model catalog and request access to the required models.
For this project, I used:
- Anthropic Claude 3.5 Sonnet (answer generation)
- Amazon Titan embeddings (vector embeddings)
IAM permissions
In addition, the application (in ECS) needs to be granted permissions to invoke the model:
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "*"
}
As well as AWS Marketplace permissions:
{
"Effect": "Allow",
"Action": [
"aws-marketplace:Subscribe",
"aws-marketplace:ViewSubscriptions",
"aws-marketplace:Unsubscribe"
],
"Resource": "*"
}
Once these permissions are in place, the application can invoke Bedrock models using its IAM role, without requiring any API keys.
Consuming Bedrock in code
Once Bedrock access and IAM permissions are in place, the application changes are fairly localised. In practice, consuming Bedrock requires three things:
- configuring the Bedrock model IDs
- using Bedrock embeddings for retrieval
- using the Bedrock runtime client to call Claude for answer generation
Configuration
The first step was to replace the old provider-oriented configuration with a small Bedrock-specific settings module:
import os
AWS_REGION = "ap-southeast-2"
BEDROCK_CHAT_MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"
BEDROCK_EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
DB_HOST = os.environ["DB_HOST"]
DB_PORT = 5432
DB_NAME = "postgres"
DB_USER = os.environ["DB_USER"]
DB_PASSWORD = os.environ["DB_PASSWORD"]
PGVECTOR_SCHEMA = "public"
PGVECTOR_TABLE = "data_wiki_rag_nodes"
EMBED_DIM = int(os.environ["EMBED_DIM"])
TOP_K = 5
TEMPERATURE = 0.2
MAX_TOKENS = 512
Query embeddings with Titan
On the retrieval side, the API now uses BedrockEmbedding from LlamaIndex to embed incoming queries:
from functools import lru_cache
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.vector_stores.postgres import PGVectorStore
from . import config
@lru_cache(maxsize=1)
def get_embedding_model() -> BedrockEmbedding:
return BedrockEmbedding(
model_name=config.BEDROCK_EMBED_MODEL_ID,
region_name=config.AWS_REGION,
)
@lru_cache(maxsize=1)
def get_vector_store() -> PGVectorStore:
return PGVectorStore.from_params(
host=config.DB_HOST,
port=config.DB_PORT,
database=config.DB_NAME,
user=config.DB_USER,
password=config.DB_PASSWORD,
table_name=config.PGVECTOR_TABLE,
schema_name=config.PGVECTOR_SCHEMA,
embed_dim=config.EMBED_DIM,
)
@lru_cache(maxsize=1)
def get_retriever() -> BaseRetriever:
embed_model = get_embedding_model()
Settings.embed_model = embed_model
vector_store = get_vector_store()
index = VectorStoreIndex.from_vector_store(
vector_store=vector_store,
embed_model=embed_model,
)
return index.as_retriever(similarity_top_k=config.TOP_K)
Calling Claude via Bedrock
For answer generation, the OpenAI chat completion call was replaced with a Bedrock runtime client and a converse(...) request:
from functools import lru_cache
from typing import Any, Mapping
import boto3
from .config import AWS_REGION, BEDROCK_CHAT_MODEL_ID, MAX_TOKENS, TEMPERATURE
@lru_cache(maxsize=1)
def get_bedrock_client():
return boto3.client("bedrock-runtime", region_name=AWS_REGION)
def answer_with_evidence(
question: str,
evidence_items: list[Mapping[str, Any]],
) -> str:
evidence_block = "\n\n".join(
f"[{i+1}] {item['page']} — {item['section']}\n"
f"URL: {item['url']}\n"
f"Excerpt: {item['excerpt']}"
for i, item in enumerate(evidence_items)
)
system_prompt = (
"You are a careful assistant answering questions using ONLY the provided "
"evidence excerpts from Wikipedia. If the evidence is insufficient, say "
"you do not know. Always cite evidence items like [1], [2]."
)
user_prompt = f"Question: {question}\n\nEvidence:\n{evidence_block}"
response = get_bedrock_client().converse(
modelId=BEDROCK_CHAT_MODEL_ID,
system=[{"text": system_prompt}],
messages=[
{
"role": "user",
"content": [{"text": user_prompt}],
}
],
inferenceConfig={
"maxTokens": MAX_TOKENS,
"temperature": TEMPERATURE,
},
)
content = response["output"]["message"]["content"]
text_parts = [part.get("text", "") for part in content if "text" in part]
return "\n".join(part for part in text_parts if part).strip()
The API endpoint
The FastAPI layer remained almost unchanged:
from fastapi import FastAPI, Query
from .llm import answer_with_evidence
from .models import AskResponse
from .retrieval import retrieve
app = FastAPI()
@app.get("/health")
def health():
return {"status": "ok"}
@app.get("/query", response_model=AskResponse)
def query(q: str = Query(..., min_length=1, max_length=2000)):
evidence = retrieve(q)
answer = answer_with_evidence(q, evidence)
return {"answer": answer, "evidence": evidence}
The indexer
The indexing side was similarly reduced to a Bedrock-only embedding path:
from llama_index.embeddings.bedrock import BedrockEmbedding
from indexer import settings
def get_embedding_model() -> BedrockEmbedding:
return BedrockEmbedding(
model_name=settings.BEDROCK_EMBED_MODEL_ID,
region_name=settings.AWS_REGION,
)
and then:
from llama_index.core import Settings
from indexer.providers import get_embedding_model
def configure_embeddings() -> BedrockEmbedding:
embed_model = get_embedding_model()
Settings.embed_model = embed_model
return embed_model
Conclusion
With Bedrock in place, the pipeline now runs entirely within AWS, from ingestion through to answer generation. The model layer is no longer an external dependency, and access is managed through IAM alongside the rest of the system.
From a code perspective, the changes were minimal — embedding and generation calls were swapped out, while the overall RAG architecture remained unchanged.
In the next part of the series, I’ll build on this same pipeline to explore more agent-like behaviour on top of it.
tags: agentic-ai - rag