AI agents & RAG
Production AI agents and retrieval pipelines that survive contact with real users.
What we ship
An agent or RAG feature that handles ten thousand real queries a week without melting. The whole thing: prompt strategy, retrieval, tool surface, rate limiting, observability, eval harness, cost router.
Model-agnostic by design. Claude, GPT, Gemini, open-weights via Bedrock or Vertex, all routed by request shape so the average request hits the cheapest model that can answer it. Most engagements cut per-session cost six to ten times versus a 'send everything to Opus' baseline.
Where teams usually break
Vector store with no tenant filter. CSV upload that overrides the system prompt. Embeddings recomputed every chat turn. SQL tool exposed to the model. No max-steps cap, so a prompt injection eats your token budget.
We have shipped fixes for every one of these in production. The patterns are in the field-notes blog. The work is in your codebase.
Stack we have in production
Claude (Anthropic API, Bedrock), GPT-5 and GPT-4o (OpenAI, Azure OpenAI), Gemini (Vertex), open-weights via Bedrock or vLLM. Retrieval on Azure AI Search, AWS Knowledge Bases, Vertex Vector Search, or pgvector. Embeddings via voyage-3, text-embedding-3, or gte. Frameworks: Claude Agent SDK, LangGraph, LlamaIndex, or hand-rolled when the abstraction costs more than it saves.
Deliverables, by line item.
- Working agent or RAG endpoint deployed to your cloud
- Retrieval pipeline with tenant isolation and cache
- Tool surface defined and constrained (no raw SQL or shell)
- Eval harness that runs in CI on every change
- Cost-router middleware (Haiku for routing, Sonnet for reasoning, Opus where it earns it)
- Production runbook + on-call dashboard
