Program syllabus

Advanced Retrieval Engineering (RAG That Actually Works)

Hands-on retrieval engineering for teams who have moved past “hello-world RAG” and need systems that stay relevant, reliable, and measurable in production.

What we’ll cover

Use this to design a retrieval stack that is boring, reliable, and measurable — even as your content and models change.

Module 1

RAG foundations

  • When retrieval is the right tool (and when it isn’t)
  • End-to-end data flow: source → index → query → response
  • Typical failure modes in production RAG systems

Module 2

Data prep & chunking

  • Choosing units of retrieval that match how people ask
  • Chunking strategies for docs, tables, and FAQs
  • Metadata, sectioning, and cross-linking content

Module 3

Hybrid retrieval & ranking

  • Embedding vs. lexical vs. hybrid search tradeoffs
  • Re-ranking and scoring strategies
  • Latency, cost, and relevance balancing

1. RAG foundations that survive real users

We make the data flow explicit so everyone can see where failures can happen — and how to instrument them.

We start with your current assistant or RAG prototype and map how data actually moves: from sources, through ingestion and indexing, into queries, ranking, and responses. Then we compare that to the flows you'd want in production.

  • When you should not use RAG (and what to do instead)
  • Online vs. offline retrieval paths and fallbacks
  • How to think about latency, cost, and freshness
  • Common anti-patterns that look fine in a demo but fail with real traffic

Artifact: RAG system map

We sketch a simple diagram that shows how queries flow through your stack:

  • Content sources and ingestion jobs
  • Indexes, stores, and caches
  • Query → retrieve → rank → generate loop

This becomes the reference diagram for both engineering and non-technical stakeholders when you discuss reliability or new features.
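The query → retrieve → rank → generate loop above can be sketched end to end. This is a minimal, self-contained illustration: the toy word-overlap index, pass-through ranker, and string-building "generator" are stand-ins for your real search index, re-ranker, and LLM client.

```python
# Toy corpus standing in for an indexed content store.
DOCS = [
    {"id": "d1", "text": "Reset your password from the account settings page"},
    {"id": "d2", "text": "Invoices are emailed on the first day of each month"},
    {"id": "d3", "text": "To reset two-factor auth contact support"},
]

def tokenize(s):
    return set(s.lower().split())

def retrieve(query, top_k=3):
    """Score docs by word overlap with the query (stand-in for a real index)."""
    q = tokenize(query)
    scored = sorted(DOCS, key=lambda d: len(q & tokenize(d["text"])), reverse=True)
    return [d for d in scored[:top_k] if q & tokenize(d["text"])]

def rank(query, docs, keep=2):
    """Stand-in for a re-ranker: keep the top `keep` candidates as-is."""
    return docs[:keep]

def generate(query, docs):
    """Stand-in for the LLM call: show the context the model would see."""
    return f"Q: {query} | context: " + " / ".join(d["id"] for d in docs)

def answer(query):
    return generate(query, rank(query, retrieve(query)))
```

Each stage is a seam where you can add instrumentation, caching, or fallbacks without touching the others.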

2. Data prep, chunking, and metadata

We design content pipelines around how people actually ask questions, not just how documents are stored today.

Most RAG failures start before a single query is run. We focus on units of retrieval: what exactly do you want the model to see, and how do you make that unit easy to find?

  • Choosing chunk sizes and overlaps by content type
  • Special handling for tables, FAQs, how-tos, and reference docs
  • Designing metadata and sectioning that enable precise filtering
  • Strategies for keeping indexes fresh without constant full-rebuilds
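The first bullet above can be made concrete with a minimal sketch of fixed-size chunking with overlap, over whitespace tokens. Real pipelines usually split on section boundaries first and attach metadata (source, section, doc type) to each chunk; the sizes here are illustrative defaults, not recommendations.

```python
def chunk(text, size=200, overlap=40):
    """Split text into word-based chunks of `size` words, overlapping by `overlap`."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last chunk reached the end of the text
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some index bloat.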

Artifact: retrieval-ready schema

Together we define a target schema for your retrieval store (vector, search index, or both):

  • Fields for text, metadata, and relationships
  • Required vs. optional attributes
  • Example documents before and after transformation

This schema guides ingestion work and avoids one-off pipelines for every new content source.
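One possible shape for such a retrieval-ready record, sketched as a dataclass. Field names and types here are illustrative assumptions; the engagement defines the actual schema for your store.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RetrievalRecord:
    chunk_id: str                   # stable, unique per chunk (required)
    text: str                       # the unit of retrieval (required)
    source_uri: str                 # where the chunk came from (required)
    doc_type: str                   # e.g. "faq", "how-to", "reference", "table"
    section_path: List[str] = field(default_factory=list)  # e.g. ["Billing", "Refunds"]
    updated_at: Optional[str] = None          # ISO timestamp; drives freshness checks
    embedding: Optional[List[float]] = None   # filled in at index time
```

Keeping required fields small and optional fields explicit makes it obvious what every ingestion job must produce versus what a richer source may add.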

3. Hybrid retrieval, ranking, and fallbacks

We go beyond "top-k from the vector store" to a layered retrieval strategy that balances relevance, latency, and cost.

  • When to use lexical, dense, or hybrid retrieval
  • Configuring top-k, thresholds, and filters sanely
  • Re-ranking strategies: LLM re-rank, cross-encoders, signals from clicks and feedback
  • Caching, fallbacks, and "no good answer" paths
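One common way to combine lexical and dense results is reciprocal rank fusion (RRF): merge the two lists by rank instead of raw scores, since BM25 and cosine scores are not directly comparable. A minimal sketch, with illustrative doc ids:

```python
def rrf_merge(result_lists, k=60):
    """Fuse ranked lists of doc ids via reciprocal rank fusion; best-first output.

    k=60 is the constant conventionally used in the RRF literature.
    """
    scores = {}
    for results in result_lists:
        for position, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]  # e.g. from BM25 / keyword search
dense = ["d1", "d5", "d3"]    # e.g. from the vector store
fused = rrf_merge([lexical, dense])
```

Documents that appear high in both lists win; documents found by only one retriever still survive, which is the point of going hybrid.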

Templates you can reuse

  • Example hybrid retrieval pipeline configuration
  • Fallback matrix for different failure modes
  • Checklist for shipping retrieval changes safely

4. Evaluation, regression testing, and observability

We wire retrieval into tests and dashboards so you can safely iterate on prompts, models, and indexes.

Retrieval quality drifts over time as content, users, and models change. We design lightweight guardrails so you catch issues before customers do.

  • Building small but representative golden sets
  • Measuring retrieval quality separately from generation
  • Telemetry and traces for end-to-end retrieval flows
  • Dashboards that highlight relevance regressions
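Measuring retrieval separately from generation can be as simple as recall@k over a small golden set, so index or chunking changes can be judged on their own before any LLM is involved. A sketch with illustrative doc ids:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant chunk ids that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

golden = [
    {"relevant": ["d1"], "retrieved": ["d1", "d3", "d2"]},  # hit within top-2
    {"relevant": ["d2"], "retrieved": ["d3", "d1", "d2"]},  # miss within top-2
]
avg_recall = sum(recall_at_k(c["retrieved"], c["relevant"], 2) for c in golden) / len(golden)
```

Tracking this number per release is what turns "retrieval feels worse" into a regression you can bisect.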

Artifact: retrieval quality workbook

We assemble a small evaluation workbook for one of your assistants:

  • 20–50 queries with expected behaviors
  • Baseline metrics you can rerun over time
  • A simple process for adding new test cases
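One way to make such a workbook executable (a sketch, with hypothetical queries and chunk ids): each case stores a query and the chunk ids that must come back, and a pass-rate threshold turns the set into a regression gate you can rerun after any prompt, model, or index change. `retrieve` below stands in for whatever function fronts your retrieval stack.

```python
GOLDEN_SET = [
    {"query": "how do I reset my password", "must_retrieve": ["kb/reset-password"]},
    {"query": "when are invoices sent", "must_retrieve": ["kb/billing-schedule"]},
]

def run_workbook(retrieve, golden_set, k=5, threshold=0.9):
    """Return (pass_rate, ok); ok is False when the pass rate dips below threshold."""
    passed = sum(
        1 for case in golden_set
        if all(doc_id in retrieve(case["query"], k) for doc_id in case["must_retrieve"])
    )
    rate = passed / len(golden_set)
    return rate, rate >= threshold
```

Adding a new test case is then just appending a dict, which keeps the "simple process for adding new test cases" genuinely simple.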

Ready to ship RAG that actually works?

We typically run this as a 1–2 week engagement anchored on one high-value assistant or workflow. You bring real data and failure modes; we bring patterns, templates, and a ruthless focus on production behavior.

By the end, you'll have a retrieval design, evaluation plan, and observability story your team can own and iterate on.

Talk to us about this program · View all team programs

This pairs especially well with Workflow AI Design Blueprint and AI Assistant Observability & SLOs for a full stack of reliability-focused AI capabilities.