hugo palma.work
Thought Logged Feb 17, 2026

Learning RAG while benchmarking it

I sat through a technical interview for a Generative AI Engineer position a few days ago. One question: "Design a RAG pipeline." Never touched RAG. Never built one, never read about one, never used one. So I did what I always do when I don't know something. I reasoned from what I did know.

My proposal was to index documents by keyword density and use that as a pre-filter before tokenizing anything and sending it to the LLM. Don't send everything: send what's relevant, and figure out relevance cheaply before burning tokens. Turns out what I described is essentially hybrid search, an industry best practice in RAG systems. I just didn't know the name for it.

None of that registered with the automated evaluation. It pattern-matched for vocabulary I didn't have ("embeddings", "vector database", "cosine similarity") and moved on.

So I built the whole thing from scratch. Not to prove the bot wrong, but to find out where my reasoning holds up and where it breaks. Every prediction I make before running the numbers is a falsifiable claim. This article is the result.

What RAG Actually Is

RAG stands for Retrieval-Augmented Generation. Simple idea: instead of sending your entire dataset to an LLM (expensive, often impossible), you first retrieve the most relevant pieces, then send only those as context for the LLM to generate an answer.

Four steps in the pipeline:

  1. Load: Read your documents into memory
  2. Chunk: Split them into smaller pieces (because embedding models have token limits and smaller pieces are more precise)
  3. Embed: Convert each chunk into a numerical vector that captures its meaning
  4. Query: Convert your question into a vector, find the closest chunks by cosine similarity, feed them to an LLM

That's the theory. Practice is where it gets interesting.
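The four steps can be sketched end-to-end in a few lines. This is a toy: the `embed` function below is a bag-of-words stand-in for a real embedding API, and the chunks are hand-written, but the load → chunk → embed → query shape is the same.

```python
import math

# Toy stand-in for an embedding model: bag-of-words counts over a
# fixed vocabulary. A real pipeline calls an embedding API here.
VOCAB = ["python", "ml", "senior", "remote", "sql"]

def embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Load + 2. Chunk (pre-split chunks here for brevity)
chunks = [
    "Senior ML engineer, Python required, remote",
    "SQL analyst role, on-site",
]

# 3. Embed every chunk once, up front
index = [(c, embed(c)) for c in chunks]

# 4. Query: embed the question, rank chunks by cosine similarity
query_vec = embed("remote python ml job")
ranked = sorted(index, key=lambda p: cosine(query_vec, p[1]), reverse=True)
context = ranked[0][0]  # best chunk becomes the LLM's context
```

Swap `embed` for a real model and `chunks` for 50,000 real pieces and that's the whole pipeline.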

The Dataset

I had 5,559 job listing markdown files sitting in a folder. These came from a scraper I built for a previous project, a resume autogenerator that matches your resume to job descriptions. All HTML had already been stripped during scraping to save database space and reduce token count for the ATS evaluation pipeline. That decision had nothing to do with RAG, but it ended up helping.

Each .md file has a somewhat structured layout (company name, role title, description, requirements, benefits) but also a lot of noise. Copilot analysis tags, LinkedIn boilerplate, equal opportunity statements, redirect URLs. Stuff that exists in every job listing and carries zero semantic value.

I knew this going in. What I didn't know was how much it would matter.

The Methodology

Before writing a single line of benchmark code, I set some ground rules. Without isolating variables properly, the numbers would mean nothing.

One LLM, No Exceptions

Every query across every run goes through Claude Opus via the same prompt template. Never changes. It sees whatever chunks the retrieval step gives it and generates an answer. Any difference in answer quality comes from the retrieval, not the generation.

Five Embedding Models

I tested 5 embedding models: 3 paid APIs and 2 open-weight models running on HuggingFace's Inference API. Same 5,559 files, same chunking strategy (1000 chars, 200 overlap), same questions. Only thing that changes is which model turns the text into vectors.

Model                    Dimensions  Cost / 1M tokens  Provider
gemini-embedding-001     3072        $0.15             Google
text-embedding-3-small   1536        $0.02             OpenAI
text-embedding-3-large   3072        $0.13             OpenAI
e5-large-instruct        1024        free              HuggingFace API
bge-base-en-v1.5         768         free              HuggingFace API

Dimensions range from 768 to 3072. Cost ranges from free to $0.15 per million tokens. Does paying more actually buy you better retrieval?

Threshold, Not Top-K

Most RAG tutorials use a fixed top-k (send the 5 or 10 most similar chunks). I started there and it was terrible. Five chunks out of 50,000 is 0.01% of the data: looking through a keyhole while trying to understand a job market.

Instead, I use a cosine similarity threshold. Every chunk above 0.80 similarity gets included, up to a safety cap of 200. Some queries get 20 chunks, others get 200, depending on how much relevant content exists. Retrieval adapts to the question instead of forcing an arbitrary number.
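Threshold-plus-cap retrieval is a few lines on top of plain cosine similarity. A minimal pure-Python sketch (a real pipeline would vectorize this with numpy; `retrieve` is a hypothetical helper, not the pipeline's actual function):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, threshold=0.80, cap=200):
    """Adaptive retrieval: keep every chunk above the similarity
    threshold, capped so one broad query can't flood the context."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    hits = [(s, c) for s, c in scored if s >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:cap]
```

A narrow query clears the threshold with 20 chunks, a broad one hits the 200 cap; the window sizes itself.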

18 Semantic Queries

After learning the hard way that counting questions don't work (more on that later), I designed 18 queries across 6 categories. Every question requires the LLM to reason, not just look up a fact:

Benchmark Query Categories

  • Synthesis (3): Combine information from multiple chunks into one coherent answer. "What does a typical senior ML engineer role look like day-to-day?"

  • Comparison (3): Identify differences between role types. "How do junior versus senior AI roles differ in what they expect?"

  • Inference (3): Draw conclusions from indirect signals. "Which roles seem to expect someone who can work independently?"

  • Pattern (3): Spot recurring themes. "What soft skills keep appearing across AI engineering descriptions?"

  • Nuanced Retrieval (3): Test semantic understanding over keyword matching. "Find roles focused on data quality and pipeline reliability rather than model building."

  • Analysis (3): Form opinions about the content. "Which roles seem the hardest to fill and why?"

One of the analysis queries is a deliberate trap: "Which job descriptions seem the most well-written versus vague and generic?" RAG retrieves by similarity to the query, so it'll find chunks that talk about writing quality, not a diverse sample of good and bad writing. I kept it in because seeing how each model handles an unfair question is itself useful data.

Two Axes

This isn't just "run 5 models and compare." It's a two-axis experiment:

  1. Baseline: Embed the raw .md files as-is, with all the noise, weird tags, Copilot output, boilerplate. No data treatment. This is run zero.
  2. Progressive cleaning: Strip boilerplate, extract structured fields (company, title, location, skills), improve chunk boundaries. Re-embed after each cleaning step and re-benchmark.

Each model gets its own baseline, then its own cleaned runs. How much does cleaning improve each model? Can a cheap model with clean data beat an expensive model with dirty data?

Building the Pipeline

Stack: Python, LangChain for text splitting, and a model registry that lets me swap embedding providers with a single flag. Chunking uses a two-pass strategy: first split by markdown headers (#, ##, ###), then split those sections into 1000-character pieces with 200-character overlap.
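The real pipeline uses LangChain's splitters, but the two-pass idea can be sketched without dependencies (the header regex and hard character slicing are simplifications of what LangChain actually does):

```python
import re

def two_pass_chunk(markdown, chunk_size=1000, overlap=200):
    """Pass 1: split on markdown headers (#, ##, ###) so chunks don't
    straddle sections. Pass 2: slice each section into fixed-size
    windows with overlap so boundaries don't lose context."""
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    step = chunk_size - overlap
    for section in sections:
        section = section.strip()
        if not section:
            continue
        for start in range(0, len(section), step):
            chunks.append(section[start:start + chunk_size])
            if start + chunk_size >= len(section):
                break
    return chunks
```

Each 1000-character chunk shares its last 200 characters with the next chunk's first 200, which is what the overlap parameter buys you.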

5,559 markdown files produced 51,545 chunks. Each chunk gets sent to the embedding API, which returns a vector (a list of floats representing the text's meaning in high-dimensional space). Vectors get saved to a pickle file so I don't have to re-embed every time I query.

# Embed all files with a specific model
python rag_pipeline.py --step all --limit 0 --run-name gemini_baseline
python rag_pipeline.py --step all --limit 0 --run-name openai_small_baseline --embed-model text-embedding-3-small

# Benchmark a run with all 18 queries
python benchmark.py --run gemini_baseline --embed-model gemini

Batching with checkpoints (so a crash at minute 15 doesn't lose everything), retry with backoff for rate limits, incremental result saving during benchmarks. I learned all of these the hard way, by losing runs.
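The checkpoint-and-retry machinery looks roughly like this. A sketch, not the pipeline's actual code: `RuntimeError` stands in for whatever rate-limit exception your provider raises, and the JSON checkpoint format is an assumption.

```python
import json
import random
import time
from pathlib import Path

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for a provider rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.random())

def embed_with_checkpoints(chunks, embed_fn, path="checkpoint.json", every=100):
    """Resume-friendly embedding: flush partial results every `every`
    chunks so a crash loses at most one batch, not the whole run."""
    ckpt = Path(path)
    done = json.loads(ckpt.read_text()) if ckpt.exists() else {}
    for i, chunk in enumerate(chunks):
        key = str(i)
        if key in done:
            continue  # already embedded in a prior run
        done[key] = with_backoff(lambda: embed_fn(chunk))
        if (i + 1) % every == 0:
            ckpt.write_text(json.dumps(done))
    ckpt.write_text(json.dumps(done))
    return [done[str(i)] for i in range(len(chunks))]
```

Re-running after a crash skips everything already in the checkpoint, which is the difference between losing 15 minutes and losing nothing.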

The First Thing I Got Wrong

My first benchmark had 18 questions. Nine of them started with "How many." As in: "How many jobs require Python?", "How many jobs mention RAG?", "How many are pure software development?"

Counting questions. Database questions. grep questions. I was asking an LLM to count across 5,559 files while only showing it a similarity-filtered slice of maybe 200 chunks. Answers were technically correct given the context, and completely useless.

It saw 5 chunks that mentioned Python and answered "at least 5 jobs require Python." When asked about pure software dev roles, it saw 5 ML-related chunks and said "Zero." Salary? The chunks it received didn't have compensation info, so it said "None of the listings mention salary."

I would have gotten better results with:

find jobs/ -name "*.md" -exec grep -l "Python" {} \; | wc -l

What RAG Is Good For vs What It's Not

  • Not good for: Counting, aggregation, filtering, exact lookups. Anything deterministic. These are database problems. Use SQL.

  • Good for: Synthesis, comparison, inference, pattern recognition. Questions that require reasoning across unstructured text. Things you can't grep for.

The Boilerplate Problem

Before running the benchmark, I analyzed what was actually in my chunks:

  • 5.4% of chunks (2,778 out of 51,545) had 3 or more boilerplate patterns
  • 11% contained AI REASONING or Copilot tags from my resume autogenerator
  • 1,792 had LinkedIn's "DESCRIPTION About the job" boilerplate header
  • 599 had LinkedIn redirect URLs eating up characters for zero value
  • Hundreds with equal opportunity statements, E-Verify text, generic benefits

Every one of these chunks costs tokens to embed, takes up space in vector memory, and competes for retrieval slots when you query. A boilerplate chunk scoring 0.85 similarity can push a genuinely relevant chunk at 0.83 out of the results.
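Noise like this is cheap to detect before it ever reaches an embedding API. A sketch of the pattern-counting approach, with a hypothetical pattern list standing in for the real one:

```python
import re

# Hypothetical patterns; the real list came from eyeballing the corpus.
BOILERPLATE_PATTERNS = [
    r"equal opportunity employer",
    r"e-verify",
    r"about the job",
    r"linkedin\.com",
    r"ai reasoning",
]

def boilerplate_score(chunk):
    """Count how many known boilerplate patterns appear in a chunk."""
    text = chunk.lower()
    return sum(bool(re.search(p, text)) for p in BOILERPLATE_PATTERNS)

def is_noisy(chunk, min_hits=3):
    # The 3-pattern cutoff mirrors the 5.4% figure above.
    return boilerplate_score(chunk) >= min_hits
```

Running this over all 51,545 chunks before embedding is a few seconds of regex work versus thousands of wasted API calls.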

Flat Similarity

I expected cosine similarity scores to spread out. Highly relevant chunks near 0.95, irrelevant ones near 0.5, with a clear boundary to threshold on. Not what happened.

With Gemini embeddings, everything clustered in the 0.79 to 0.85 band. Best match for a query might be 0.83. Match #200 was 0.80. Setting a threshold at 0.80 either captured 18 chunks or 3,000 chunks depending on the query, with no middle ground.

Embeddings were too flat. They couldn't separate signal from noise because the noise (boilerplate) shares vocabulary with real job content. "Experience" appears in both "5+ years Python experience" and "equal opportunity regardless of experience level." When all the surrounding context is job-listing language, the embedding model can't tell the difference.

Benchmark Results: Baseline

All five models embedded the same 51,545 chunks from 5,559 raw markdown files. No cleaning, no preprocessing beyond the two-pass chunking. Same 18 queries, same 200-chunk cap, same LLM.

Model                    Queries  Avg Top-1  Spread  Avg Tokens
e5-large-instruct        18       0.879      0.025   25,992
gemini-embedding-001     18       0.848      0.040   24,885
bge-base-en-v1.5         18       0.742      0.073   33,936
text-embedding-3-small   18       0.598      0.097   38,496
text-embedding-3-large   18       0.571      0.096   37,895

First thing that jumps out: these similarity scores are not on the same scale. E5 averages 0.879 top-1 while OpenAI Large sits at 0.571. That doesn't mean E5 is better at retrieval. Different embedding models produce different similarity distributions. What matters is the spread, how well each model separates the best match from the worst within its own range.

OpenAI's models have spreads of 0.096-0.097. Real daylight between best and worst matches. BGE sits at 0.073. E5's spread is 0.025, tighter than anything else, with the top match barely distinguishable from chunk #200. That's the flat similarity problem I described earlier, and some models suffer from it way more than others.
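The spread metric in these tables is simple: best match minus worst match within one query's retrieval window.

```python
def retrieval_stats(similarities):
    """Summarize one query's retrieval window: best match (top-1),
    worst match kept (floor), and the gap between them (spread)."""
    window = sorted(similarities, reverse=True)
    return {
        "top1": window[0],
        "floor": window[-1],
        "spread": window[0] - window[-1],
    }
```

A spread of 0.025 means chunk #1 and chunk #200 are nearly indistinguishable to the model, which is exactly E5's baseline problem.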

Fig 1. Average top-1 cosine similarity per query category across all five embedding models.

Category breakdown shows where retrieval quality diverges most. Nuanced retrieval (questions requiring semantic understanding over keyword matching) is where E5 and Gemini pull furthest ahead. Analysis category has high variance too, partly because the trap question about "well-written vs generic" job descriptions drags down averages for models that take its bait.

Latency averaged 25-34 seconds end-to-end per query, with LLM generation eating 70-80% of that. Embedding lookup itself is fast. Generation is slow. Token usage tells an interesting story too: E5 and Gemini averaged ~25-26K tokens per query while OpenAI and BGE sent ~34-38K. More tokens means the retrieval is pulling in more context because it can't tell what's relevant from what's noise.

Embedding Model Comparison

Now cost enters the picture. Two of these models are free, one costs pennies, and two cost real money. Does paying more buy meaningfully better retrieval?

Model                    Dims  Cost/1M  Avg Top-1  Avg Floor  Spread  Avg Latency
e5-large-instruct        1024  free     0.879      0.853      0.025   25.9s
gemini-embedding-001     3072  $0.15    0.848      0.808      0.040   32.0s
bge-base-en-v1.5         768   free     0.742      0.669      0.073   26.5s
text-embedding-3-small   1536  $0.02    0.598      0.501      0.097   28.6s
text-embedding-3-large   3072  $0.13    0.571      0.475      0.096   34.1s

On raw similarity scores, the free models win. E5-large-instruct (free, 1024 dims) beat everything including Gemini at $0.15 per million tokens with 3x the dimensions. BGE (free, 768 dims) outperformed both OpenAI models despite having fewer dimensions than either.

But raw similarity is only half the story. Chart below shows each model's top-1 similarity alongside its floor similarity. Gap between them is the spread: how well the model discriminates between best and worst matches in its retrieval window.

Fig 2. Top-1 vs floor similarity for each model. The gap between bars is the spread. Wider gaps indicate stronger discrimination between relevant and irrelevant chunks. Hover for exact spread values.

E5 has the highest bars but the narrowest gap between them of all five models. Gemini has a more comfortable spread. BGE shows a surprisingly wide gap for a free model. OpenAI models sit lower on the absolute scale but show the widest gaps, which could mean better discrimination or just noisier embeddings. Baseline alone can't tell the difference.

Cleaning will test whether that discrimination matters. If a model with tight clustering gets better with clean data (chunks separate more, spread widens) it was being held back by noise. If a model with wide spread doesn't improve much, it was already coping with the mess. That answer determines which model you'd actually pick for production.

Progressive Data Cleaning

Cleaning stripped AI Analysis sections, Interview Insights, equal opportunity statements, benefits boilerplate, LinkedIn redirect URLs, and Copilot tags. Result: 35% fewer chunks (51,545 down to 33,409 for HuggingFace models, 36,523 for Gemini due to its different tokenizer). Same 5,559 files, same chunking strategy, just less noise per file.

My prediction going in was simple: removing boilerplate should widen spreads (better discrimination) even if top-1 scores drop slightly, because the model no longer wastes retrieval slots on chunks that match on shared vocabulary instead of actual content.

What actually happened:

Model                    Baseline Top-1  Clean Top-1  Baseline Spread  Clean Spread  Baseline Tokens  Clean Tokens
e5-large-instruct        0.879           0.872        0.025            0.034         25,992           36,099
gemini-embedding-001     0.848           0.850        0.040            0.045         24,885           27,770
bge-base-en-v1.5         0.742           0.724        0.073            0.069         33,936           44,365
text-embedding-3-small   0.598           0.582        0.097            0.104         38,496           44,475
text-embedding-3-large   0.571           0.561        0.096            0.099         37,895           43,583

Top-1 scores barely moved. E5 dropped 0.007, Gemini actually went up 0.002, BGE dropped 0.018. In absolute terms, cleaning didn't make retrieval "better" by the most obvious metric. Wrong number to look at though.

Spreads tell the real story. E5 went from 0.025 to 0.034, a 36% increase in discrimination. Its FLAT query count dropped from 6/18 to just 1/18. The model that was most compressed in baseline benefited the most from cleaning. With the noise removed, E5 can actually tell the difference between a highly relevant chunk and one that just happens to share job-listing vocabulary.

Gemini widened from 0.040 to 0.045. OpenAI Small from 0.097 to 0.104. Models that already had wide spreads didn't move much. They were already coping with the mess by distributing similarity scores more broadly.

BGE is the odd one out: its spread actually narrowed from 0.073 to 0.069. One possible explanation: BGE was using boilerplate content productively on some queries (the benefits question Q12, Amazon mentorship chunks on Q14), and removing that content removed signal for those specific queries. Not all boilerplate is pure noise if the question is about the boilerplate topic.

The Token Paradox

Counterintuitive part: every model used more tokens per query after cleaning. E5 jumped from 25,992 to 36,099 tokens (+39%). BGE went from 33,936 to 44,365 (+31%). Less data in, more tokens out.

Makes sense once you think about it. With 35% fewer chunks but the same 200-chunk retrieval cap, each chunk now contains denser actual content. In baseline, many of those 200 slots were occupied by boilerplate chunks (short, repetitive, low-token). After cleaning, those slots fill with real job descriptions that are longer and more substantive. Same size retrieval window, better stuff filling it.

Fig 3. Spread comparison between baseline and cleaned data. Wider spread = better discrimination between relevant and irrelevant chunks. E5 showed the largest improvement.

What Cleaning Actually Did

  • Top-1 barely moves: Best match for any query is roughly the same with or without boilerplate. The best chunk was already good.

  • Spread is what improves: Distance between best and worst retrieved chunks widens, meaning the model discriminates better. E5's jump from 0.025 to 0.034 moved it from "flat" to "functional."

  • Tokens go up, not down: Denser content per chunk means more tokens per query. LLM sees more real information per retrieval window.

  • 35% fewer vectors, same quality: You can store and search a third less data with no meaningful loss in retrieval accuracy. Direct infrastructure savings.

The Verdict

RAG does not replace data engineering. It shifts the retrieval problem from SQL to cosine similarity, but garbage in, garbage out still applies. Clean, structured data produces better embeddings, better retrieval, and better answers.

And you are still losing accuracy. In a perfect world with an infinite context window and infinite budget, you would just send everything and get better results. RAG is a lossy compression of your dataset. You select a subset based on vector similarity and hope the relevant stuff floats to the top. By definition, you are discarding information.

Tradeoff is in the wallet. RAG lets you query datasets that would never fit in a context window, at a fraction of the cost of processing everything. But you pay for that with precision. And the only way to recover that precision is to treat your data before it ever touches an embedding model.

Which is exactly what I said in the interview, without knowing any of the words for it.

What I Actually Learned

Two things jumped out that I didn't expect going in.

First: my dataset was too homogeneous. 5,559 job listings all follow the same structure, use the same vocabulary, and talk about the same industry. When everything looks similar, the embedding model can't separate a great match from a decent one. That's why the spreads were so tight. Not a model problem. Data problem. RAG shines when your corpus is diverse: different topics, different authors, different time periods. A collection of emails, research papers, or support tickets would stress-test retrieval in ways that 5,559 copies of roughly the same document never could.

Second: I was sending way too much context. A 200-chunk retrieval cap means the LLM sees so much data that precision stops mattering. At that volume, there's enough signal buried in the noise regardless of how good the retrieval is. That's why cleaning barely moved the needle. Models were brute-forcing their way to decent answers through sheer volume. With a top-5 or top-10 retrieval window, every slot matters, and the difference between a clean dataset and a dirty one would actually show up in the answers.

Both things combined explain why the cleaning round felt like a wash. Data was too similar for embeddings to discriminate, and the retrieval window was too large for discrimination to matter anyway. I was testing a scalpel by using it as a sledgehammer.

Pipeline itself is solid though. Batching, checkpoints, model swapping, benchmarking, the whole thing works. So I'm going to point it at something that will actually challenge it. I have about 2GB of archived email sitting on my machine. Different senders, different topics, different time periods spanning years. Asking "what did John say about the migration deadline" actually requires precise retrieval, not broad pattern matching across thousands of nearly identical documents.

That's the next experiment.

If you want to see the code, plus unfiltered results, check it out on GitHub.

End of journal

Status: ARCHIVED