hugo palma.work
Thought Logged Feb 9, 2026

I Let an AI Interview Me, Then Data-Analyzed My Own Answers

A first-person account of an AI-moderated technical interview for a Generative AI Engineer role, with a per-answer analysis that reveals what automated evaluation can and cannot measure.

7 Questions Analyzed · 15 Minutes of Audio · February 2026

Let's do this

In my last article, I described adding a second question alongside the ATS scoring prompt: instead of only asking "score me against this job," I started also asking "should I even bother applying?" The ATS score stayed for the public analysis. The new question runs privately, predicting realistic interview probability, accounting for pipeline friction, and classifying each listing as APPLY, APPLY_WITH_MODIFICATIONS, or SKIP.

This is the first real test of that system. It flagged a Generative AI Engineer position at a large tech transformation company as APPLY. The AI scored me well enough. So the application went out.

The next day, a Saturday morning, a WhatsApp message popped up. An AI bot was ready to interview me. Seven questions, timed responses, audio format. No human involved at any stage.

I hadn't read the job description. I hadn't prepared anything. I didn't even know which company it was for until the bot told me. But I was curious about the process. I had been studying, writing about, and building around AI hiring pipelines for over a month. And now I was inside one.

I pressed record and started talking.

The Format

The interview was entirely asynchronous. A WhatsApp bot sent a question, I recorded an audio response, and the bot moved to the next one. Seven questions total: two warmups about my background, one about LLM experience, and four technical stages covering production AI deployment, RAG pipeline design, prompt engineering, and multi-agent systems.

Each technical question had a 5-minute window. I averaged 2 minutes and 20 seconds per answer. Not because I ran out of things to say, but because I don't pad. When the point is made, I stop.

#  Topic                     Time Used  Available
1  Rewarding project         2:39       3:00
2  Professional journey      1:34       --
3  LLM experience            2:07       --
4  Production AI deployment  2:53       5:00
5  RAG pipeline design       2:38       5:00
6  Prompt engineering        2:22       5:00
7  Multi-agent AI systems    1:30       5:00

The RAG Question: Deriving a Best Practice in Real Time

This is the answer I want to dissect the most, because it shows exactly what automated evaluation misses.

The question asked me to design a Retrieval-Augmented Generation pipeline for a client with large document repositories. I had never built a RAG system. I had never even read about one in detail. I said that upfront: "I have never actually built a RAG. I know how it works, but I didn't get to it yet. But let's think about the problem."

Then I reasoned from constraints. The client has a large document repository. The goal is to retrieve relevant information without wasting compute. My approach: first, index the entire repository by keyword density to get a cheap, fast pre-filter. Only after narrowing down the search space would I tokenize and send content to the LLM context window. Then evaluate, iterate, and optimize.
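That staged approach can be sketched in a few lines. This is a minimal sketch, assuming a plain in-memory list of documents; the function names and toy corpus are mine, not from the interview.

```python
# Stage 1: cheap keyword-density pre-filter over the whole repository.
# Stage 2 (not shown): only the survivors get tokenized and placed
# into the LLM context window.
from collections import Counter

def keyword_score(doc: str, query: str) -> float:
    """Score a document by query-term density: cheap and fast."""
    words = doc.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[term] for term in query.lower().split()) / len(words)

def staged_retrieve(docs: list[str], query: str, k: int = 3) -> list[str]:
    """Rank the full repository with the cheap score, keep top-k."""
    ranked = sorted(docs, key=lambda d: keyword_score(d, query), reverse=True)
    return ranked[:k]

docs = [
    "invoice processing pipeline for enterprise clients",
    "holiday party photos and notes",
    "retrieval augmented generation design for client documents",
]
top = staged_retrieve(docs, "retrieval client documents", k=2)
```

The second stage, sending only the survivors to the model, is where the compute savings live. In production systems the first pass is typically BM25-style keyword retrieval and the second is semantic search over embeddings, but the cost-gating logic is the same.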

That approach has a name in the industry: hybrid search. It's considered a best practice in production RAG systems. The field started with pure vector (semantic) search, discovered it misses exact matches, and eventually added keyword-based retrieval as a first pass. The industry arrived at this pattern through years of production failures and iteration.

I arrived at it in 2 minutes and 38 seconds from cost reasoning alone.

What I missed was the vocabulary. I never said "embeddings," "vector database," "chunking," "semantic search," or "re-ranking." I also called my own approach "just me rambling," which didn't help. On a keyword-matching evaluation, this answer fails. On a reasoning evaluation, it's one of the strongest in the set.

The gap was vocabulary, not reasoning.

The only concept I couldn't have derived from first principles was embeddings: the idea that text can be represented as high-dimensional vectors for semantic similarity. That's a learned concept. Everything else in the hybrid search pattern, the staged retrieval, cost gating, evaluation loops, I independently derived.

The Production AI Answer: Where the Numbers Were

The strongest technical answer was about production AI deployment. I described my multi-model evaluation pipeline: how I used multiple LLMs to score the same job description, measured the spread between their outputs, and found a 26% standard deviation in classification across models. That was unusable. I needed the models to be interchangeable to save cost.

So I treated the prompt as a PID control problem. Proportional: measure the current error. Integral: track accumulated drift. Derivative: anticipate the direction. I parameterized the scoring rules, tuned them iteratively, and brought that 26% standard deviation down to 2.9%. Chi-Square validated, p < 0.001.
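That control loop can be sketched as a toy PID iteration. The "plant" here (how a prompt parameter maps to cross-model spread) is stubbed as a linear response; the gains and the `strictness` parameter are illustrative assumptions, not the actual pipeline.

```python
# Toy PID loop driving a prompt parameter toward a target spread.
def pid_step(error, prev_error, integral, kp=0.5, ki=0.1, kd=0.2):
    integral += error                  # I: accumulated drift
    derivative = error - prev_error    # D: anticipated direction
    return kp * error + ki * integral + kd * derivative, integral

def tune(target_std=2.9, current_std=26.0, steps=50):
    """Iteratively nudge a prompt 'strictness' parameter until the
    measured cross-model spread approaches the target."""
    strictness, integral, prev_error = 0.0, 0.0, 0.0
    for _ in range(steps):
        error = current_std - target_std
        delta, integral = pid_step(error, prev_error, integral)
        strictness += delta
        prev_error = error
        # Stub plant: stricter scoring rules shrink the spread linearly.
        current_std = max(target_std, 26.0 - 0.8 * strictness)
    return current_std
```

In the real pipeline each "measurement" is a batch of model scorings and each "actuation" is a change to the parameterized scoring rules, so every iteration costs API calls; the point of the PID framing is to converge in few iterations.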

PID-based prompt engineering isn't a thing anyone teaches. I applied it because that's my background: mechatronics, control theory, reinforcement learning. The concept transferred directly.

The weakness: I opened with slight frustration. "I can describe what I've already been describing throughout the entire interview." I had already covered parts of this in previous answers. Repeating myself felt like waste. But an automated evaluation scores each answer independently. It doesn't carry context from one answer to the next.

The Prompt Engineering Reframe

The question asked about optimizing prompts for an LLM in an enterprise use case. I challenged the premise. Prompting alone isn't enough. I escalated the solution space: prompting, then caching enterprise requirements as persistent context, then fine-tuning, then reinforcement learning with high-quality company-specific data.

I referenced double descent (the phenomenon where a model's test error first decreases, then increases, then decreases again as training continues past the interpolation threshold). I talked about convergence, overfitting thresholds, and when to apply reinforcement learning. These aren't API-level concepts. They're training-level concepts.

What I didn't do: mention a single concrete prompting technique. No few-shot, no chain-of-thought, no system prompts, no structured output. The question asked specifically about prompting. I gave a model optimization answer instead. That's a valid senior-level take, but 30 seconds on actual techniques before pivoting would have made it complete.

The Multi-Agent Question

This was my weakest answer: 1 minute and 30 seconds used out of the 5 minutes available. The question explicitly named AWS Bedrock Agents and AgentCore. I mentioned neither.

What I gave instead was skepticism. I questioned whether multi-agent systems need as many agents as people think. I emphasized starting with a solid baseline before adding orchestration layers. I said I'd had "mixed results" from testing.

That skepticism was earned. I had tested multi-agent patterns where different personas intercede at steps in a pipeline to verify and iterate. The results didn't justify the complexity. A single, well-prompted model with good context outperformed the committee approach in my testing.

What I didn't know at the time was that multi-agent systems have a second, much more useful pattern: parallel execution of independent tasks. Not personas checking each other's work, but separate workers handling separate problems simultaneously and merging results. That's concurrency, a concept I understand deeply from building my own process orchestrator. The terminology primed me toward the wrong pattern.
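A minimal sketch of that second pattern, fan-out and merge, with a stub worker standing in for one model call per independent task (the task names are illustrative):

```python
# Parallel multi-agent pattern: independent workers on independent
# tasks, results merged at the end. No personas, no cross-checking.
from concurrent.futures import ThreadPoolExecutor

def worker(task: str) -> str:
    # In a real system this would be one model call per task.
    return f"result:{task}"

def fan_out(tasks: list[str]) -> dict[str, str]:
    """Run independent tasks concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(worker, tasks)
    return dict(zip(tasks, results))

merged = fan_out(["summarize_doc", "extract_entities", "score_match"])
```

Structurally this is just concurrency with a join step, which is why the pattern felt obvious the moment I saw it outside the "agents" vocabulary.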

The irony: right after the interview, while Claude Opus was doing this exact analysis, I watched it call a subagent and work on two things at the same time. My reaction was the only one available. I thought "Fuuuck...", realized my own stupidity from the interview, and went to study more.

I also had already made the "I build first, learn the managed service after" argument about Bedrock in Question 3. Repeating it felt redundant. But the bot doesn't remember Question 3 when scoring Question 7.


What the Bot Said

After the interview, the bot sent a summary evaluation. It praised my "technical knowledge," my "problem-solving skills and practical approach," and my "analytic mindset." It suggested I "structure answers to provide clearer summaries." Then it told me the next step was with the recruiter.

That evaluation was generic. It did not score individual answers. It did not identify the RAG vocabulary gap. It did not note the Bedrock non-answer. It did not recognize the PID framework as original cross-domain work. It did not assess time management. It read like a template that says something encouraging to almost anyone.

The bot received a first-principles RAG derivation and returned a template. It received an original prompt engineering methodology with statistical validation and returned "your analytic mindset is truly valuable." It cannot evaluate how fast a person thinks, their systems thinking, or their drive.

Which is exactly what I told it in my first answer.

The Structural Contradiction

I went back and read the job description after the interview. The "What Sets You Apart" section described exactly what I demonstrated (I hope): someone who understands model architectures (not just integrates APIs), balances innovation with pragmatism, thinks holistically about performance and cost, and navigates ambiguity.

The interview questions and automated evaluation optimize for the opposite: specific framework familiarity, named tools, explicit years of experience.

This isn't a complaint about one company. This is the pattern I've been measuring across the 5,000+ job listings. Companies describe the innovator they want in the job description, then build hiring systems that filter for compliance instead. The process is structurally incapable of identifying the person the aspirational section describes.

The Core Problem

  • The JD describes an innovator. Someone who thinks from first principles, understands architectures, and navigates ambiguity.

  • The interview filters for tool familiarity. Specific frameworks, named platforms, explicit years of experience.

  • The automated evaluation can't tell the difference. A first-principles RAG derivation and a memorized textbook answer produce the same template response.

What a Person Would Have Seen

A senior technical lead sitting across from me would have heard me say "I've never built a RAG" and then watched me derive hybrid search in real time. They would have stopped me, asked "wait, go back, why did you think about it that way?" and pulled on the thread. That question has produced more breakthroughs than any correct answer ever has.

A person adapts the interview based on what you've already demonstrated. "You covered Bedrock in Question 3, so let's go deeper on the orchestration design instead." That's 3.5 minutes of useful signal instead of 3.5 minutes of silence because the candidate already said what he had to say.

A person notices when a candidate challenges 4 of 7 questions and distinguishes between combativeness and intellectual honesty. A person sees the PID framework applied to prompt engineering and recognizes it as original cross-domain thinking, not a missing checkbox.

A bot cannot do any of this. And this is the thesis I have been building evidence for across every study on this website: you have to put a person in the loop.


My Own Evaluation

I rated myself honestly. The overall score is 6.3/10. Not because the answers were bad, but because the execution was inconsistent.

Topic                 Score   Signal                                                         Noise
Rewarding project     7/10    5,000-listing pipeline, PID prompting, statistical validation  Critiqued the process in answer #1, ran out of time
Professional journey  5/10    Tool-agnostic mindset, intrinsic drive                         No company names, no timelines, no specifics
LLM experience        6.5/10  Named tools used, pragmatic production definition              Called cloud "somebody else's computer" to a cloud company
Production AI         7.5/10  26% to 2.9% std dev, PID framework, concrete metrics           Slight frustration at repeating prior answers
RAG pipeline          6/10    Derived hybrid search from first principles in real time       Zero domain vocabulary, self-deprecating language
Prompt engineering    7/10    Double descent, convergence, correct escalation hierarchy      Zero concrete prompting techniques mentioned
Multi-agent systems   5/10    Valid skepticism, baseline-first thinking                      3.5 minutes unused, named frameworks ignored

The pattern across all answers: I naturally expand scope when the question wants depth on one thing. That's a strength in architecture. It's a liability in structured interviews.

Why I Published This Before the Result

I don't know if I passed or failed. This article was written before the company reached a decision. That's deliberate.

If I published this after a rejection, it reads as bitterness regardless of content. If I published after an acceptance, it reads as humble-bragging. Publishing it while the outcome is unknown means the analysis stands on its own.

The point was never about one company or one interview. The point is about what we lose when we remove humans from the evaluation of humans. A bot that can't tell the difference between a memorized textbook answer and a first-principles derivation is not evaluating intelligence. It's evaluating compliance.

And compliance is one thing I'm genuinely bad at.

Methodology

  • Audio recovery: 7 WhatsApp audio files matched to questions using audio duration as a join key between filesystem metadata (macOS afinfo) and WhatsApp chat UI timestamps. All matches unambiguous.

  • Transcription: OpenAI Whisper (medium model, English). One known artifact: hallucinated YouTube outro at the end of Answer 6.

  • Evaluation: Sent only the questions and job title to Claude Opus 4.6, which transcribed the audio with the locally installed Whisper and then evaluated each answer against the exact question text. AI-scored with documented reasoning. No external input on scores.

  • Company: Anonymized. The analysis is about the process, not the company.
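The duration join in the first bullet reduces to a dictionary lookup. A minimal sketch, with hypothetical filenames and durations (the real values came from afinfo and the WhatsApp chat UI):

```python
# Match audio files to interview questions using duration (seconds)
# as the join key. Works because every duration was unique.
def join_on_duration(files: dict[str, int], chat: dict[int, str]) -> dict[str, str]:
    """files: filename -> duration (s); chat: duration (s) -> question."""
    return {name: chat[dur] for name, dur in files.items() if dur in chat}

files = {"aud1.opus": 159, "aud2.opus": 94}   # from filesystem metadata
chat = {159: "Q1 Rewarding project", 94: "Q2 Professional journey"}
matched = join_on_duration(files, chat)
```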

End of journal

Status: ARCHIVED