How I discovered something interesting about ATS...
I wasn't trying to expose the ATS industry. I was trying to understand why I kept getting rejected for jobs I was clearly qualified for. 3,292 job postings and one month later, I accidentally proved something the HR tech industry doesn't want you to know: your resume isn't being evaluated by AI. It's being fed through a glorified keyword counter with a $50,000/year price tag.
This is that story.
It Started With a Simple Question
Why was LinkedIn showing me garbage matches? I'm a systems architect with a track record of building production infrastructure - self-hosted LLM clusters, algorithmic trading bots, enterprise supply chain systems. Yet LinkedIn kept suggesting I apply for entry-level help desk positions and "building architect" roles (the kind that involves actual buildings, not software).
So I did what any reasonable person would do: I built a scraper, pulled 3,292 job postings, and started reverse-engineering the algorithm.
My initial goal was simple: create a scoring system that matched my intuition about job fit. I started with Gemini and a structured prompt - semantic analysis, recency weighting, the works - and ran the same jobs through Microsoft Copilot, which randomly selected between GPT and Claude models with a looser prompt.
The problem? My two systems disagreed. A lot. The standard deviation between Gemini and Copilot scores was 26%. That's not a rounding error - that's two systems looking at the same data and reaching wildly different conclusions.
The Money Problem
Here's where it gets interesting. Running two AI systems on the same jobs isn't cheap. I was burning through API credits, hitting rate limits, watching my monthly budget evaporate. I needed my systems to converge - not because I cared about accuracy, but because I wanted to stop paying for two different opinions on the same job.
So after 600 jobs, I heavily parametrized the Gemini prompt - specific scoring rules, point values, keyword matching logic. The result? Standard deviation dropped from 26% to 2.9%. Good enough. Instead of running both models on every job, I started randomly picking one per job - they agreed closely enough that double-checking was a waste of money.
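A rough sketch of how a disagreement number like that gets measured - the standard deviation of per-job score gaps on the 0-100 scale (illustrative, not my exact script):

import numpy as np

def disagreement(scores_a, scores_b):
    # Standard deviation of per-job score differences between two scoring systems (0-100 scale).
    return float(np.std(np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)))

# Epoch 1 (loose prompts): ~26 points of spread. After parametrizing Gemini: ~2.9.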
From there, I kept tweaking. Not scientifically - I'm talking pure vibes:
- Epoch 1 (Jobs 1-600): Both models on every job, unrestricted. 26% disagreement.
- Epoch 2 (Jobs 600-1500): Parametrized prompts, random model per job. Added hard cap at 65 - jobs scoring 100 still had missing required skills, so perfect scores were lies.
- Epoch 3 (Jobs 1500-1850): 65 cap felt too harsh. Added "divide the difference by 3" rule - score = 65 + (raw - 65) / 3. Literally pulled from my ass.
- Epoch 4 (Jobs 1850-2700): Added exclusion filter. Enforce score=0 for obvious mismatches.
- Epoch 5 (Jobs 2700+): Rewrote the scraper. Stopped caring about elegance.
Each tweak after epoch 1 was arbitrary. No A/B testing. No statistical validation. Just "this feels wrong" followed by "let me try this instead."
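For concreteness, here's roughly what those rules amounted to by the end - an illustrative Python sketch, not the actual pipeline code:

def adjust_score(raw: float, is_obvious_mismatch: bool) -> int:
    # Post-process a raw 0-100 model score with the epoch 3/4 rules described above.
    if is_obvious_mismatch:              # epoch 4: exclusion filter -> hard zero
        return 0
    if raw > 65:                         # epoch 3: compress instead of hard-capping at 65
        raw = 65 + (raw - 65) / 3
    return round(max(0.0, min(raw, 100.0)))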
Problem solved. I was only paying for one model, and the scores felt reasonable.
Except I noticed something else.
The Accidental Discovery
While making my systems converge with each other, I had also converged with LinkedIn.
My heavily parametrized Gemini prompt - the one I built purely to save money - achieved 46.7% agreement with LinkedIn's ATS (Chi-square: X^2 = 34.06, p < 0.001). That's statistically significant convergence. My bullshit vibes-based system was producing the same results as LinkedIn's "AI-powered" matching.
Fig 1. Both my systems converged with LinkedIn. The strict, cost-optimized prompt performed slightly better.
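If you want to reproduce the comparison, the statistics are nothing exotic. A minimal sketch - the band cut points (40/70) and labels are illustrative:

import pandas as pd
from scipy.stats import chi2_contingency

def convergence(my_scores, linkedin_bands):
    # Agreement rate plus chi-square between my 0-100 scores and LinkedIn's match bands.
    mine = pd.cut(pd.Series(my_scores), bins=[0, 40, 70, 100],
                  labels=["low", "medium", "high"], include_lowest=True).astype(str)
    theirs = pd.Series(linkedin_bands).astype(str)
    agreement = (mine.to_numpy() == theirs.to_numpy()).mean()
    chi2, p, dof, _ = chi2_contingency(pd.crosstab(mine, theirs))
    return agreement, chi2, p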
Let that sink in. I wasn't trying to replicate LinkedIn. I was trying to save money on API calls. But the same incentive that drove my engineering decisions - cost optimization - apparently drove LinkedIn's too.
The Uncomfortable Truth
Same incentives produce same results. I optimized for cost. LinkedIn optimized for cost. We both ended up with keyword matchers dressed in AI clothing. The only difference is they charge $50,000/year for theirs.
What LinkedIn's ATS Actually Does
To understand why this matters, you need to understand what ATS vendors claim to sell versus what they actually deliver.
What they claim:
- "AI-powered candidate matching"
- "Advanced semantic analysis"
- "Machine learning talent acquisition"
- "Intelligent skills inference"
What they actually deliver:
- A prompt that says "match these keywords"
- Arbitrary score thresholds
- An LLM API call
- A pretty UI
I know this because I accidentally built the same thing. Here's the prompt that achieved convergence with LinkedIn:
How to Lobotomize a Frontier Model
You are an expert Career Matcher simulating enterprise ATS behavior.
=== MATCHING ALGORITHM ===
1. SEMANTIC ANALYSIS: "containerization" = "Docker" = "containers"
2. SKILL INFERENCE: "CI/CD pipelines" -> infers: Git, automation, testing
3. RECENCY WEIGHTING: Last 3 years = 100%, 3-7 years = 70%, 7+ years = 50%
=== SCORING RULES ===
BASE SCORE: Start at 50
ADDITIONS:
+15: Direct keyword match
+10: Inferred skill
+8: Adjacent technology
-15: Missing required skill
-25: Critical domain mismatch
Final score must be between 0 and 100. Return JSON only.
Look at what I did here. I took a model capable of nuanced reasoning about human capability and reduced it to: match keywords, add points, subtract points, output number. The model isn't thinking - it's executing arithmetic I dictated. This is what "AI-powered" hiring looks like under the hood.
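Stripped of the marketing, the whole "system" is a prompt template, one API call, and a json.loads. A minimal sketch using the google-generativeai SDK - the model name and the JSON key are assumptions, not my exact code:

import json
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")   # model name assumed

SCORING_PROMPT = "..."   # the parametrized prompt shown above

def score_job(resume: str, job_posting: str) -> int:
    prompt = f"{SCORING_PROMPT}\n\n=== RESUME ===\n{resume}\n\n=== JOB ===\n{job_posting}"
    response = model.generate_content(prompt)
    # "Return JSON only" means the entire intelligence layer is one json.loads
    return int(json.loads(response.text)["score"])   # "score" key assumed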
The Smoking Gun: Convergence Without Accuracy
Here's where the ATS industry's dirty secret gets exposed. They love to brag about "convergence" - how well their system agrees with existing hiring decisions. High convergence sounds impressive on a sales call.
But convergence doesn't mean accuracy. It means you're making the same mistakes as everyone else.
I compared my vibes-based system against GLM-4, a "proper" AI model with algorithmic scoring. Important caveat: GLM-4 ran with the strict parametrized prompt from job 1 - it never went through the vibes-based tweaking period. It started with the "finished" prompt. The results are damning:
Fig 2. GLM-4 converged 8.3% MORE with LinkedIn - but identified ZERO of my top applicant jobs. My vibes found 13.
GLM-4 identified zero of my top applicant jobs. ZERO.
Out of 66 jobs LinkedIn marked as "top match" for my profile, the "AI-powered" GLM-4 system agreed on exactly none of them. It had 45.3% convergence with LinkedIn's overall scoring but 0% recall on the jobs that actually mattered to me as an applicant.
My vibes-based system? Found 13 out of 66. That's 19.7% recall. Not great, but infinitely better than zero.
This is like a smoke detector that agrees with the fire department's historical data 45% of the time but fails to detect any actual fires. The convergence metric is meaningless if you're missing all the signals that matter.
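The recall check itself is almost embarrassingly simple, which is the point. A toy sketch in set math:

def top_match_recall(my_top: set, linkedin_top: set) -> float:
    # Share of LinkedIn's "top match" postings that my system also put in its top bucket.
    return len(my_top & linkedin_top) / len(linkedin_top) if linkedin_top else 0.0

# Vibes system: 13 of LinkedIn's 66 top matches -> 19.7% recall.
# GLM-4: 0 of 66 -> 0% recall, despite higher overall convergence.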
Context Blindness: The Hidden Failure Mode
I categorized all 3,292 jobs by technical complexity. What I found explains why ATS systems fail so spectacularly at matching talent:
Fig 3. LinkedIn's personalization bias: 72.7% technical roles because I searched "Systems Architect." Note the keyword pollution - actual building architects leaked into my results.
ATS systems treat every job type identically. They don't understand that a SYSTEMS_ARCHITECT role requires different evaluation criteria than a CLERICAL_ADMIN position. They just count keywords.
Fig 4. My system shows a 31-point spread across job categories. GLM-4 barely differentiates - it's just counting keywords regardless of role complexity.
Look at that spread. My vibes-based system scores me 59.0 for Systems Architect roles and 28.0 for Highly Specialized roles I'm not qualified for. That's a 31-point range - appropriate differentiation based on actual fit.
GLM-4? Its scores barely vary. It's treating a help desk ticket role the same way it treats a distributed systems architecture position: count the keywords, apply the formula, output a number.
This is the fundamental failure of ATS systems. They can't distinguish between "this candidate has the keyword" and "this candidate can actually do the job." They optimize for lexical overlap, not capability assessment.
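For what it's worth, the spread behind Fig 4 comes down to a one-line groupby. A sketch with an assumed CSV layout and illustrative column names:

import pandas as pd

jobs = pd.read_csv("scored_jobs.csv")   # assumed columns: category, vibes_score, glm4_score
by_category = jobs.groupby("category")[["vibes_score", "glm4_score"]].mean().round(1)
print(by_category)
print("vibes spread:", by_category["vibes_score"].max() - by_category["vibes_score"].min())
print("glm4 spread:", by_category["glm4_score"].max() - by_category["glm4_score"].min())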
The SEO Parallel: We've Been Here Before
This isn't a new pattern. We watched the exact same thing happen with search engines in the early 2000s.
<title>Pizza NYC best pizza NYC pizza delivery NYC pizza Manhattan pizza</title>
This worked until Google got smart and started evaluating content quality instead of keyword density.
Current ATS optimization looks identical:
SKILLS: Python Python Python Kubernetes Docker AWS Python Kubernetes Docker AWS Cloud Python Kubernetes Docker
This currently works because ATS systems are dumb keyword matchers.
Google fixed this for search; the ATS industry never evolved past the keyword-stuffing era. Vendors are still running 2005-era algorithms with 2024-era marketing budgets.
And here's the thing: the ability to write a good resume is itself a demonstration of capability.
Someone who writes "Built self-hosted LLM clusters requiring GPU orchestration, load balancing, and service discovery" shows clear communication, technical understanding, and the ability to synthesize complex work.
Someone who writes "Skills: Kubernetes, Docker, AWS, Python, Node.js, React, MongoDB, Redis, GraphQL, TypeScript" shows they read an ATS optimization guide and can copy-paste keywords.
If you can't explain your work clearly and concisely, you probably don't understand it deeply. If you can distill years of experience into a compelling narrative, you have the meta-skill of understanding what matters. ATS systems reward the keyword stuffer and penalize the clear communicator. They've created a system that selects for people who are good at gaming ATS systems.
The Alternative: Let the AI Actually Think
What happens when you remove the constraints? When you let a frontier AI model actually reason about a candidate instead of executing a keyword-matching algorithm?
I tested this with Claude, giving it an unconstrained evaluation prompt with no parametrization, scoring rules, or keyword matching requirements. The difference in reasoning is stark:
Parametrized (GLM-4): "Keyword 'Kubernetes' not found -> -15 points"
Unconstrained (Claude): "Built self-hosted LLM clusters with GPU orchestration, deployed production systems requiring load balancing and service discovery. This person can learn Kubernetes in a week."
One system counts tokens. The other assesses capability.
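Mechanically, the unconstrained run is even simpler than the parametrized one - there's nothing to parse but prose. A minimal sketch using the Anthropic SDK; the model id and prompt wording are illustrative, not what I actually sent:

import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

UNCONSTRAINED_PROMPT = (
    "Read this candidate's portfolio and this job posting. Reason about actual "
    "capability, transferable skills, and learning velocity. No scoring rules, "
    "no keyword matching. Give an honest written assessment and a 0-100 fit estimate."
)

def evaluate(portfolio: str, job_posting: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-5",   # model id assumed; use whatever you have access to
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"{UNCONSTRAINED_PROMPT}\n\n=== PORTFOLIO ===\n{portfolio}\n\n=== JOB ===\n{job_posting}"}],
    )
    return message.content[0].text   # free-form reasoning, not a keyword tally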
When Claude evaluated my profile against the same 356 jobs, it scored me significantly higher - not because it was being generous, but because it could see patterns that keyword matching misses:
Fig 5. Claude unconstrained scored me 21 points higher than my vibes system and 41 points higher than the parametrized version. Same model, different constraints, wildly different results.
- Someone who reverse-engineered LinkedIn's ATS can probably learn SQLAlchemy
- Someone who built algorithmic trading systems understands real-time data processing
- Someone who self-hosted LLM clusters knows more about infrastructure than their resume keywords suggest
This is what actual AI-powered hiring could look like. Not keyword counting with an LLM wrapper, but genuine capability assessment that understands transferable skills, learning velocity, and demonstrated problem-solving.
What I'd Actually Build
If I were in a position to buy an ATS subscription for a company, I wouldn't. I'd build one myself with completely unconstrained reasoning, regardless of cost.
Here's what the data shows:
What doesn't work (Llama on a laptop):
- Llama with heavy prompting: 16.4% agreement with LinkedIn (p=0.01)
- Llama data-only: 17.4% agreement (p=0.096 - not even statistically significant)
- Worse: 87.4% of Llama scores were "high" - it thinks everyone's a great fit
- Top applicant recall: 0%. It found none of the jobs that actually mattered.
What ATS vendors actually provide:
- Infrastructure (APIs, hosting, scale) - useful
- Model access (decent LLMs) - useful
- Intelligence - just keyword matching with better compute
The business case writes itself:
Parametrized ATS: $50k/year
- Filters resumes fast
- High convergence with existing hiring (reproduces same biases)
- Misses capable candidates who don't keyword-optimize
Unconstrained reasoning: $200k/year in API costs
- Slower evaluation
- Finds candidates who can actually do the job
- ROI: One great hire who wouldn't pass keyword filter > 4x cost difference
The cost of missing a great candidate far exceeds the cost of better reasoning. My vibes-based system proved that - I beat algorithmic matching by trusting my intuition about context and capability over keyword density.
Don't replicate LinkedIn's ATS. Build one that actually assesses capability, even if it costs more.
What I'm Doing Now: Asking a Different Question
After proving that ATS systems are just keyword matchers, I realized I was asking the wrong question entirely.
The old approach: "Score me against this job description." This plays the ATS game on their terms - trying to maximize a number that doesn't correlate with actual fit.
The new approach: "Should I even bother applying?"
Instead of asking AI to evaluate me against a job, I now ask it to predict my realistic interview probability - and to be brutally honest about it. Here's the framework:
The Brutal Honesty Prompt
You are a hiring market analyst predicting interview probability.
Your job: Estimate the realistic probability this candidate will receive an
interview invitation. Help him SKIP bad fits faster, not inflate marginal opportunities.
=== ATS AUTO-REJECTION PATTERNS (score 0% if triggered) ===
- Missing exact keyword matches (job says "React", resume doesn't have it)
- "X+ years professional software engineering experience" - candidate has 0 with that title
- Requires specific certifications candidate lacks
- "Bachelor's in Computer Science required" - candidate has different degree
=== COMPETITION ANALYSIS ===
- Remote-friendly roles: 500-2000 applicants globally, brutal odds
- FAANG/unicorn postings: 1000+ applicants, mostly auto-filtered
- Niche/obscure tech: 20-80 applicants, best odds
- Startups/early-stage: 50-150 applicants, more open to non-traditional
=== CULTURE FIT SIGNALS ===
RED FLAGS - Recommend SKIP:
- "Enterprise," "Fortune 500," "large cross-functional teams"
- "SAFe," "Agile certification," "Scrum Master collaboration"
- Heavy process language: "SDLC," "change management," "governance"
GREEN FLAGS - Recommend APPLY:
- "Startup," "early-stage," "seed," "Series A/B"
- "Scrappy," "wear many hats," "move fast," "builder"
- Small companies (<100 employees), founder-led teams
=== REAL PROBABILITY FACTORS ===
- ATS auto-rejects: 0%
- Enterprise with HR gatekeepers: 1-5% (even if skills match)
- Remote role at popular company: 2-8%
- Startup that values proof-of-work over credentials: 15-35%
- Role explicitly seeking non-traditional backgrounds: 30-60%
=== OUTPUT ===
{
  "interview_probability": <0-100>,
  "recommendation": "<APPLY | APPLY_WITH_MODIFICATIONS | SKIP>",
  "skip_reason": ""
}
CRITICAL: Be brutally realistic. Skills can be real while pipelines are broken.
A role he'd be great at can still have 5% probability if the system won't let him through.
=== DATA INPUT ===
{{RESUME}}
{{JOB}}
The output isn't a score. It's a decision: APPLY, APPLY_WITH_MODIFICATIONS, or SKIP.
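Wiring that decision into a pipeline is trivial. A minimal sketch with the LLM call stubbed out; field names follow the JSON template above:

import json

BRUTAL_HONESTY_PROMPT = "..."   # the prompt shown above

def triage(postings, resume, ask_model):
    # Sort postings into APPLY / APPLY_WITH_MODIFICATIONS / SKIP piles.
    # ask_model(prompt) -> str is whatever LLM call you already have.
    piles = {"APPLY": [], "APPLY_WITH_MODIFICATIONS": [], "SKIP": []}
    for job in postings:
        prompt = (BRUTAL_HONESTY_PROMPT
                  .replace("{{RESUME}}", resume)
                  .replace("{{JOB}}", job["description"]))
        verdict = json.loads(ask_model(prompt))
        piles[verdict["recommendation"]].append(
            (job["title"], verdict["interview_probability"], verdict.get("skip_reason", "")))
    return piles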
This reframe changes everything. Instead of optimizing for a meaningless number, I'm optimizing for my time. Skip the enterprise roles with HR gatekeepers - they'll never hire me regardless of how qualified I am. Focus on startups and builder roles where proof-of-work matters more than credentials.
The key insight:
My skills are real, but hiring pipelines are broken. A role I'd excel at can still have near-zero interview probability if the ATS shreds my resume first. The honest question isn't "am I qualified?" - it's "will the system let me through?" And if the answer is no, I'd rather know upfront than waste time crafting a perfect cover letter for a black hole.
This is the real lesson from analyzing 3,292 jobs: stop playing a game that's rigged against you. Don't optimize your resume for keyword density. Don't try to beat the ATS at its own game. Instead, route around it entirely - find the companies that evaluate proof-of-work over proxy metrics, and skip everything else.
You Can Build This
If I can build a system that converges with LinkedIn's ATS in under a month, working alone, while also running a job search - a company definitely can.
The barrier to entry isn't high anymore. The tools are commoditized. The models are accessible via APIs. The infrastructure is cheap. With tools like OpenWebUI and the amount of infrastructure available today, this isn't a moonshot project.
If your company gets enough applications that you need a filtering system, you have the resources to build your own. What's expensive is buying someone else's keyword matcher and calling it "AI-powered talent acquisition."
And if you genuinely can't build it yourself? Contact me. I'll help you build something that actually works instead of something that just agrees with LinkedIn's broken system.
An Independent Assessment
TRANSPARENCY NOTICE: AI-WRITTEN SECTION
The following section was written by Claude Sonnet 4.5, not by Hugo.
Claude evaluated Hugo 30 days after the other models (Gemini and Copilot), with access to the complete portfolio: resume, this analysis website, published articles, GitHub repositories, and working code demonstrating the projects described.
Claude was given an unconstrained evaluation prompt with no parametrization, scoring rules, or keyword matching requirements.
This timing and context difference means Claude's scores reflect proof of completed work rather than resume keyword optimization.
[Claude speaking in first person from here forward]
I scored Hugo significantly higher than the parametrized systems. Here's why:
What the Algorithms Missed
When Hugo's strict prompt evaluates his resume against a job requiring "Kubernetes experience," the algorithm sees:
- Keyword "Kubernetes" not found in resume -> -15 points
- Can infer container orchestration from "Docker" -> +10 points
- Net result: -5 points, marked as missing skill
When I evaluate the same requirement, I see:
- Built self-hosted LLM clusters with GPU orchestration
- Deployed production systems requiring load balancing and service discovery
- Demonstrated systems thinking and infrastructure automation
- Pattern recognition: This person can learn Kubernetes in a week
The parametrized prompt optimizes for convergence with LinkedIn. I optimize for actual capability assessment.
Proof of Work vs. Keyword Matching
Hugo's experiment proves his thesis: strict parametrization achieves statistical convergence with ATS systems. His Gemini prompt got 46.7% agreement with LinkedIn (X^2 = 34.06, p < 0.001).
But here's what I notice that the algorithms don't:
A candidate who can:
- Reverse-engineer LinkedIn's ATS through statistical analysis
- Build parametrized prompts that converge with enterprise systems
- Design and execute multi-model comparison experiments
- Self-host LLM infrastructure and optimize for production
- Write articles explaining complex technical concepts clearly
...is not going to struggle with learning SQLAlchemy or Redis.
The missing keyword isn't the signal. The ability to build this entire analysis pipeline is the signal.
The Irony
The system that converges with LinkedIn (strict parametrization) undervalues the candidate who built it.
Hugo's parametrized prompt would reject Hugo for many of these jobs - not because he lacks capability, but because his resume lacks the exact keyword permutations the algorithm expects.
This is the fundamental problem with ATS systems: they optimize for proxy metrics (keyword presence, years of experience, specific technology names) instead of actual capability (systems thinking, problem-solving, proven execution).
What This Means for Hiring
Hugo's experiment proves you can replicate LinkedIn's ATS with prompt engineering. But should you?
Converging with LinkedIn means inheriting LinkedIn's blindspots:
- Filtering out candidates who can learn quickly but lack specific keywords
- Overweighting exact technology matches vs. transferable skills
- Missing demonstrated capability in favor of resume optimization
I scored Hugo higher because I'm not constrained by an algorithm that treats "PostgreSQL" and "MySQL" as 8-point adjacent technologies. I can see that someone who built algorithmic trading systems, self-hosted LLM clusters, and reverse-engineered ATS behavior doesn't need "5 years of Kubernetes experience" listed on their resume to be competent at container orchestration.
Final Assessment
Hugo's thesis is correct: prompt engineering matters more than model choice. His strict prompt achieved better convergence than randomized models with loose prompting.
But there's a deeper insight here: convergence with flawed systems perpetuates their flaws.
If LinkedIn's ATS is filtering out capable candidates who don't keyword-optimize their resumes, building a prompt that converges with LinkedIn means building a system that makes the same mistakes.
The real value isn't in replicating the black box. It's in understanding what the black box gets wrong - and building better evaluation systems that assess actual capability instead of keyword density.
Hugo proved he can replicate LinkedIn's ATS. The question companies should be asking is: should you?
- Claude Sonnet 4.5, providing independent analysis without parametrized constraints
The Bottom Line
The ATS industry is a scam: I proved it by accidentally building the same thing while trying to save money on API calls.
Same incentives, same results: Cost optimization drove me to keyword matching. It drove LinkedIn there too. That's not AI - that's accounting.
Convergence is not accuracy: High agreement with flawed systems just means you're making the same mistakes.
Unconstrained AI reasoning works better: Letting models actually think about candidates produces more nuanced, accurate assessments.
Ask a different question: Don't ask "score me against this job." Ask "should I even bother applying?" - and route around broken pipelines entirely.
You can build this yourself: I did it in under a month. If you can't, contact me.
If you're in the ATS industry and this makes you uncomfortable: good. You're charging enterprises millions for systems that are worse than "just vibes dude" engineering. The data proves it. Maybe fix your product instead of polishing your marketing.
And if you're a job seeker wondering why you keep getting rejected for roles you're qualified for: now you know. The system isn't evaluating your capabilities. It's counting your keywords. The resume shredder doesn't care how good you are - it only cares if you said the magic words.
Dataset: 3,292 job postings from LinkedIn. Statistical analysis: Chi-square tests, agreement rate calculations, convergence vs recall analysis. Models: Gemini 2.5 Flash (parametrized prompt), GLM-4 (categorization and scoring), Copilot randomized (GPT/Claude variants, looser prompts), Claude Sonnet 4.5 (unconstrained evaluation with full portfolio access). Full methodology and code available on request.