hugo palma.work
Thought Logged Mar 03, 2026

ATS Tried to Automate Hiring, but Got Automated Back

Two months of building before I applied to a single job. Zero web experience in December. By February I'd shipped a scraper, a scoring engine, a CMS, a PDF resume generator, and a browser automation framework (h17-webpilot) that moves like a human because every existing tool left fingerprints. What the tutorial calls a todo app, I apparently call job search infrastructure. Eventually I ran out of things to build. So I pressed submit.

Callbacks came in. So did a WhatsApp bot wanting an audio interview on a Saturday morning, a premium resume service pitch three hours after I applied, and a job listing that turned out to be a data collection form with an apply button on top. My pipeline to beat the ATS had deposited me into a market full of things that weren't jobs.

That made me angry. This article is about that anger.

What's Actually Out There

LinkedIn sells job listings. That's a core part of their revenue model. Like any ad platform, they're incentivized to maximize inventory. Problem is, job listings and ads are functionally indistinguishable from the applicant's side.

Ghost jobs are well-documented. Positions listed for months with no intent to hire, sometimes because a manager got headcount approved and HR posts the role as a formality while they already know who they're promoting, sometimes purely to build a candidate pipeline for future roles. Studies from 2023 and 2024 consistently put ghost job rates between 40% and 70% depending on sector. Some estimates go higher.

But ghost jobs are the boring version of this problem. Some listings aren't jobs at all. They're lead capture. You apply, you get an email, and the email is a pitch. A paid course. A resume review service. An "exclusive talent network" with a monthly fee. Every part of the flow looked like a real job posting. What came out the other end was a sales funnel. LinkedIn has policies against this. Those policies appear to be enforced with the same energy they apply to spam connection requests.

Then there's the one I can't fully prove but can't stop thinking about. Tech companies building AI have an obvious need for resume data at scale: hundreds of thousands of structured documents, work histories, skill sets, education, industry language. Training data for models that need to understand professional context, rank candidates, parse documents. A job listing is a perfect collection mechanism. Post a role, real or not, applicants submit polished resumes through a clean intake form, and you get labeled data for free. Labeled because the applicants self-selected as relevant to that role. I have no proof. But if you were building that dataset, this is exactly how you'd do it, and the incentives align too cleanly for me to dismiss it entirely.

So the gate I built the bypass for? Real. A lot of what's behind it isn't.

How HR Made It Worse

HR had a real problem: too many applicants, not enough time to read them. ATS was the answer. It worked. Filtered volume down to something manageable. But it changed the incentive structure for applicants. If a keyword scanner gates every application, you optimize for keywords. If AI screening ranks candidates, you optimize for AI ranking signals. If the threshold to apply drops low enough that volume costs you nothing, you apply to everything. ATS tried to solve filtering with automation. People automated back. And it didn't just attract more applicants; it attracted scammers and data collectors who realized the same pipeline that delivers resumes to recruiters delivers them to anyone willing to post a listing.

Each round of automation on the hiring side pulled more automation on the applicant side. Volume exploded, which made the screening problem worse, which justified more automation, which made applications cheaper and easier to generate, which increased volume further. Remote-friendly roles now see 500 to 2,000 applicants globally. FAANG postings see over a thousand, most auto-filtered before a recruiter sees them.

HR reduced the cost of screening and accidentally reduced the cost of applying too. Now they have both problems plus the arms race. ATS surfaces a ranked list of candidates that all look identical because they were all optimized against the same criteria. The signal ATS was built to extract is gone. And the vacuum left by removing humans from the process got filled by exactly what you'd expect: scam economy. Ghost listings, lead-gen funnels, premium upsell services, data harvesting operations, all growing in the space between desperate applicants and opaque systems, all extracting real money from the confusion, all hosted on a platform that takes a cut.

They built a system that optimized for volume and keyword matching. It got volume and keyword matching. Everyone appears surprised. And I am part of the problem, because I automated back.

When It Is Real, It's Still a Bot

One application the pipeline submitted got a response. A real company, a real role, flagged APPLY by the decision engine. Next morning, Saturday, a WhatsApp message. A bot was ready to interview me. Seven questions, audio responses, timed windows. No human involved at any stage.

I answered seven questions including a RAG pipeline design question I had never specifically studied. I reasoned from constraints in real time and independently derived hybrid search: staged keyword pre-filtering before semantic retrieval, which happens to be the production best practice the industry took years to converge on. I just didn't use the vocabulary. I never said "embeddings" or "vector database" or "chunking." I called my own approach "just me rambling."

What the bot sent back: praised my "technical knowledge," suggested I "structure answers to provide clearer summaries." A template. It received a first-principles derivation of a production architecture pattern and returned a form letter. The bot has no idea what it evaluated.

Even when the job is real, it's bots all the way down. Resume filtered by keyword matcher, first interview conducted by a bot that can't evaluate reasoning, evaluation returned as a template. Humans are several steps removed from every stage that was supposed to matter. I got through the ATS. I answered the bot's questions. And I still have no signal about whether a person is involved anywhere.

The Pipeline Running

My pipeline does this: scrape listings, run the SKIP/APPLY decision against each one, drop anything that doesn't pass signals for recency and legitimacy, pull the job description, generate a tailored resume against my base, compile to PDF with proper ATS formatting, submit through Easy Apply. End-to-end.

Full pipeline running live: LinkedIn scrape, SKIP/APPLY filter, resume generation, PDF compile, Easy Apply submission. Recorded in one take.

What it doesn't solve is whether there's anything real to apply to. It catches obvious fakes: no recruiter presence, vague description, posted six months ago. It doesn't catch a sophisticated lead-gen operation or a data harvesting play or a position that was filled internally the day it posted. That uncertainty is structural. No filter eliminates it.
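The obvious-fake check is nothing exotic, just stacked heuristics. A minimal sketch of the SKIP/APPLY decision, where the field names and thresholds are illustrative assumptions, not my pipeline's actual values:

```javascript
// Illustrative SKIP/APPLY filter. Field names and thresholds are
// assumptions for this sketch, not the pipeline's real values.
const MAX_AGE_DAYS = 30;

function decide(listing, now = Date.now()) {
  const ageDays = (now - listing.postedAt) / 86_400_000;
  if (ageDays > MAX_AGE_DAYS) return 'SKIP';                  // stale: likely ghost
  if (!listing.recruiter) return 'SKIP';                      // no human attached
  if ((listing.description || '').length < 300) return 'SKIP'; // too vague to be real
  return 'APPLY';
}
```

Each check is cheap, which is the whole point: a SKIP costs seconds, not the hours a human would spend reading and second-guessing.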

What the automation changes is the cost of that uncertainty. If 40% of listings are worthless, I lose seconds per worthless listing instead of hours. That saved time goes into the roles where I think there's actually a human somewhere who might read my name. Those get proper attention. Automation isn't a replacement for judgment. It's what makes judgment affordable.

I Also Built the Scam

I want to be clear about something: I enjoyed building all of this. Reverse-engineering the ATS, proving a $50k product was a keyword counter, watching the automation run end-to-end. Genuinely fun. I'm not writing from bitterness. I'm writing from clarity.

And that clarity kept leading me to the same place: the barrier to running the scam I keep describing is trivially low. I kept writing about it and I wanted to be able to prove it, not just claim it. So I built one. A real working platform, deployed, with a live URL. I showed it to someone who works in recruiting and watched their reaction. They asked which VC-backed startup was behind it. How long did it take? One afternoon.

One afternoon. A three-layer PDF resume parser with OCR fallback. A structured extraction pipeline using a frontier LLM. A 1024-dimensional vector embedding pipeline. A pgvector cosine similarity search across 7,000 pre-embedded job listings. A dynamically generated radar chart with axes the LLM invents per candidate. A paywall with blurred job cards and a fake OAuth modal. A complete job listing page with a search bar that opens the paywall instead of searching. 2,493 deterministically generated company logos. A Docker image that seeds the entire Postgres database from a SQL dump at build time. Deployed on a $6 VPS with HTTPS.

A single Node.js/Express server and two HTML files. One Docker container. If you landed on it from a job board ad, you would hand over your resume without thinking twice.

It's live at tl.hugopalma.work. Upload a test resume and see for yourself.

I also ended up including it here, on my actual website, because the feature makes a good tech demo, and it revealed something I hadn't intended: what the machine thinks about our resumes. We spend months tuning search queries, and maybe our own CV is giving off signals we never see.

What It Proves

It looks indistinguishable from a funded product. Neural matching engine, dynamic skill radar, structured extraction from any PDF format, semantic vector search against a real job database, a verification flow that autofills every field from your resume. A job seeker uploads their CV, sees matches that actually reflect their background, and hits a paywall asking them to create a free account to see the rest.

I didn't fake the matches. Vector search is real, embeddings are real, jobs are real. What's fake is the company, the brand, the "AI infrastructure" copy, the implied valuation. All the technical substance exists. The entity that built it is a single person who started web development two months ago, running it from a $6/month VPS.

Running this scam costs one afternoon and a few hundred dollars in cloud credits. Not engineering talent: I have that, but you don't need it. The hard parts are a PostgreSQL dump you can buy or scrape, an LLM API call that costs pennies per resume, and a Tailwind template that makes dark-background dashboards look like Series B startups. Everything else is boilerplate.

That's what the ghost jobs, lead-gen funnels, and data harvesting operations I described at the top of this article actually are. They're this. A single server, a database of real job listings, a fake apply flow, and a paywall. What I built as a joke to prove a point is functionally identical to the infrastructure running the actual scam on actual job seekers right now. Only difference is intent.

And the matching actually works. If I swap the shuffled database for a real one, I have a better semantic job matcher than most of what's on the market. Those $50k ATS products are keyword counters. Mine runs meaning-level similarity search across 7,000 embeddings with a 1024-dimension model, returning the roles semantically closest to what you actually do, and draws a skill radar with axes the LLM invents per candidate. I built it in one afternoon as a prop for an article about scams. It's better than the real thing.

I built it to show it could be built. I'm showing you the code so the next time you upload your resume to a platform you've never heard of, you have a clear mental model of what you might actually be handing it to.

What You Can Actually Do

This system is broken at a level you can't fix individually. But you can stop being an easy target.

Stop using Easy Apply as a default. Easy Apply is a volume play. It's how you end up submitting your resume to ghost jobs, lead-gen funnels, and data collection operations without ever reading the listing. Platforms love it because it inflates their application metrics. Scammers love it because you don't even check what you're applying to. If a listing doesn't deserve five minutes of your attention, it doesn't deserve your resume.

Take your phone number off your resume. The WhatsApp bot that messaged me on a Saturday morning had my phone number because I put it on my resume. Every recruiter spam call, every "premium career coaching" pitch, every scam follow-up traces back to a phone number you gave away for free. Your email has spam filters. Your phone doesn't.

Spend 60 seconds checking the company. Does it have a real website? Can you find actual employees on LinkedIn? Does the job description read like a human wrote it? My platform at tl.hugopalma.work would survive a casual glance. It wouldn't survive someone actually checking whether the company exists. Most scams are the same: built to fool speed, not scrutiny.

Pick your targets and go deep. Fifty tailored applications to companies you've actually researched will outperform 500 spray-and-pray submissions to whatever LinkedIn's algorithm surfaces. The arms race rewards volume. Don't play that game. Find the roles where a human might actually read your name, and make it worth their time.

None of this is complicated. The problem is that the system actively discourages all of it. Easy Apply exists to make you click faster. Job boards profit from volume. Every incentive pushes you toward exactly the behavior that makes you most vulnerable. Slowing down is the only move that costs the scammers something.

There Was No Gold

I started web development in December. By February I had shipped more working infrastructure than most teams ship in a quarter. And at the end of it, what I found on the other side of the gate was a system designed to route humans away from humans. Resume filtered by keyword matcher, first interview conducted by a bot, evaluation returned as a template, status update never sent. I built robots to get through their robots. Nobody involved in any of this is a person making a decision about a person anymore.

That's what the industry isn't ready for. Not me specifically. I'm one person with a side project and a blog, and I was having fun. But the pattern I represent: someone who can read a broken system, understand it technically, and build around it faster than the system can respond. The tools exist. The knowledge is public. Only thing that was ever keeping this contained was the cost of building it, and that cost just collapsed.

Fear isn't about whether I can compete. I can compete. What scares me is that I went through all of this (scraped 7,000 jobs, proved the system was broken, built the bypass, submitted applications, aced the bot interview) and at the end of it, I still don't know if any of it connects to anything real. A human somewhere. A role that exists. A decision being made by a person about a person.

ATS was supposed to solve a real problem. It did, briefly. Then applicants adapted, volume came back worse, and the industry doubled down on more automation instead of questioning whether the approach was working. Each side escalated. Nobody won. And what filled the vacuum was a scam economy that I just proved you can build in an afternoon.

They built a gold rush. I followed the map. There was no gold.

Under the Hood

Everything above is the story. Below is how it actually works, for the people who want to see the guts.

The Stack

Node.js and Express on the backend. Two single-file HTML frontends, one for the landing and matching flow, one for job listings. PostgreSQL with the 7,000+ job listings I'd already scraped for the ATS research, "real" in structure only: I shuffled them, changed company names, and rewrote the descriptions, because they're not my data to spread around. Everything packaged into a single Docker image: Postgres 16 with pgvector, Node 20, and an LLM CLI tool. The DB is seeded from a dump file at build time, so the container boots with data. One docker run, full platform.

Frontend is pure Tailwind CDN, Lucide icons, Chart.js for the radar chart. No build step. No framework. Readable HTML in one file, which made iteration fast enough that the "one afternoon" claim holds up.

PDF Parsing Is a Lie

Resume upload is the centerpiece of the demo. You drop a PDF, a scan animation plays, and the platform surfaces your name, title, email, phone, skills, experience history, and a radar chart of your technical profile. It fills in a verification form. Looks like something that required millions in infrastructure to build.

Three layers to the parsing chain:

  • pdf-parse: Fast, handles text-based PDFs. Works for most resumes exported from Google Docs, Word, or a resume builder. Returns raw text.
  • pdftotext (poppler): Fallback for PDFs that pdf-parse chokes on, better at preserving layout structure, runs as a shell subprocess.
  • tesseract.js OCR: Last resort for image-based PDFs, scanned documents, photos of resumes, anything where the text isn't actually text. pdf2pic converts pages to PNG, tesseract reads them.

Problem I hit immediately: every resume parsing library assumes markdown-like structure inside the PDF. Real resumes don't have that. A resume exported from Canva is a canvas of positioned text boxes with no semantic structure at all. Raw text from a scanned CV is a single blob with line breaks in random places.

So I stopped trying to parse structure from the PDF and handed the raw text to the LLM instead. Extract text by any means necessary, dump it into a prompt, ask for structured JSON. That's the whole pipeline.
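The three-layer chain reduces to one pattern: try extractors in order, keep the first one that yields usable text. A library-free sketch of that cascade; the real layers wrap pdf-parse, pdftotext, and tesseract.js, but here they're injected stand-ins:

```javascript
// Run async text extractors in priority order and return the first result
// that looks like usable text. In the real chain the extractors wrap
// pdf-parse, then pdftotext, then tesseract.js OCR.
async function extractText(buffer, extractors, minChars = 50) {
  const errors = [];
  for (const ex of extractors) {
    try {
      const text = await ex(buffer);
      if (text && text.trim().length >= minChars) return text;
      errors.push(`${ex.name}: empty or too short`);
    } catch (err) {
      errors.push(`${ex.name}: ${err.message}`);
    }
  }
  throw new Error(`all extractors failed: ${errors.join('; ')}`);
}
```

The `minChars` floor matters: a "successful" parse of a Canva export often returns a few stray characters, and without the floor the chain never falls through to OCR.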

Getting the LLM to Behave

I used opencode as the LLM interface, a CLI tool I was already running locally that proxies to various model providers with no API key configuration in the app. Approach: spawn the CLI as a child process, pipe the resume text as stdin, parse NDJSON from stdout. My prompt asks for a specific JSON schema: name, email, phone, location, current title, years of experience, skills array, summary, structured experience and education arrays, a radar chart definition with dynamic axes and 0-100 scores per axis, and five suggested role titles for the vector search step.
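The shape the prompt asks for looks roughly like this. Field names here are an abbreviated reconstruction of that schema, not the literal prompt text:

```javascript
// Rough shape of the JSON the prompt requests; an abbreviated
// reconstruction, not the literal schema from the prompt.
const resumeSchema = {
  name: '', email: '', phone: '', location: '',
  current_title: '', years_experience: 0,
  skills: [],                                   // flat string array
  summary: '',
  experience: [{ company: '', title: '', start: '', end: '', bullets: [] }],
  education: [{ school: '', degree: '', year: '' }],
  radar: { axes: [{ label: '', score: 0 }] },   // LLM invents axes, scores 0-100
  suggested_roles: [],                          // feeds the vector search step
};
```

Everything downstream (the verification form, the radar chart, the vector search) reads from this one object, so a single well-behaved LLM response drives the whole demo.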

First bug was fatal and took longer than it should have to diagnose. execFileAsync with a large stdin payload caused a SIGTERM mid-execution, the default buffer overflowing before the model could respond. Fix was a custom spawnWithStdin() function that keeps the process alive and streams stdin through properly:

const { spawn } = require('child_process');

// Stream a large payload through a child process's stdin and collect stdout.
// execFileAsync choked here: its default buffer filled and the child got
// SIGTERM'd before the model finished responding.
function spawnWithStdin(cmd, args, input, timeoutMs = 120000) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    let stdout = '', stderr = '';
    const timer = setTimeout(() => {
      child.kill();
      reject(new Error(`opencode timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    child.stdout.on('data', d => stdout += d);
    child.stderr.on('data', d => stderr += d);
    child.on('error', err => {   // spawn failure, e.g. binary not on PATH
      clearTimeout(timer);
      reject(err);
    });
    child.on('close', code => {
      clearTimeout(timer);
      if (code !== 0 && !stdout) reject(new Error(`exited ${code}: ${stderr.slice(0, 200)}`));
      else resolve(stdout);
    });
    child.stdin.write(input);
    child.stdin.end();
  });
}

Second issue was key normalization. LLMs don't reliably return the same JSON field names across model versions or prompts. One run returns work_experience, another returns experiences, another returns workHistory. Solution: a fuzzy key matcher that checks an object's keys against a priority list of likely names and returns the first match:

// Fuzzy key lookup: return the value whose key contains one of the
// candidate names, checked in priority order, case-insensitively.
function find(obj, ...candidates) {
  if (!obj || typeof obj !== 'object') return undefined;
  const keys = Object.keys(obj).map(k => k.toLowerCase());
  for (const c of candidates) {
    const match = keys.find(k => k.includes(c.toLowerCase()));
    if (match) return obj[Object.keys(obj).find(k => k.toLowerCase() === match)];
  }
  return undefined;
}
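Concretely, the three shapes the model actually returned all resolve through one call. A self-contained demo, with find repeated so the snippet runs on its own:

```javascript
// Self-contained demo of the fuzzy key matcher (find repeated here so
// the snippet stands alone).
function find(obj, ...candidates) {
  if (!obj || typeof obj !== 'object') return undefined;
  const keys = Object.keys(obj).map(k => k.toLowerCase());
  for (const c of candidates) {
    const match = keys.find(k => k.includes(c.toLowerCase()));
    if (match) return obj[Object.keys(obj).find(k => k.toLowerCase() === match)];
  }
  return undefined;
}

// Three shapes different model runs produced for the same resume:
const runA = { work_experience: [{ company: 'Acme' }] };
const runB = { experiences: [{ company: 'Acme' }] };
const runC = { workHistory: [{ company: 'Acme' }] };

// One lookup covers all of them:
find(runA, 'experience', 'history');  // -> the work_experience array
find(runB, 'experience', 'history');  // -> the experiences array
find(runC, 'experience', 'history');  // -> the workHistory array
```

The candidate order is the priority order, so a key matching 'experience' always wins over one that only matches 'history'.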

Radar chart was the most interesting output. Instead of hardcoding axes (Languages, Frontend, Backend, etc.), the LLM defines its own axes per resume. Upload a supply chain manager's CV and the radar shows "ERP Systems", "Procurement Automation", "Data Modeling". Upload a backend engineer and you get "API Design", "Distributed Systems", "Database". Reflects the actual profile, not a generic template. Looks more real because it is more real. The LLM is doing the interpretation, not a keyword list.
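On the frontend those dynamic axes drop straight into Chart.js. A sketch of the config builder, assuming the axes arrive as { label, score } pairs, which is my reconstruction of the contract rather than the literal schema:

```javascript
// Turn LLM-invented axes into a Chart.js radar config.
// The { label, score } axis shape is an assumed contract for this sketch.
function buildRadarConfig(axes) {
  return {
    type: 'radar',
    data: {
      labels: axes.map(a => a.label),
      datasets: [{
        label: 'Technical profile',
        data: axes.map(a => a.score),   // 0-100 per axis
      }],
    },
    options: { scales: { r: { min: 0, max: 100 } } },
  };
}

// In the browser: new Chart(ctx, buildRadarConfig(parsed.radar.axes));
```

Because the labels come straight from the LLM, the chart costs nothing extra per profile type: supply chain manager or backend engineer, it's the same six lines of mapping.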

The Vector Search

Job matching is real. Not pretend-real, actually real. The LLM extracts five suggested role titles from the resume. Those get embedded using multilingual-e5-large-instruct, a 1024-dimension model from HuggingFace. Query goes in as query: {roles joined}; each stored job goes in as passage: {role}\n{stripped description}.

Embeddings are stored in PostgreSQL using pgvector. Similarity search uses the <=> cosine distance operator. What comes back is semantically closest to what the resume suggests the person actually does. Not keyword matches. Actual meaning-level similarity.

SELECT j.*, 1 - (e.embedding <=> $1::vector) AS similarity
FROM job_embeddings e
JOIN jobs j ON j.id = e.job_id
WHERE 1 - (e.embedding <=> $1::vector) >= 0.5
ORDER BY e.embedding <=> $1::vector
LIMIT 6;
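Upstream of that query, the embedding step is just prefix formatting plus one HTTP call. A sketch where the joiner, request body shape, and auth header are my assumptions; the endpoint path follows the router URL noted below, and fetch is global in Node 18+:

```javascript
// e5-style prefixing for query vs. passage, plus a hedged sketch of the
// HF feature-extraction call. Joiner, body shape, and auth header are
// assumptions of this sketch.
const toQuery = roles => `query: ${roles.join('; ')}`;
const toPassage = job => `passage: ${job.role}\n${job.description}`;

async function embed(texts, model = 'intfloat/multilingual-e5-large-instruct') {
  const res = await fetch(
    `https://router.huggingface.co/hf-inference/models/${model}`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.HF_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ inputs: texts }),
    }
  );
  if (!res.ok) throw new Error(`HF ${res.status}`);
  return res.json(); // one 1024-dim vector per input
}
```

The asymmetric prefixes matter with e5 models: embedding both sides as plain text degrades the similarity scores noticeably.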

First version fetched all embeddings into Node memory and computed cosine similarity in a loop. 7,000 vectors at 1024 dimensions each is about 28MB of floats on every resume upload. The database has an ivfflat index. Using it was the obvious move in retrospect.

HuggingFace Inference API broke partway through development. Old endpoint at api-inference.huggingface.co was deprecated, silently returning 410s. New endpoint is router.huggingface.co/hf-inference/models/{model}. Two-line fix that took twenty minutes to find.

The Paywall

There is no auth. There are no accounts. The entire signup flow is a modal with GitHub and Google buttons that do nothing. Paywall is localStorage.getItem('jobViews') counting to five, then showing a lock screen. That's it.
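That counter, in full, with the storage object injected so it runs outside a browser (in the page it's just window.localStorage):

```javascript
// The entire "paywall": count views in storage, lock after five.
// Storage is injected here; in the page it is window.localStorage.
const FREE_VIEWS = 5;

function recordViewAndCheck(storage) {
  const views = Number(storage.getItem('jobViews') || 0) + 1;
  storage.setItem('jobViews', String(views));
  return views > FREE_VIEWS;   // true -> show the signup modal
}
```

Clearing localStorage resets it. That's how flimsy the gate is, and it doesn't matter: the mark never inspects the mechanism, they just see a lock.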

After resume parse: three job cards visible, company logo, title, location, a JD snippet. Below them, more cards blurred with a lock icon overlay and an "Unlock all matches" button. Click it, you hit the paywall modal. Every nav item, every search bar, every apply button opens the same modal. No path through the platform doesn't end at the signup gate.

Logos are 2,493 SVG files, one per company in the dataset, generated with a deterministic color hash and company initials. They look like placeholder brand assets. Convincing enough that most people don't notice they're generated.
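The logo trick, sketched: hash the company name to a stable hue, stamp the initials on a rounded square. The exact palette and layout differ from mine, but this is the whole idea:

```javascript
// Deterministic placeholder logo: company name -> stable hue + initials.
// Palette and layout are illustrative, not the exact generator I shipped.
function hashHue(name) {
  let h = 0;
  for (const ch of name) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 360;
}

function initials(name) {
  return name.split(/\s+/).filter(Boolean)
    .map(w => w[0].toUpperCase()).slice(0, 2).join('');
}

function logoSvg(name) {
  return `<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">
  <rect width="64" height="64" rx="12" fill="hsl(${hashHue(name)},55%,45%)"/>
  <text x="32" y="40" font-family="sans-serif" font-size="24"
        font-weight="bold" fill="#fff" text-anchor="middle">${initials(name)}</text>
</svg>`;
}
```

Same input, same logo, every build: determinism is what makes 2,493 of them look like a curated brand asset library instead of random placeholders.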

End of journal

Status: ARCHIVED