HackerRank's open ATS scores your résumé by dice roll
The same PDF swings from 66 to 99 across runs, and the reason isn't a bug you can prompt away.
HackerRank quietly open-sourced the guts of an AI résumé screener, hiring-agent, and the most useful thing about it has nothing to do with hiring. It's a fully readable, runnable example of the thing thousands of companies are now doing behind a SaaS curtain: feed a PDF to an LLM, ask it to grade a human out of 100, and let a cutoff decide who a recruiter ever lays eyes on. Now that the prompts and weights are sitting in a public repo, you can actually run the experiment. And the experiment is damning.
The headline finding, documented by Dan Kinsky on his Dan Unparsed blog, is that the same résumé, run with the same command, produced 90, then 74, then 88, then 83. Put it in a loop for a hundred runs and the scores spread from 66 to 99. If a company sets its bar at 85, that identical résumé fails roughly 65% of the time. Nothing changed but luck. My read: this isn't a HackerRank screwup so much as a category error that almost every AI screening vendor is making right now, and open-sourcing the code is the best thing that's happened to the debate.
Why the same PDF scores 66 then 99
The pipeline is straightforward. Your PDF gets parsed to text, an LLM is called six times to pull structured fields (basics, work history, education, skills, projects, awards), it scrapes your GitHub profile and top repos for extra context, then everything goes back into the model for a final graded verdict. The default model is gemma3:4b at temperature 0.1, which is low enough that people assume it's effectively deterministic.
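To make the shape of that pipeline concrete, here is a minimal sketch of the flow as described above. It is not the repo's code: `screen_resume`, `call_llm`, `fetch_github`, and the prompt strings are all hypothetical names invented for illustration. The point it shows is simply how many separate model calls sit between the PDF text and the final number.

```python
from typing import Callable, Dict, List

# The six structured sections the article says are extracted.
SECTIONS = ["basics", "work", "education", "skills", "projects", "awards"]


def screen_resume(
    resume_text: str,
    call_llm: Callable[[str], str],      # hypothetical: prompt in, model text out
    fetch_github: Callable[[str], str],  # hypothetical: basics in, GitHub context out
) -> str:
    """Illustrative flow only; names and prompts are assumptions."""
    # One model call per section to pull structured fields.
    extracted: Dict[str, str] = {}
    for section in SECTIONS:
        extracted[section] = call_llm(
            f"Extract the {section} fields as JSON from this resume:\n{resume_text}"
        )

    # Enrich with GitHub profile and repo context.
    github_context = fetch_github(extracted["basics"])

    # Everything goes back into the model for the final graded verdict.
    grading_prompt = (
        "Grade this candidate out of 100.\n"
        f"Fields: {extracted}\n"
        f"GitHub: {github_context}"
    )
    return call_llm(grading_prompt)


# Count the model calls with a stand-in LLM that records each prompt.
model_calls: List[str] = []


def fake_llm(prompt: str) -> str:
    model_calls.append(prompt)
    return "{}"


screen_resume("resume text here", fake_llm, lambda basics: "github context")
print(len(model_calls))  # 7: six extraction calls plus one grading call
```

Under this reading, a single score is the product of seven sampled generations, and the final grade is conditioned on the output of the first six.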
It isn't, and the reason matters. Temperature controls how aggressively the sampler reaches for less-likely tokens, so low temperature narrows the distribution. It does not pin it. And here's the part most people get wrong: even at temperature 0, production LLM inference frequently isn't bit-for-bit reproducible. Floating-point addition isn't associative, GPU reduction kernels change their summation order depending on batch size and how requests get grouped, and for mixture-of-experts models the routing can shift with batch composition. A reader-flagged GitHub issue from October 2025 shows the effect directly: six consecutive runs at temperature 0.2 returned 27, 34, 32, 34, 34, 30. Different sums, different argmax, different token, different score. You can't prompt your way out of arithmetic.
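Both halves of that argument fit in a few lines of plain Python. This is a toy sketch, not anything taken from the repo or from a real inference stack: the logits are made up, and the magnitudes in the summation example are deliberately exaggerated so the effect is visible without a GPU.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens, two of them nearly tied.
logits = [2.0, 1.9, 0.5]
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
# At 0.1 the runner-up still keeps a sizeable share: narrowed, not pinned.

# Floating-point addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False


def sum_in_order(terms):
    """Plain left-to-right float accumulation, like a naive reduction."""
    total = 0.0
    for term in terms:
        total += term
    return total


# Same three terms, two summation orders, two different "logits".
logit_a_order_1 = sum_in_order([1e16, 1.0, -1e16])  # the 1.0 is absorbed
logit_a_order_2 = sum_in_order([1e16, -1e16, 1.0])  # the 1.0 survives
logit_b = 0.5  # competing token, held fixed


def pick_token(logit_a, logit_b):
    """Greedy (temperature 0) decoding: take the larger logit."""
    return "A" if logit_a > logit_b else "B"


print(pick_token(logit_a_order_1, logit_b))  # B
print(pick_token(logit_a_order_2, logit_b))  # A
```

Real kernels differ by far smaller amounts than this, but the mechanism is the same: when two candidate tokens are close, a change in accumulation order can be enough to flip which one wins, and every later token is conditioned on that choice.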
Swapping gemma3:4b for Gemini tightens things up: scores cluster between 48 and 64 instead of bouncing across a 33-point range. Better, genuinely. But a cutoff of 60 still rejects the same person 28% of the time. "More consistent" is not "consistent," and for a gate that decides whether a human ever reads your application, a roughly one-in-four coin flip is not a rounding error.
There's a tempting alternative explanation worth shooting down. Plenty of commentary, including a dev.to post that walks through a generic ATS architecture, pins résumé-score volatility on the parsing layer: multi-column layouts, OCR misfires, taxonomy drift when you write "React.js" instead of "React." That's a real failure mode in keyword-era ATS systems, and it's worth knowing about. But it's not what's happening here. The input file never changed between runs. The instability lives in the model's judgment, not in the bytes going in.
The stable scores are useless and the useful scores are unstable
The genuinely interesting result is what happens when you break the 100 points apart. The scoring rubric weights it like this:
hiring-agent base score weights (plus up to 20 bonus):
- Open source contributions: 35
- Personal projects: 30
- Work experience: 25
- Technical skills: 10
Look at technical skills. It scored 8/10 in 98 of 100 runs. Rock solid. Why? Because skills are a checklist. You know React or you don't. The model is doing string matching dressed up as judgment, and there's nothing to be inconsistent about. Stable, but it tells you almost nothing about whether someone can engineer.
Projects, which carries 30 points and a detailed rubric with examples, is the noisiest category in the whole system. Across runs the same projects get described as lacking "architectural complexity" in one pass and demonstrating "real-world deployment" in the next. This is the classic LLM-as-judge problem, and it's well documented in the eval literature: models exhibit verbosity bias, position bias, and low self-consistency on open-ended quality judgments. The more genuinely qualitative the call, the less reproducible the answer. A detailed rubric didn't save it.
Then there's experience, and this is the part that should end the conversation. Work experience scored 25/25 on every single run. Sounds great, until you read the prompt that produces it. The entire instruction is two lines: analyze the work and volunteer sections for real-world experience, with a special-consideration note to hand extra points to founders and early-stage startup engineers. No rubric. No anchors. Nothing that says what earns a 15 versus a 25. So an engineer with one internship gets 25/25. A principal with ten years of distributed systems gets 25/25. Consistent, and completely unable to differentiate. (There's a bonus tell here: the evaluation template carries "Software Intern" on line one, undocumented and referenced nowhere else, yet re-running with an explicit senior prompt produced identical scores. The scoring dimensions are position-agnostic.)
That's the trap, stated plainly. The categories that are stable are stable because they're trivial. The category that actually requires judgment is the one that swings 30 points. You cannot have both reliability and discernment from this design, because the reliability comes from not judging. This is Goodhart's Law with a temperature knob: the moment you turn "is this engineer good" into a scalar target, you stop measuring the thing you cared about.
And the weighting compounds it. Sixty-five points out of a hundred ride on open source plus personal projects. Some of the best engineers around have shipped systems that never touched a public GitHub. Weight it this way and an engineer who built something like S3 loses to two internships and a tidy side project before a human reads a word.
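The arithmetic behind that is short enough to check by hand, but here it is as a sketch. The weights and the 20-point bonus cap come from the rubric above; the assumption that a candidate with no public work scores zero in both public-work categories is mine, and the real model may award partial credit.

```python
# Base weights from the rubric; the dictionary keys are illustrative names.
WEIGHTS = {
    "open_source": 35,
    "projects": 30,
    "experience": 25,
    "skills": 10,
}
BONUS_CAP = 20  # "plus up to 20 bonus"


def base_ceiling(zeroed_categories):
    """Best possible base score if the given categories score zero."""
    return sum(
        weight
        for category, weight in WEIGHTS.items()
        if category not in zeroed_categories
    )


public_work = {"open_source", "projects"}
print(WEIGHTS["open_source"] + WEIGHTS["projects"])  # 65
print(base_ceiling(public_work))                     # 35
print(base_ceiling(public_work) + BONUS_CAP)         # 55
```

Under that assumption, an engineer whose work all lives in private repos tops out at 35 base points, or 55 with every bonus point awarded, which is still well short of a cutoff like 85.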
What to actually do with an LLM in screening
If you have any say in how your team screens résumés, the practical takeaway isn't "AI bad." It's that you're using the model for the one job it can't do.
LLMs are excellent at the boring parts of this pipeline. Use them there:
- Parsing and structuring. Turning a messy PDF into clean JSON fields is exactly what they're good at. The six extraction calls in this repo are the legitimate use.
- Hard, checkable facts. Does this person know Python? Did they ship in a regulated environment? Binary, verifiable questions are stable and fair.
- Surfacing, not gating. Use the model to pull out candidates worth a look, never to auto-reject below a number.
What to refuse:
- Any single scalar gate. If your workflow rejects everyone under 85, you are rejecting qualified people by RNG. Before you ship a cutoff, do what the blog post did: run the same résumé 50 to 100 times and measure the spread. If the observed range of scores straddles your cutoff, so that the same file sometimes passes and sometimes fails, the gate is noise.
- LLM judgment on qualitative worth. "Is this experience worth 18 or 24 points" is a vibe check, and vibe checks are precisely what structured hiring, rubrics, and bar-raiser programs spent two decades trying to kill.
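The repeat-run check from the first bullet can be a dozen lines. This is a minimal sketch, assuming you can wrap your screener in a zero-argument function that returns one score per call; `measure_gate` and `score_once` are hypothetical names, and the four scores in the demo are the ones quoted at the top of this article, standing in for a real model.

```python
import statistics
from typing import Callable, Dict


def measure_gate(
    score_once: Callable[[], float], cutoff: float, runs: int = 100
) -> Dict[str, object]:
    """Score the same résumé repeatedly and report how the cutoff behaves."""
    scores = [score_once() for _ in range(runs)]
    rejected = sum(1 for s in scores if s < cutoff)
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # needs at least two runs
        "reject_rate": rejected / runs,
        # True when the same file both passes and fails across runs.
        "straddles_cutoff": min(scores) < cutoff <= max(scores),
    }


# Stand-in scorer replaying the four scores quoted earlier: 90, 74, 88, 83.
replayed = iter([90, 74, 88, 83])
report = measure_gate(lambda: next(replayed), cutoff=85, runs=4)
print(report)
# Two of the four runs fall below 85, so reject_rate is 0.5.
```

In practice you would replace the replayed list with a call into your actual screening pipeline and raise `runs` to 50 or 100. If `straddles_cutoff` comes back true for a single unchanged file, the cutoff is deciding by draw, not by résumé.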
There's also a quieter consequence of putting the rubric in public: it's now trivially gameable. When the weights and prompts are visible, optimizing a résumé for the parser instead of the reader becomes a deterministic exercise. The keyword-stuffing arms race of the old ATS era (white text, hidden skill lists) just gets an LLM-shaped upgrade, where you tune for semantic proximity to the rubric rather than literal string matches. A scorer that can be reverse-engineered will be.
Transparency here is a feature, not the embarrassment it looks like. Most candidates rejected by a black-box screener never learn that a 4-billion-parameter model on someone's laptop flipped a coin on their career. This repo lets you prove it. Treat the scalar as a search aid, keep the parsing, throw out the gate, and put a human back in the loop for the judgment calls. A tool that can't differentiate isn't filtering for quality. It's just filtering.
Lenn writes about cloud platforms, Kubernetes internals, and the infrastructure decisions that quietly make or break engineering organizations. Based in Berlin's vibrant tech scene, they have a talent for turning dense platform-engineering topics into prose that people actually finish reading.