Craft
PRD evaluator score: how to read a 1-5 PRD grade
What a PRD evaluator score actually means, what each grade on a 1-5 scale signals, and where the launch-readiness line sits. Stop treating the number as a black box.
A PRD evaluator score is a single grade, usually 1 to 5, that rolls up a structured review of your spec into one number a reader can act on. A 5 is ship-ready. A 3 is the fence: substance is there, gaps remain. Anything at 2 or below shouldn't go to engineering yet. The score is only useful if you know the rubric behind it.
That last sentence is where most tools fail you. They hand back a number and no scale. You get a "4 out of 5" with no definition of what a 4 means, what would have made it a 5, or where the line is below which you shouldn't ship. The number looks rigorous. It's decoration until someone defines it. So let me define it.
What a PRD evaluator score is measuring
Start with what it is not. The score is not a measure of how polished the prose is, how confident the author sounds, or how many sections the template filled. A spec can be beautifully written and score a 2, because the score tracks one thing: how ready this document is to be built from without a follow-up meeting to fill the holes.
That reframe matters because of a trap most reviewers fall into. The instinct is to grade what's on the page, the sentences in front of you, and improve them. The work that moves a score is finding what's missing: the success metric with no baseline, the user flow with only a happy path, the non-functional requirement nobody wrote down. A good evaluator hunts for absence, then grades how much of it is left. Polish raises a score by a tenth. Closing a structural gap raises it by a point.
The 1-to-5 scale, defined
Here's the scale I use, and the one Thinkr's critique maps its verdict onto. Each grade is a launch decision, not a vibe.
5, ship-ready. An engineer could build from this without scheduling a clarifying meeting. The problem is framed before the solution. Every success metric has a baseline, a target, and a date. Acceptance criteria are testable. Edge cases and the unhappy paths are written as conditions, not hopes. No blockers, at most a couple of minor findings that don't gate the build. This grade is rare on a first draft, and it should be.
4, ship with edits. The spine is sound and the thinking holds, but there are major findings a reader has to resolve first: an ambiguous flow, one metric that can't fail, a non-functional requirement left for "later." Fixable in an afternoon by the author. You ship after the edits, not before.
3, the fence. The substance is real, the gaps are real, and which way it falls depends on the one or two things you fix next. A 3 has at least one blocker or a cluster of majors that compound. Most working drafts live here. A 3 is not a failing grade; it's a "not yet, and here's the shortest path to yes."
2, structural gaps. Multiple blockers. The problem may be framed loosely or not at all, the metrics are directional ("increase engagement"), the acceptance criteria are too vague to build. The author needs to rethink sections, not edit sentences. Sending a 2 to engineering means the clarifying meeting happens anyway, just later and more expensively.
1, not a spec yet. A premise, not a plan. It names a feature and assumes the rest. There's nothing to red-team because the claims haven't been made specific enough to be wrong. The fix isn't review; it's another drafting pass. Sometimes you get here honestly, with a real idea that hasn't been written down yet.
Notice the scale is asymmetric on purpose. The distance from 1 to 2 is a lot of writing. The distance from 4 to 5 is a few careful edits. The grades aren't evenly spaced in effort, and a tool that pretends they are is lying to you about how close you are.
How per-pass findings roll up into one number
The score doesn't come from a feeling. It comes from findings, and findings come from passes. A structured critique reads the spec several times, each time looking for one class of problem: is the problem framed, are the metrics falsifiable, is it ready for engineering, are the user flows covered, are the edge cases mapped, and so on. Thinkr runs eleven such passes. Each one returns findings, and each finding carries a severity: blocker, major, or minor.
The rollup is a function of severity, not a count. This is the part the black-box tools skip, so here's the rule that does the work:
- A blocker is a finding that, left in, means the team builds the wrong thing or can't build at all. A success metric that can't fail. A core flow with no error state. One blocker caps the score at 3. Two or more cap it at 2.
- A major is real friction that a follow-up meeting would otherwise resolve. Several majors with no blockers land you at a 4, or push you to a 3 if they cluster in one section and compound.
- A minor is a tightening, a clearer sentence, a missing definition. Minors alone never drag a 5 below a 4.
So the number is severity-weighted, not additive. Twenty minor findings is a 4 with some homework. One blocker is a 3 no matter how clean the rest reads. That weighting is the whole point: it stops a long list of small nits from outvoting the one gap that actually stops the build. Most "review my doc" tools do the opposite. They return twenty equal suggestions with no ranking and no position, because they're built to be agreeable, and an agreeable tool won't tell you which finding is the one that matters. A grade without ranked severity isn't a grade. It's a word count.
Where the launch-readiness line sits
If the score is going to drive a decision, it needs a threshold. Here's the one I hold to: 4 and up means proceed, 3 and below means don't ship yet.
A 4 ships after its edits land. A 5 ships. A 3 means the answer is "not yet," and the useful output isn't the number, it's the shortest list of fixes that would move it to a 4. Below that, the document goes back to drafting, not to engineering.
The threshold matters more than the exact number. A reviewer who tells you "this is a 3" and stops has done half the job. A reviewer who tells you "this is a 3, here's the one blocker holding it there, fix it and you're at a 4" has done all of it. The score points at the work. The verdict names it. (For why ending in a ranked verdict beats ending in a flat list, the longer argument is in the principles behind a rigorous critique.)
The pass that produces the number
In an eleven-pass critique, ten passes gather evidence. The last one, Synthesis and Final Verdict, is where the evidence becomes a decision. It reads every finding from the earlier passes, weights them by severity, lands on the 1-to-5, and writes the verdict: ship, revise, or rethink, plus the single most important thing to fix first.
That ordering is deliberate. You can't synthesize what you haven't gathered, and you can't grade honestly if you grade as you read, when the first impression colors everything after it. Separating evidence from verdict is how you keep the score from being a mood. It's also the skill PMs are never taught. We learn to write specs. We're never taught to red-team one, so the first real grade most drafts get is someone else's, in a review meeting, too late to be cheap to fix.
What the black box hides, as of June 2026
Here's where the tools stand. ChatPRD ($15/mo Pro) markets a "CPO-level review" and publishes no rubric, so a passing grade is a vibe you can't audit. PMPrompt names three vague areas to improve and stops short of a scale. Centercode's Draft Doctor flags failures inline without a full rubric or a named author behind the judgment. In every case the number, or the verdict, is asserted, not shown.
Thinkr is priced above ChatPRD, and the reason is the part you can see. The eleven passes are named. The findings are severity-classified. The rollup rule is the one above, and the Synthesis and Final Verdict pass shows its work: which findings set the grade, and what would move it. The whole stack runs on Google Gemini, and your spec is never used to train a model. You can disagree with the grade. What you can't do with a defined rubric is wonder where it came from.
A score you can't interrogate is worse than no score, because it borrows the authority of a number without earning it. Define the scale, weight by severity, name the threshold, and the grade starts doing real work: pointing at the one fix between you and a spec your team can build from. If you want the by-hand version of the same checks, start with the PRD review checklist or learn how to critique your own PRD. And if you want a second reader who shows the rubric, run the critique and read the verdict, not just the number.