How to Write Testable Acceptance Criteria
Turn vague acceptance criteria into runnable Gherkin Given-When-Then a tester can verify, plus measurable non-functional criteria like p95 latency under 500ms.
Testable acceptance criteria are conditions a tester can run as written, with a clear pass or fail and no interpretation in between. The reliable form is Gherkin's Given-When-Then: a starting state, an action, an expected result. "Handle errors gracefully" is not testable. "Given a revoked invite, When the user opens the link, Then they see a request-access screen" is.
The difference matters because most acceptance criteria fail silently. They read fine in the doc, everyone nods, and the gap between what you meant and what shipped only appears in QA, when it's expensive to fix. This is a how-to. I'll walk through the before-and-after of converting vague criteria to runnable ones, show how to write the non-functional half with real numbers, and point you at the review passes that catch the ones you miss.
What "testable" actually means
A criterion is testable when two people reading it would build the same test. That's the whole bar. If "gracefully" means a toast to you and a retry screen to your engineer, the criterion isn't a requirement. It's a mood.
The tell is whether the criterion can fail. "The page should be fast" can't fail, because nobody agrees on fast. "The page loads in under one second at p95" can fail, and that's what makes it useful: it gives QA a number to check you against. Every criterion you write should have a built-in way to be wrong.
The before: vague criteria that pass review and fail QA
Here are the four shapes I see most in real PRDs. Each reads like a requirement and specifies nothing.
- "Handle errors gracefully." Which errors? Graceful how?
- "The search should return relevant results." Relevant by whose judgment?
- "Users can reset their password easily." Easy is not a state a tester can verify.
- "The system should be performant under load." No number, no load defined.
None of these is checkable. An engineer reads "handle errors gracefully," shrugs, and builds whatever's quickest, which is almost never what you pictured. The criterion didn't constrain anything. It just looked like it did. This is one of the most common PRD mistakes, and it's near the top of the list because it hides so well.
The after: Given-When-Then
Gherkin's core is three keywords in a fixed order. Given sets the starting state. When is the single action under test. Then is the observable result. The official Gherkin reference defines the full grammar, including And and But for extending a step, but you can write usable scenarios from those three words alone.
Take the worst offender and convert it. "Handle errors gracefully" becomes a set of named scenarios, one per error you actually care about:
Given a user with a revoked invite, When they open the shared link, Then they see a request-access screen, not a 404.
Given the payment API times out after 5 seconds, When the user submits checkout, Then the card is not charged, and they see a "try again" message with their cart intact.
Notice what happened. "Gracefully" forced you to name the error (revoked invite, API timeout), the trigger (open the link, submit checkout), and the exact result (request-access screen, cart intact). You can't write Given-When-Then and stay vague. The form does the work. And you've drafted the test as a side effect: QA reads the Then line and knows precisely what to assert.
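As a rough illustration of how the first scenario maps onto a test, here is a minimal Python sketch. The `Invite` class and the `open_shared_link` function are hypothetical stand-ins for whatever your application actually exposes, not a real API. The only point is that each Gherkin line becomes one block of the test.

```python
from dataclasses import dataclass


@dataclass
class Invite:
    """Hypothetical invite record, holding only what this scenario needs."""
    token: str
    revoked: bool = False


def open_shared_link(invite: Invite) -> str:
    """Hypothetical handler: returns the name of the screen the user lands on."""
    if invite.revoked:
        return "request-access"
    return "document"


def test_revoked_invite_shows_request_access() -> None:
    # Given a user with a revoked invite
    invite = Invite(token="abc123", revoked=True)

    # When they open the shared link
    screen = open_shared_link(invite)

    # Then they see a request-access screen, not a 404
    assert screen == "request-access"
    assert screen != "404"
```

A runner such as pytest would pick up the `test_` function by name. The Given, When, and Then comments are the criterion itself, so a failing assertion points straight back at the line of the spec that was broken.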
A few rules that keep scenarios runnable:
- One When per scenario. If you need two actions, you have two scenarios. A scenario testing two things at once tells you nothing useful when it fails.
- The Then is observable. "Then the user feels confident" is not a result a test can read. "Then a confirmation banner appears with the order number" is.
- Use And to extend, not to smuggle. "Then X, And Y" is fine when both are observable results of the same action. It is not a place to staple a second test.
- Write the unhappy paths first. The happy path writes itself and nobody files a bug against it. The revoked invite, the expired session, the partial permission: those are your week-one bugs, so those are the scenarios worth your time.
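The first rule is mechanical enough to check automatically. The sketch below is one possible lint, assuming scenarios are written one step per line as in a Gherkin feature file. The function names are illustrative, and it treats an And or But as continuing whichever keyword came before it, so "When X, And Y" counts as two actions.

```python
import re

# Matches a step keyword at the start of a line (Gherkin keywords are capitalized).
STEP = re.compile(r"^\s*(Given|When|Then|And|But)\b")


def effective_keywords(scenario: str) -> list[str]:
    """Return each step's keyword, resolving And/But to the keyword they continue."""
    current = None
    resolved = []
    for line in scenario.splitlines():
        match = STEP.match(line)
        if not match:
            continue  # blank lines, titles, comments
        keyword = match.group(1)
        if keyword in ("And", "But"):
            if current is None:
                continue  # nothing to continue yet; ignore the stray step
            keyword = current
        else:
            current = keyword
        resolved.append(keyword)
    return resolved


def check_scenario(scenario: str) -> list[str]:
    """Return a list of problems; an empty list means the scenario passes."""
    whens = effective_keywords(scenario).count("When")
    if whens != 1:
        return [f"expected exactly one When step, found {whens}"]
    return []
```

For example, a scenario with "When the user submits checkout" followed by "And the user reloads the page" is reported as having two When steps, which is the signal to split it into two scenarios.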
If you want the cheaper version of this drill, the same Given-When-Then exercise is the core of how to critique your own PRD before anyone else reads it.
The half everyone skips: measurable non-functional criteria
Functional criteria say what the feature does. Non-functional criteria say how well it has to do it, and most drafts go silent here. That silence is where sprint estimates blow up: the engineer asks how many concurrent users this holds, the room turns to you, and you don't have a number.
The fix is the same as the functional fix. Replace the adjective with a measurement.
- Not "fast." Say p95 latency under 500ms at 1,000 concurrent users.
- Not "reliable." Say 99.9% of write requests succeed over a rolling 30-day window.
- Not "scalable." Say sustains 10,000 requests per minute with no error-rate increase above baseline.
- Not "secure." Say all PII encrypted at rest with AES-256, and no auth token logged in plaintext.
Why p95 and not average? Because averages hide the users you're hurting. If your average response is 400ms but p95 is four seconds, one in twenty requests takes four seconds or longer, and the average told you everything was fine. Pick the percentile that matches the promise: p95 for "most users never wait," p99 when the tail is the product.
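To see how an average can mask a slow tail, here is a small sketch with made-up sample data: 94 requests at 100ms and 6 at four seconds. The `percentile` helper uses the nearest-rank definition, which is one of several conventions, so treat it as an assumption and use whatever definition your monitoring tool uses.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100  # integer ceiling of n * pct / 100
    return ordered[max(rank, 1) - 1]


# Made-up data: most requests are quick, a handful are very slow.
latencies_ms = [100] * 94 + [4000] * 6

average = mean(latencies_ms)         # 334
p95 = percentile(latencies_ms, 95)   # 4000

target_ms = 500
print(f"average {average}ms passes: {average < target_ms}")  # True
print(f"p95 {p95}ms passes: {p95 < target_ms}")              # False
```

Against a 500ms target, the average passes and the p95 fails on the same data, which is why the criterion has to name the percentile.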
You don't have to invent these categories. The quality vocabulary engineers already use comes from ISO/IEC 25010: performance efficiency, reliability, security, compatibility, and the rest. Walk that list, and for each one that applies, attach a number. The full version of that walk is the non-functional requirements checklist. The point holds whichever framework you borrow: a non-functional requirement without a number is a wish.
Non-functional criteria can take Given-When-Then too, when there's a trigger:
Given 1,000 concurrent users on the dashboard, When each requests their activity feed, Then p95 response time stays under 500ms, and no request returns a 5xx.
That's a load test, written in English, that QA can translate line for line into a performance-tool script.
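The Then line of that scenario can be checked mechanically once a load tool has recorded its results. The sketch below covers only that last step. `RequestResult` and `meets_feed_criterion` are hypothetical names, and generating the 1,000 concurrent users is left to whichever performance tool you use.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Hypothetical record of one request made during the load run."""
    latency_ms: float
    status: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) * 95 + 99) // 100  # integer ceiling of n * 0.95
    return ordered[rank - 1]


def meets_feed_criterion(results, limit_ms=500):
    """Then p95 response time stays under limit_ms, and no request returns a 5xx."""
    if not results:
        return False  # an empty run is not evidence of a pass
    under_limit = p95([r.latency_ms for r in results]) < limit_ms
    no_server_errors = all(not 500 <= r.status <= 599 for r in results)
    return under_limit and no_server_errors
```

Both halves of the Then are separate booleans on purpose: when the check fails, you want to know whether it was the latency or the error rate that broke the promise.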
Where these criteria get caught when you miss them
Writing good criteria by hand takes discipline, and discipline is the first thing to go at 6pm the day the doc is due. So the second line of defense is review. Two passes in Thinkr's 11-pass Critique exist for exactly this gap.
The Engineering Readiness pass checks that every requirement is specific enough to estimate and test. It flags the "handle errors gracefully" criteria, the missing non-functional half, the metrics with no number. If an engineer couldn't size it from the page, the pass marks it, because a requirement that can't be estimated isn't ready to build.
The Edge Case and QA pass goes after the scenarios you didn't write. It maps each user flow to its failure states across user, system, data, and security, then checks whether each one has a criterion. The empty state, the expired session, the offline case: if a flow has no error path in the spec, that's not because it has none, it's because you stopped at the happy path. This is the pass that turns "I covered the main flow" into a list of the four branches you skipped.
Both passes return findings classified by severity (blocker, major, minor) and land each one as a comment on the line it concerns, so the gap surfaces before your team opens the doc instead of in the standup about it. That severity ranking is the part most "review my doc" tools skip: they hand you twenty equal suggestions with no position on which one actually sinks the build. A real review takes a position. (More on why that distinction matters in the difference between an AI PRD reviewer and a writer.)
The one-paragraph version
Write acceptance criteria a tester can run without asking you a question. For behavior, use Given-When-Then: a starting state, one action, an observable result, and write the unhappy paths first because those are your real bugs. For quality, replace every adjective with a number: not "fast" but p95 under 500ms, not "reliable" but 99.9% over 30 days. Then let the Engineering Readiness pass and the Edge Case and QA pass catch the ones you missed, because you will miss some, and the only question is whether it happens in review or in production. If you want the standing version of these checks, it's in the PRD review checklist.