
How to Test Claude Code Output Automatically (Quality Gates Explained)

12 min read · Apr 19, 2026 · DeepWork

You can't tell if Claude Code output is correct just by looking at it. Not reliably. Not at speed. The model is good enough that wrong output often looks right — it compiles, it passes the eyeball test, it has the right structure on the surface. The failure is underneath.

This is the trust gap in AI-assisted development: every Claude Code output is a hypothesis about what you wanted. Most of them are right. Enough are wrong to matter. And as output quality degrades over time, the wrong percentage creeps up while the output continues to look the same.

Quality gates are the automated verification layer that fills this gap. They run after every Claude Code step, evaluate output against explicit criteria, and either pass it forward or surface the failure with enough context to fix it. This article explains what they are, how they work, and how to implement them.

192 workflow runs analyzed · 919 gate evaluations run · 95%+ pass rate with gates

The Trust Problem

When you write a unit test, you know exactly what correct output looks like. You define it in code and the test runner checks it every time. Failure is binary and immediate.

Claude Code outputs don't work that way. "Correct" is usually a distribution of acceptable outputs, not a single value. A code review might be correct even if it surfaces different issues than the one before it. A data validation step might be correct even if it produces slightly different field names than last run. The evaluation requires judgment, not just comparison.

That judgment is exactly what makes manual review feel necessary. And it's also what makes it impossible to scale. Human judgment is expensive, slow, and inconsistent. Once you're running Claude Code at any real volume, you need evaluation that doesn't require someone to read every output.

The silent failure mode: Claude Code's worst failures aren't crashes — they're subtly wrong outputs that look correct on first read. Context drift, schema variations, criteria that got softened by compaction. You catch the loud failures. The quiet ones accumulate downstream until something breaks in a way that takes hours to trace back.

The solution isn't to hire more reviewers. It's to define what "correct" means formally enough that software can check it — and then run that check automatically after every step.


Why Manual Review Doesn't Scale

At low volume, manual review works. You're running 5–10 Claude Code tasks per day, reading each output, catching the occasional miss. The overhead is manageable. The coverage is real.

At scale, this breaks down in three ways. First, attention degrades. After 15–20 outputs, a developer's ability to spot subtle quality issues drops sharply. Review becomes a checkbox. Second, it doesn't parallelize — the bottleneck is human attention, which is fixed regardless of how much automation you add around it. Third, it isn't systematic. You catch what you notice. Different reviewers catch different things. The same reviewer catches different things on different days.

The 100-task-per-day wall is where this becomes untenable. At that volume, meaningful manual review takes 3–4 hours daily. Most organizations can't afford that, which means they end up sampling (missing failures between samples) or rubber-stamping (catching nothing).

Quality gates replace human judgment with defined criteria. The criteria run consistently, in parallel, at zero marginal cost per additional task. You trade the flexibility of human review for the consistency and scale of automated evaluation. For production workflows, that's the right trade.


What Quality Gates Are

A quality gate is a post-step check that evaluates Claude Code output against a set of defined criteria and produces a pass/fail verdict with a score and failure reasoning.

The key components are a pass threshold, a set of weighted criteria, and an evaluator type for each criterion:

quality gate — anatomy of a definition
gate: check-code-review-output
threshold: 80
criteria:
  - id: issues_identified
    description: "Output identifies at least one specific code issue with line reference"
    weight: 30
    evaluator: model

  - id: severity_classified
    description: "Each issue has a severity label: low, medium, high, or critical"
    weight: 25
    evaluator: schema
    schema:
      required_enum: [low, medium, high, critical]

  - id: actionable_fix
    description: "Each issue includes a concrete suggested fix, not just a description"
    weight: 30
    evaluator: model

  - id: no_hallucinated_lines
    description: "Line references exist in the actual source file"
    weight: 15
    evaluator: assertion

When this gate runs against a Claude Code output, it checks each criterion, weights the results, and produces a score. If the score falls below 80, the gate fires — and the failure context (which criteria failed, why, and with what score) gets injected into the retry. The model knows exactly what to fix.
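
To make those mechanics concrete, here is a minimal Python sketch of a weighted verdict. The CriterionResult shape, the gate_verdict helper, and the rule that a criterion scoring below the threshold counts as failed are illustrative assumptions for this article, not DeepWork's actual implementation.

from dataclasses import dataclass

@dataclass
class CriterionResult:
    criterion_id: str
    score: int       # 0-100, from the schema/model/assertion evaluator
    weight: int      # relative weight from the gate definition
    reasoning: str   # evaluator's explanation, reused in the retry prompt

def gate_verdict(results, threshold=80):
    """Weight per-criterion scores into one gate score and a pass/fail verdict."""
    total_weight = sum(r.weight for r in results)
    score = sum(r.score * r.weight for r in results) / total_weight
    failed = [r for r in results if r.score < threshold]
    return {
        "score": round(score),
        "passed": score >= threshold,
        # failure context that gets injected into the informed retry
        "failures": [(r.criterion_id, r.score, r.reasoning) for r in failed],
    }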


The Four Types of Quality Gates

1. Syntax Gates

Syntax gates check structure, format, and schema compliance. They're the fastest and cheapest gates to run — pure rule-based evaluation with no model call. If output is supposed to be JSON with specific fields, a syntax gate verifies that. If it's supposed to follow a naming convention, a syntax gate checks it.

Syntax gates catch the most common non-determinism failure: structural variation in output that breaks downstream pipeline steps. A code review that produces {"issues": []} on one run and {"findings": []} on the next looks fine to a human skimming it. It silently breaks any downstream step that keys on issues.

gate: syntax-check
type: schema
criteria:
  - id: valid_json
    check: "output is parseable JSON"
  - id: required_fields
    check: "output contains: summary, issues, severity, recommendations"
  - id: severity_enum
    check: "severity field value in [low, medium, high, critical]"
  - id: issues_array
    check: "issues field is an array"

Syntax gates should be on every step with structured output. They're near-zero cost and catch a category of failure that would otherwise flow silently downstream.
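
A syntax gate like the one above needs nothing beyond the standard library. The sketch below is one way to express those checks in Python: the syntax_gate function name and the violation strings are illustrative, while the required fields and severity enum mirror the gate definition above.

import json

REQUIRED_FIELDS = {"summary", "issues", "severity", "recommendations"}
SEVERITY_ENUM = {"low", "medium", "high", "critical"}

def syntax_gate(raw_output: str) -> list:
    """Return a list of violations; an empty list means the gate passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not parseable JSON: {exc}"]

    violations = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    if data.get("severity") not in SEVERITY_ENUM:
        violations.append(f"severity {data.get('severity')!r} not one of {sorted(SEVERITY_ENUM)}")
    if not isinstance(data.get("issues"), list):
        violations.append("issues field is not an array")
    return violations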

2. Semantic Gates

Semantic gates check meaning, not just structure. They answer: does this output say what it's supposed to say? Is the code review actually identifying real issues, or is it hallucinating? Does the data extraction accurately represent the source document?

Semantic gates require a model evaluator — you can't check meaning with a regex. The evaluator reads the output, cross-references it against the source material and the criteria, and scores each criterion based on whether the semantic requirements are met.

gate: semantic-check
type: model
evaluator_prompt: |
  You are evaluating a Claude Code output against quality criteria.
  Source material: {source}
  Output to evaluate: {output}
  Score each criterion 0-100 with reasoning.
criteria:
  - id: factual_accuracy
    description: "Claims in output are supported by source material"
    weight: 40
  - id: completeness
    description: "Output addresses all requirements in the original task"
    weight: 35
  - id: no_fabrication
    description: "Output does not introduce information absent from source"
    weight: 25

The cost of a semantic gate is roughly one additional model call per evaluated step. For production workflows, that cost is usually well worth it — semantic failures are the ones that look correct but aren't, and they're the hardest to catch manually.
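
In code, a semantic gate is one extra model call plus a little score arithmetic. The sketch below assumes a hypothetical call_model client and an evaluator that replies in JSON; both are stand-ins for whatever model interface you already use, and the criteria weights mirror the definition above.

import json

EVALUATOR_PROMPT = """You are evaluating a Claude Code output against quality criteria.
Source material: {source}
Output to evaluate: {output}
Score each criterion 0-100 with reasoning.
Reply as JSON: {{"scores": [{{"id": "...", "score": 0, "reasoning": "..."}}]}}"""

WEIGHTS = {"factual_accuracy": 40, "completeness": 35, "no_fabrication": 25}

def semantic_gate(source, output, call_model, threshold=80):
    """One extra model call: an evaluator scores the output, then the scores are weighted."""
    reply = call_model(EVALUATOR_PROMPT.format(source=source, output=output))
    scores = {s["id"]: s for s in json.loads(reply)["scores"]}
    total = sum(WEIGHTS.values())
    weighted = sum(scores[cid]["score"] * w for cid, w in WEIGHTS.items()) / total
    return {"score": round(weighted), "passed": weighted >= threshold, "detail": scores}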

3. Regression Gates

Regression gates compare current output against a baseline — a previously verified good output, a reference standard, or a historical average. They catch quality degradation over time: the slow drift that happens as context windows grow, sessions get stale, or model behavior shifts.

This is the gate that addresses the degradation problem directly. Without regression gates, quality can drop 12+ points over a series of runs and nobody notices because each individual output looks fine. With regression gates, the drift is quantified against a known baseline and surfaces as a gate failure before it compounds.

gate: regression-check
type: regression
baseline: ./baselines/code-review-v2.json
criteria:
  - id: score_no_regression
    description: "Current score within 10 points of baseline score"
    comparison: "score >= baseline.score - 10"
  - id: coverage_maintained
    description: "Issue count within 80% of baseline issue count"
    comparison: "issues.length >= baseline.issues.length * 0.8"
  - id: severity_distribution
    description: "High/critical issue ratio not below baseline ratio"
    comparison: "high_ratio >= baseline.high_ratio * 0.85"

Regression gates are especially valuable for scheduled, repeating workflows — daily report generation, weekly code health checks, recurring data validation pipelines. Each run is checked against the last known good run, so quality regression gets caught before it accumulates.
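
The comparisons above translate directly into a few lines of code. The sketch below assumes the baseline file records the score, issues, and high_ratio fields referenced in the gate definition; the regression_gate helper and its messages are illustrative.

import json

def regression_gate(current, baseline_path):
    """Compare the current run against a stored last-known-good baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)   # assumed to record score, issues, high_ratio

    failures = []
    if current["score"] < baseline["score"] - 10:
        failures.append(f"score regressed: {current['score']} vs baseline {baseline['score']}")
    if len(current["issues"]) < 0.8 * len(baseline["issues"]):
        failures.append("issue count fell below 80% of baseline coverage")
    if current["high_ratio"] < 0.85 * baseline["high_ratio"]:
        failures.append("high/critical ratio fell below 85% of baseline")
    return failures   # empty list means no regression detected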

4. Assertion Gates

Assertion gates are the most targeted type: explicit checks for specific, verifiable facts. Unlike semantic gates (which evaluate meaning holistically), assertion gates check one thing at a time with a deterministic evaluator.

Good use cases: line references exist in the actual file, function names are spelled correctly, referenced URLs are reachable, extracted dates are valid, calculated values match expectations. These are the facts that model evaluators miss because they're checking meaning at a higher level — they don't catch a typo in a function name or a fabricated line number.

gate: assertion-check
type: assertion
assertions:
  - id: line_refs_exist
    check: |
      for each issue in output.issues:
        assert source_file.lines[issue.line_ref] exists
    weight: 20

  - id: function_names_valid
    check: |
      for each function_name in output.mentioned_functions:
        assert function_name in ast.parse(source_file).functions
    weight: 20

  - id: no_duplicate_issues
    check: |
      issue_ids = [i.id for i in output.issues]
      assert len(issue_ids) == len(set(issue_ids))
    weight: 15

Assertion gates are deterministic and cheap — no model call, just code. They catch a different class of failure than semantic or syntax gates, which is why mature gate stacks combine all four types rather than relying on any one.
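
Here is a rough Python version of those checks, assuming the output shape from the earlier examples and a Python source file under review; the assertion_gate helper and its failure messages are illustrative.

import ast

def assertion_gate(output, source_path):
    """Deterministic fact checks against the actual source file; no model call."""
    source = open(source_path).read()
    line_count = len(source.splitlines())
    failures = []

    # Every referenced line must exist in the file
    for issue in output["issues"]:
        if not 1 <= issue["line_ref"] <= line_count:
            failures.append(f"issue {issue['id']}: line {issue['line_ref']} does not exist")

    # Every mentioned function must be defined in the source
    defined = {n.name for n in ast.walk(ast.parse(source))
               if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}
    for name in output.get("mentioned_functions", []):
        if name not in defined:
            failures.append(f"function {name!r} not found in {source_path}")

    # Issue ids must be unique
    ids = [i["id"] for i in output["issues"]]
    if len(ids) != len(set(ids)):
        failures.append("duplicate issue ids")
    return failures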


Implementation Patterns

Gate placement: where to put them

The core principle: gate at every step boundary where bad output would propagate. Not just at the end of the workflow. Not just at "important" steps. At every step that produces output consumed by a subsequent step.

In our 919 gate evaluations, 31% of failures were caught at intermediate step boundaries. Those failures averaged 1.1 retries to resolve. The same failures caught only at final output averaged 3.7 retries — a 3.4× cost multiplier on the same underlying error, because downstream steps had built on the bad output and the failure had compounded.

Gates only at end — errors compound
Step 1: parse requirements ✓
Step 2: extract entities
  → wrong field types inferred
  → no gate, continues
Step 3: generate validation
  → builds on bad types
  → no gate, continues
Step 4: write tests
  → tests validate wrong types
  → final gate fires: 52/100

Trace root cause: 4 reruns
Fix: restart from step 2

Gates at every boundary — caught early
Step 1: parse requirements ✓
Step 2: extract entities
  → wrong field types inferred
  → gate fires: 61/100
  → "field types not validated
     against spec source"
  → single informed retry
  → corrected: 94/100 ✓
Step 3: generate validation ✓
Step 4: write tests ✓

Gate catches 1 error,
1 informed retry, done

Informed retries: how to use gate failures

A gate that fires but doesn't surface context is wasted. The value of a quality gate isn't the score — it's the failure reasoning that informs the retry.

When a gate fails, the retry prompt should include: the original task instructions, the specific criteria that failed, the evaluator's reasoning for each failure, and any source material relevant to the failed criteria. This gives the model a targeted correction instead of a blank re-attempt.

retry prompt construction with gate failure context
# Original step prompt
TASK: Extract all entity types from the provided API spec.

# Gate failure context injected into retry
PREVIOUS_ATTEMPT_FAILED:
  Gate: check-entity-extraction
  Score: 61/100

  Failed criteria:
    ✗ field_types_validated (score: 42/100)
      Evaluator: "Optional fields marked as required.
      Spec lines 14-31 define these as optional.
      Re-examine the 'required' array in the spec."

    ✗ nested_schemas_expanded (score: 58/100)
      Evaluator: "addressSchema is referenced but not expanded.
      Downstream steps need the full nested type definition."

  FIX: Re-examine spec lines 14-31 for optionality.
  Expand all $ref schemas fully before returning.

# Model now has a targeted correction, not just a re-run
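
Mechanically, building that retry prompt is a small formatting step. The sketch below assumes failure records carrying an id, score, and reasoning; build_retry_prompt is an illustrative helper, not a DeepWork API.

def build_retry_prompt(original_prompt, gate_name, score, failures):
    """Append gate failure context to the original task so the retry is targeted, not blind."""
    lines = [original_prompt, "", "PREVIOUS_ATTEMPT_FAILED:",
             f"  Gate: {gate_name}", f"  Score: {score}/100", "", "  Failed criteria:"]
    for f in failures:   # each failure: {"id", "score", "reasoning"}
        lines.append(f"    ✗ {f['id']} (score: {f['score']}/100)")
        lines.append(f"      Evaluator: \"{f['reasoning']}\"")
    return "\n".join(lines)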

Thresholds: how to set them

Start with 75–80 for new gates. Run 20–30 evaluations, look at the score distribution, and adjust. Gates set too high (90+) on criteria that are genuinely hard will retry constantly without improving. Gates set too low (below 70) don't catch the failures you care about.

Different criteria warrant different weights. Structural correctness (syntax, schema compliance) should weight heavily because failures propagate and are easy to fix. Style or nuance criteria can weight lighter because variation is acceptable and over-penalizing them increases retry cost without improving output quality.
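
A quick way to sanity-check a candidate threshold is to look at the distribution of past gate scores before committing to it. The inspect_scores helper below is an illustrative sketch: if a large share of historical scores would fail the candidate threshold, the gate will retry constantly without improving output.

from statistics import quantiles

def inspect_scores(past_scores, candidate_threshold=80):
    """Summarize the score distribution from 20-30 gate evaluations."""
    p25, p50, p75 = quantiles(past_scores, n=4)
    fail_rate = sum(s < candidate_threshold for s in past_scores) / len(past_scores)
    # a very high fail rate suggests the candidate threshold is too strict
    return {"p25": p25, "median": p50, "p75": p75, "would_fail": fail_rate}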


How DeepWork Automates This

Implementing quality gates manually means maintaining gate definitions, writing evaluators, injecting failure context into retries, tracking scores over runs, and tuning thresholds. That's a lot of tooling to build before you get any value.

DeepWork is built around quality gates as the core primitive. You define your workflow in YAML — job steps, gate criteria per step, thresholds — and the runtime handles execution, evaluation, retry injection, and scoring automatically.

deepwork job definition — gates built in
# code-review.yml
name: code-review
skill: ./skills/code-review.md

steps:
  - id: analyze
    prompt: "Review {input.file} for issues"
    gate:
      threshold: 80
      criteria:
        - "At least one issue identified with line reference"
        - "Each issue has severity: low/medium/high/critical"
        - "Each issue includes actionable fix suggestion"
        - "No fabricated line references"

  - id: prioritize
    prompt: "Sort issues by severity and impact"
    gate:
      threshold: 75
      criteria:
        - "Issues ordered high-to-low by severity"
        - "Ordering rationale provided for each grouping"

  - id: summarize
    prompt: "Write executive summary of findings"
    gate:
      threshold: 70
      criteria:
        - "Summary covers critical and high issues"
        - "Actionable next steps included"
        - "Under 200 words"

Getting started takes under 2 minutes: install via Homebrew, write a SKILL.md defining your workflow's quality standards, and run your first gated job. Each step is evaluated automatically, failed steps retry with injected context, and you get a full gate score log at the end.

The learn loop: After a set of workflow runs, DeepWork's deepwork learn command analyzes which criteria triggered the most retries, extracts the correction patterns from successful retries, and updates your SKILL.md automatically. The next run starts from improved instructions that bake in what past failures taught. Workflows compound improvement over time instead of degrading.


Results from 192 Runs

We ran 192 Claude Code workflows through DeepWork and recorded every gate evaluation — 919 in total across 10 job types. The results quantify what quality gates actually deliver in production.

83% pass rate without gates · 95%+ pass rate with gates · 3.4× fewer retries when failures are caught at step boundaries

The 83% baseline is the first-run pass rate for workflows running without quality infrastructure — no gate evaluation, no informed retries, no regression tracking. At 83%, one in six runs is a failure you either catch manually or miss entirely.

With quality gates, YAML job definitions, and SKILL.md context management, the same workflow types averaged 95%+ first-run pass rates. The improvement isn't from the model getting better — it's from the workflow infrastructure changing the error distribution. Failures get caught at step boundaries instead of compounding to final output. Retries are informed instead of blind. Standards are enforced fresh each session instead of drifting with context.

Gate type | What it catches | Evaluator cost | When to use
Syntax | Schema violations, structural variation, missing fields | Near-zero (rule-based) | Every step with structured output
Semantic | Meaning failures, fabrication, incomplete coverage | One model call per step | Steps where content accuracy matters
Regression | Quality drift over time, cross-session degradation | Comparison (near-zero) | Repeating scheduled workflows
Assertion | Specific verifiable facts, line refs, function names | Near-zero (deterministic) | Steps with verifiable factual claims

The full best practices analysis from the 192-run dataset covers threshold tuning, gate stacking order, learn loop frequency, and how the quality improvement compounds across sessions. The short version: quality gates are not a nice-to-have for production AI workflows. At any real volume, they're what makes the difference between a workflow you can trust and one you're perpetually second-guessing.


Summary

Claude Code output can't be verified by reading it. Manual review doesn't scale. Quality gates are the systematic approach to automated verification — defined criteria, consistent evaluation, informed retries, and tracked scores across runs.

The four gate types cover the full failure surface: syntax gates for structural compliance, semantic gates for meaning accuracy, regression gates for drift over time, and assertion gates for specific verifiable facts. Stacked at every step boundary, they take 83% first-run pass rates to 95%+ and eliminate the 100-task-per-day manual review wall.

The implementation overhead is the gate definitions themselves. Once those exist, the rest is tooling — and DeepWork handles the tooling.

Add Quality Gates to Your Claude Code Workflows

DeepWork implements all four gate types automatically — install in 30 seconds, define your criteria in YAML, and every step gets verified on every run.

brew tap unsupervisedcom/deepwork
brew install deepwork

Get early access for hosted workflows, team gate libraries, and scheduled runs.



Read next:

Why Claude Code Output Quality Degrades (And How to Fix It) →
Why Claude Code Fails at Scale (And How Quality Gates Fix It) →
Getting Started with DeepWork — Your First Quality-Gated AI Workflow →
Claude Code Best Practices from 192 Workflow Runs →

Questions or feedback? Open an issue on GitHub.