
Why Claude Code Fails at Scale (And How Quality Gates Fix It)

12 min read Apr 15, 2026 DeepWork

Claude Code is brilliant for one task. Try running 50 in a row — you'll see the cracks.

This isn't a Claude problem specifically. It's a structural problem with how most people use AI coding tools: as a smart clipboard. Each run is isolated, undocumented, and unvalidated. That works fine when you have one task. At scale, it compounds into something that looks like a reliability problem but is actually a systems problem.

We ran 192 Claude Code workflows through DeepWork and recorded every output against explicit quality criteria — 919 gate evaluations in total. The failure modes are consistent, measurable, and fixable. Here's what the data showed.

192 workflow runs analyzed · 919 quality gate evaluations · 3 distinct failure modes

The Scale Problem in Plain English

When you run a single Claude Code task, the failure rate is low and the failure cost is low. You get bad output, you notice, you re-run. Total time wasted: maybe 10 minutes.

When you're running production-grade, repeatable workflows — code reviews on every PR, daily report generation, data validation pipelines, test suite maintenance — the math changes. A 15% failure rate at 10 tasks per day means one or two failures you might catch. At 50 tasks per day, that's 7–8 failures daily. Most of them silent, because Claude Code's failure mode at scale isn't loud errors. It's subtly wrong output that looks correct.
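The arithmetic above can be sketched in a few lines (the 15% failure rate is the illustrative figure from the text, not a measured constant):

```python
# Back-of-envelope model of silent failures at volume.
# The 15% default mirrors the illustrative rate used in the text.
def expected_failures(tasks_per_day: int, failure_rate: float = 0.15) -> float:
    """Expected number of failed outputs per day at a given volume."""
    return tasks_per_day * failure_rate

for volume in (10, 50, 100):
    print(f"{volume} tasks/day -> ~{expected_failures(volume):.0f} expected failures")
```

At 10 tasks per day the expected count stays small enough to catch by hand; at 50 and above, the same rate produces a steady daily stream of failures that manual spot checks miss.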

The 100-task-per-day wall is where manual review stops being viable. You can't read and assess 100 AI-generated outputs per day. The sanity check that works for 5 tasks breaks somewhere around 20. Above that, you're either sampling (and missing failures between samples) or rubber-stamping.

The real cost of scale failures: It's not the re-runs. It's the downstream work built on bad output. A flawed code review that gets merged. A report with wrong data that drives a decision. A test suite that passes but doesn't cover the edge case that matters. The cost multiplies with the depth of the pipeline.


The Three Failure Modes at Scale

1. Non-Deterministic Outputs

📊 ~40% structural match rate for repeated ad-hoc prompts vs. 100% for YAML-defined jobs

Run the same Claude Code task twice with the same prompt. You will get different output. Not always substantively different — but structurally different. Different field names. Different section headers. Different levels of detail in different areas.

For a one-off task, this is fine. You got a useful result; minor variation doesn't matter. For a pipeline where the output of step 3 feeds step 4, structural variation in step 3 means step 4 gets different input every run. The pipeline that worked yesterday breaks today — not because the model failed, but because it produced perfectly valid output in a slightly different shape.

In our data, repeated ad-hoc prompts produced structurally consistent output on roughly 40% of runs. The other 60% varied enough to break downstream processing. YAML-defined workflows with explicit output specifications achieved 100% structural consistency across the same task type.

Fix: YAML job definitions with explicit output schemas

Structured job definitions force you to specify what the output should look like — field names, required sections, format constraints. The model has a contract, not just a vibe. Non-determinism drops sharply. Pipelines stop breaking on variation.

before: ad-hoc prompt — output varies by run
Run 1 output keys: ["summary", "issues", "recommendations"]
Run 2 output keys: ["overview", "findings", "next_steps"]
Run 3 output keys: ["analysis", "problems", "suggested_fixes"]

# Downstream parser breaks on run 2 and 3
# No errors thrown — just wrong data

after: YAML-defined job — output shape enforced
outputs:
  review_result:
    fields:
      - summary       # required
      - issues        # required, array
      - severity      # required, enum: [low, medium, high, critical]
      - recommendations  # required, array

Run 1 keys: ["summary", "issues", "severity", "recommendations"] ✓
Run 2 keys: ["summary", "issues", "severity", "recommendations"] ✓
Run 3 keys: ["summary", "issues", "severity", "recommendations"] ✓
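The contract idea can be sketched as a minimal output-shape check. The field names and severity enum mirror the YAML example above; the validator itself is an illustrative sketch, not DeepWork's actual implementation:

```python
# Minimal sketch of enforcing an output contract like the YAML above.
# Field names and the severity enum come from the example; the checker
# is illustrative, not DeepWork's validator.
REQUIRED_FIELDS = {"summary", "issues", "severity", "recommendations"}
SEVERITY_LEVELS = {"low", "medium", "high", "critical"}

def validate_review_result(output: dict) -> list[str]:
    """Return a list of contract violations (empty list means pass)."""
    errors = []
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if output.get("severity") not in SEVERITY_LEVELS:
        errors.append(f"invalid severity: {output.get('severity')!r}")
    for field in ("issues", "recommendations"):
        if field in output and not isinstance(output[field], list):
            errors.append(f"{field} must be an array")
    return errors

# A structurally drifted output (run 2 from the "before" example) now fails loudly
# instead of silently feeding wrong data downstream:
drifted = {"overview": "...", "findings": [], "next_steps": []}
print(validate_review_result(drifted))
```

The point is the failure mode change: a drifted output raises a visible contract violation at the step boundary rather than passing wrong keys to the next step.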
2. Context Drift

📊 23% quality degradation in sessions exceeding ~60k tokens

Context drift is Claude Code's most common silent failure mode. As a session grows, the model compresses earlier context to stay within limits. The information technically survives compaction, but nuance doesn't. Specific constraints, edge cases, and stylistic requirements get averaged out.

At scale, context drift operates on two levels. The obvious level is within a single long session: constraints specified early in the session quietly stop being enforced later. The less obvious level is across sessions: every new session starts blank. The pattern you established over 3 runs of a workflow last week doesn't carry forward to this week. You re-establish it implicitly, and it drifts slightly each time.

In 192 runs, sessions that crossed approximately 60,000 tokens showed a 23% average degradation in quality gate scores compared to equivalent tasks run in fresh, bounded sessions. The degradation pattern is non-linear: a cliff around the compaction threshold, not a gradual slope.

quality score decay by session depth
Token range      Avg gate score    Interpretation
─────────────────────────────────────────────────────
0 – 20k          94.1              No degradation
20k – 40k        92.6              Within noise
40k – 60k        88.7              Early compaction effects
60k – 80k        80.9              Significant drift
80k+             71.2              Unreliable — restart

Sample: 192 runs across 10 workflow types
Fix: SKILL.md files that load fresh context on every run

SKILL.md files externalize persistent knowledge. Instead of relying on a context window to remember your constraints, you write them into a versioned skill file that loads fresh at the start of every session. The session stays shallow. The knowledge persists. Drift disappears because there's nothing to drift from — each session starts from the same clean state.

The cross-session version of this problem is why ad-hoc Claude Code usage slowly degrades over months. Each run subtly re-interprets the standards, and without a written-down definition to anchor against, nobody notices until the drift has accumulated into something obvious. SKILL.md files are the anchor.
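The session-fresh pattern is simple enough to sketch: constraints live in a versioned file and are prepended to every run's prompt, so no session has to remember them. The loader below is a hypothetical sketch (only the SKILL.md file name comes from the article):

```python
# Sketch of session-fresh context loading: persistent constraints live in
# a versioned SKILL.md and are prepended to every run. The loader is
# illustrative, not DeepWork's implementation.
from pathlib import Path
import tempfile

def build_prompt(skill_file: Path, task: str) -> str:
    """Load the skill file fresh and prepend it to the task prompt."""
    return f"{skill_file.read_text()}\n\n---\n\nTask:\n{task}"

# Demo with a throwaway SKILL.md
with tempfile.TemporaryDirectory() as d:
    skill = Path(d) / "SKILL.md"
    skill.write_text("# Review skill\n- Flag missing null checks\n- Output a severity field")
    prompt = build_prompt(skill, "Review PR #123")
```

Because the constraints are re-read from disk on every run, editing the file changes every future session at once, and no session can drift away from a standard it never had to carry in its context window.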

3. Error Compounding

📊 3.4× more retries needed when errors are caught at final output vs. step boundaries

Multi-step workflows have a compounding problem that single-step tasks don't. An error in step 2 doesn't fail step 2 — it produces subtly wrong output that step 3 consumes as valid input. Step 3 then builds on the error. Step 4 builds on that. By the time you review the final output, you're looking at a failure that's three layers deep, and fixing it means untangling which step introduced the problem.

This compounds with the non-determinism problem. If step 2 varies between runs and there's no check on its output before step 3 runs, you get a combinatorial explosion: different errors compounding in different ways on different runs. Debugging any specific failure becomes a small investigation.

In our 919 gate evaluations, 31% of failures were caught at intermediate step boundaries before they propagated. Those failures averaged 1.1 retries to resolve. Equivalent failures caught only at the final output averaged 3.7 retries — a 3.4× multiplier on the cost of the same underlying error.

No step gates — error compounds
Step 1: parse spec ✓
Step 2: extract schema
  → produces wrong field types
  → no check, continues
Step 3: generate validation
  → builds on bad schema
  → no check, continues
Step 4: write test suite
  → validates wrong types
  → looks correct, passes

Final review:
  Tests pass but miss real bugs
  4 reruns to trace root cause

Step gates — error caught early
Step 1: parse spec ✓
Step 2: extract schema
  → produces wrong field types
  → gate fires: score 61/100
  → "field types not validated
     against spec source"
  → retry with failure context
  → corrected on retry ✓
Step 3: generate validation ✓
Step 4: write test suite ✓

Gate catches 1 error,
1 informed retry, done

The fix here is structural: gates at every step that feeds into another step. Not at the end — at each boundary where bad output would otherwise propagate silently downstream.
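The step-boundary pattern can be sketched as a small gate loop: run the step, evaluate its output, and either pass it downstream or retry with the gate's notes attached. `run_step` and `evaluate` are stand-ins for a model call and a gate evaluator; the threshold and loop shape are illustrative, not DeepWork's internals:

```python
# Sketch of a step-boundary gate: evaluate each step's output before the
# next step consumes it, retrying with the failure notes attached.
from typing import Callable

def gated_step(run_step: Callable[[str], dict],
               evaluate: Callable[[dict], tuple[int, str]],
               task: str, threshold: int = 80, max_retries: int = 2) -> dict:
    feedback = ""
    notes = ""
    for _ in range(max_retries + 1):
        output = run_step(task + feedback)
        score, notes = evaluate(output)
        if score >= threshold:
            return output  # safe to feed the next step
        # Informed retry: the next attempt sees why the last one failed.
        feedback = f"\n\nPrevious attempt failed the gate: {notes}"
    raise RuntimeError(f"gate not passed after {max_retries} retries: {notes}")

# Demo: a fake step that only succeeds once it sees the gate feedback.
attempts = []
def fake_run(prompt: str) -> dict:
    attempts.append(prompt)
    return {"ok": "failed the gate" in prompt}

def fake_eval(output: dict) -> tuple[int, str]:
    return (95, "") if output["ok"] else (61, "field types not validated against spec source")

result = gated_step(fake_run, fake_eval, "extract schema")
```

The error never reaches step 3: it is caught, explained, and corrected at the boundary where it occurred.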


Why Manual Review Doesn't Scale

The natural response to unreliable AI output is manual review. Read the output, assess quality, re-run if needed. This works. At low volume, it's actually fine.

The problem is the 100-task-per-day wall. Human review capacity for AI output tops out somewhere around 20–30 tasks before attention degrades to the point of rubber-stamping. A developer doing 30 minutes of Claude Code output review per day can meaningfully assess perhaps 15–20 artifacts. Above that, review becomes a checkbox.

Even at volumes that seem manageable, manual review has a second problem: it's not systematic. You catch what you notice. You miss what looks right on first read but is subtly wrong. The silent failure modes from context drift and error compounding are specifically the kind of failures that look correct on a fast read.

Manual review scales linearly with task volume. Quality gates scale at essentially zero marginal cost. Every task gets the same evaluation against the same criteria, regardless of volume.


How Quality Gates Fix Each Failure Mode

The data from 192 runs shows a consistent pattern: each failure mode has a corresponding structural fix, and the fixes compound just like the failures do — except in the right direction.

Failure mode: Non-deterministic outputs
  Without gates: ~40% structural consistency; pipelines break on variation
  With gates:    100% structural consistency; YAML contracts enforce output shape

Failure mode: Context drift
  Without gates: 23% degradation above 60k tokens; standards erode across sessions
  With gates:    SKILL.md files reload fresh context every run; no drift accumulation

Failure mode: Error compounding
  Without gates: 3.4× more retries; failures cascade to final output
  With gates:    31% of errors caught at boundaries; 1.1 retries average with informed feedback

The aggregate effect: workflows without quality infrastructure averaged 83% first-run pass rates across our 192 runs. Workflows with structured gates, SKILL.md definitions, and step-boundary validation averaged 95%+. That's a 12-point improvement. At scale, the gap between 83% and 95% is the difference between manageable and unmanageable.

The key insight from our data: Quality gates don't just catch errors — they change the error distribution. Instead of occasional large failures that require significant rework, you get frequent small failures that resolve in a single informed retry. The total cost of failures goes down even though you're evaluating more outputs. You're trading expensive late-stage failures for cheap early-stage corrections.

What "informed retry" means in practice

When a quality gate fires, DeepWork doesn't just re-run the step. It injects the specific failure reasoning into the retry prompt — which criteria failed, why the evaluator scored them low, and what the output got wrong. The retry model has a targeted correction, not a blank re-attempt.

This is why step-boundary gates have a 91% single-retry resolution rate in our data. Blind retries don't converge on correct output because they don't have information about what made the original output wrong. Informed retries do.

informed retry injection — actual gate failure context
Gate evaluation: check-schema-extraction
Score: 58/100

Failed criteria:
  ✗ "All field types validated against spec source" (score: 42)
    → Evaluator note: "Optional fields marked as required.
       cross-reference with spec lines 14-31 not performed."
  ✗ "Nested object schemas fully expanded" (score: 61)
    → Evaluator note: "addressSchema referenced but not expanded.
       downstream steps will have incomplete type data."

Injecting failure context into retry...
Retry prompt includes:
  - Original task instructions
  - Failed criteria with evaluator notes
  - Specific lines in spec to re-examine

Retry result: score 94/100 ✓
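Assembling an informed-retry prompt from that failure context is mechanical, which is what makes it cheap. The sketch below mirrors the structure of the transcript above; the function and data shape are hypothetical, not DeepWork's API:

```python
# Sketch of building an informed-retry prompt from gate failure context.
# The data shape mirrors the transcript above; the function is illustrative.
def build_retry_prompt(original_task: str, failed_criteria: list[dict]) -> str:
    lines = [original_task, "", "The previous attempt failed these quality criteria:"]
    for c in failed_criteria:
        lines.append(f"- {c['criterion']} (score {c['score']}): {c['note']}")
    lines.append("")
    lines.append("Address each note specifically before producing the new output.")
    return "\n".join(lines)

retry = build_retry_prompt(
    "Extract the schema from the spec",
    [{"criterion": "All field types validated against spec source",
      "score": 42,
      "note": "Optional fields marked as required; cross-reference with spec lines 14-31."}],
)
```

The retry model gets a targeted correction with the evaluator's reasoning inline, which is why it converges where a blind re-run would not.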

The Compounding Effect Works Both Ways

Error compounding is destructive. But the same compounding dynamic works constructively when you add learn loops.

After a set of workflow runs, DeepWork's learn command analyzes which criteria triggered retries most frequently, extracts the patterns from successful corrections, and updates the SKILL.md files automatically. The next run starts from improved instructions that bake in the lessons from previous failures.

In our data, workflows in their first learn cycle averaged 85% first-run pass rates. After two cycles on the same workflow: 93%+. After four cycles: near 96%. The improvement plateaus as you approach the practical ceiling for the task type, but the early cycles are steep.
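The core of a learn-style analysis is just frequency counting over retry records: which criteria fire most often across runs, so the worst offenders can be folded back into the skill file. The data shape below is hypothetical; this is a sketch of the analysis step, not DeepWork's learn command:

```python
# Sketch of a learn-loop analysis: count which gate criteria triggered
# retries most often across runs. Data shape is hypothetical.
from collections import Counter

def retry_hotspots(runs: list[list[str]], top: int = 3) -> list[tuple[str, int]]:
    """runs: per-run lists of criteria that triggered a retry."""
    counts = Counter(criterion for run in runs for criterion in run)
    return counts.most_common(top)

runs = [
    ["field types validated", "nested schemas expanded"],
    ["field types validated"],
    ["field types validated", "severity enum respected"],
]
print(retry_hotspots(runs))
# "field types validated" tops the list with 3 hits, so that constraint
# is the first candidate to strengthen in the SKILL.md
```

From there, the fix is editorial: rewrite the hotspot criterion's instructions in the skill file using the phrasing from runs that passed, and the next cycle starts from a stronger baseline.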

This is the inversion of the scale problem. Without quality infrastructure, running more tasks at higher frequency produces more failures and more drift. With quality gates and learn loops, running more tasks at higher frequency produces better quality over time — because the system learns from each run.


Summary: The Scale Gap

Claude Code doesn't fail at scale because the model degrades. It fails because the workflows around it aren't designed for scale. No output contracts means non-determinism breaks pipelines. No persistent knowledge means context drift erodes standards across sessions. No step validation means errors compound into expensive late-stage failures.

Each failure mode has a direct fix: YAML job definitions for non-determinism, SKILL.md files for context drift, step-boundary quality gates for error compounding. Together they take 83% first-run pass rates to 95%+, eliminate the 100-task-per-day manual review wall, and turn repeated runs into an improvement mechanism rather than a reliability risk.

Problem: Non-deterministic outputs
  Fix:    YAML job definitions with output schemas
  Impact: 40% → 100% structural consistency

Problem: Context drift
  Fix:    SKILL.md files with session-fresh load
  Impact: Eliminates 23% token-depth degradation

Problem: Error compounding
  Fix:    Step-boundary quality gates + informed retries
  Impact: 3.4× fewer retries; 83% → 95%+ pass rate

Problem: Manual review ceiling
  Fix:    Automated gate evaluation at every step
  Impact: Zero marginal cost per additional task

Fix Claude Code Scale Failures with DeepWork

DeepWork implements all three fixes: YAML job definitions, SKILL.md context management, and step-boundary quality gates. Install in 30 seconds and run your first gated workflow.

brew tap unsupervisedcom/deepwork
brew install deepwork

Get early access for hosted workflows, team skill libraries, and scheduled runs:



Read next:

Why Claude Code Output Quality Degrades (And How to Fix It) →
Claude Code Best Practices from 192 Workflow Runs →
How to Build Repeatable Claude Code Workflows with Quality Gates →
5 DeepWork Workflows That Replace Manual Claude Code Babysitting →

Questions or feedback? Open an issue on GitHub.