Claude Code Best Practices from 192 Workflow Runs
We ran 192 Claude Code workflows through DeepWork and recorded every quality gate evaluation — 919 in total. We tracked first-run pass rates, retry patterns, failure modes, and which structural choices correlated with consistent output quality.
Most Claude Code tips online are vibes-based. "Be specific in your prompts." "Break tasks into steps." Useful intuitions, but nothing you can measure. What follows is what the data actually showed.
Five practices separated the workflows with consistent 95%+ pass rates from the ones that churned through retries or produced output that failed silently.
Define Quality Gates Before You Write the Prompt
The single biggest predictor of output quality wasn't prompt length, model temperature, or context size. It was whether the success criteria were written down before the task ran.
When success criteria are vague — "write good tests," "document thoroughly" — Claude optimizes for the appearance of the task, not its substance. It writes tests that execute. It writes docs that are present. The output looks done from the outside but fails where it counts.
Workflows where quality criteria were explicit and machine-evaluated before the next step started averaged 95%+ first-run pass rates. Workflows relying on implicit standards averaged 83% — and the 17% that failed consumed the most time to diagnose because the failure mode was subtle.
The practical rule: before you write the task instructions, write the gate. If you can't articulate specific, falsifiable criteria, the task definition is incomplete.
What "specific and falsifiable" looks like
```yaml
# Vague: scored by vibes
quality_gate:
  criteria:
    - "Tests are comprehensive"
    - "Code is well-documented"
    - "Error handling is proper"
  threshold: 85
```

```yaml
# Specific: actually evaluable
quality_gate:
  criteria:
    - "Every exported function has at least one test"
    - "Every function has JSDoc with @param and @returns"
    - "All async functions have try/catch, no silent .catch()"
  threshold: 85
```
Vague criteria get scored by another model that also optimizes for appearance. Specific criteria can actually be evaluated. The first block above produces gate scores of 82–88 regardless of actual output quality. The second scores 94 when the criteria are met and 61 when they're not, which is exactly what you want from a gate.
Batch Over Marathon Sessions — Context Windows Have a Cliff
Context compaction is Claude Code's most common silent failure mode. As sessions grow, the model compresses earlier context to stay within limits. The information technically survives, but nuance doesn't — specific constraints, edge cases, and style requirements get averaged out.
In our runs, sessions that crossed approximately 60,000 tokens showed a 23% average degradation in quality gate scores compared to equivalent tasks run in fresh sessions. The degradation was non-linear: small sessions were unaffected, and the cliff was steep rather than gradual.
The fix is batching, not prompting harder. Break work into discrete jobs, each with its own fresh context. Persistent knowledge lives in SKILL.md files that load at the start of each session — not in a context window that grows until it compacts.
Practical signal: If a Claude Code session has been running for 30+ minutes on a complex task and you start noticing it "forgetting" earlier constraints, you've likely crossed the compaction threshold. Don't re-prompt. End the session, write down what was learned, and start fresh with that knowledge in the task definition.
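As a sketch of what that restart can look like, here is a hypothetical follow-up job that bakes the lessons from the abandoned session into its instructions, so the fresh context starts with the constraints instead of rediscovering them. The job format follows the YAML definitions covered below; the step name, file names, and constraints are purely illustrative.

```yaml
# Hypothetical follow-up job: constraints learned in the long session
# are written into the instructions, not left in a compacted context.
workflows:
  continue-refactor:
    steps:
      - name: apply-learned-constraints
        instructions: |
          Continue the auth-module refactor. Constraints confirmed in the
          previous session (do not rediscover or relax these):
          - The public API of session.ts must not change
          - New errors extend AppError; never throw raw strings
          - Reuse the fixtures in test/fixtures/auth for all new tests
        inputs:
          auth_module: {}
        outputs:
          refactored_module: {}
```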
The session length vs. quality curve
| Token range | Avg gate score | Notes |
|---|---|---|
| 0–20k | 94.2 | Consistent, no degradation |
| 20k–40k | 92.8 | Minor noise, within variance |
| 40k–60k | 89.1 | Early compaction effects |
| 60k–80k | 81.3 | Significant degradation |
| 80k+ | 71.6 | Unreliable — restart recommended |

Sample: 192 runs, 919 gate evaluations
The implication: tasks you expect to be long (>40k tokens) should be broken into sub-tasks with explicit handoffs. Each sub-task carries only what the next step needs. Context stays shallow; quality stays high.
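In job-definition terms, a handoff is just the next step's inputs naming only the artifact it needs; everything else from the previous step's context is deliberately left behind. A minimal sketch, using the same inputs/outputs fields as the full example in the next section; the step and artifact names are illustrative:

```yaml
steps:
  - name: analyze-schema
    outputs:
      schema_findings: {}     # the only artifact the next step needs
  - name: write-migrations
    inputs:
      schema_findings: {}     # fresh context plus this handoff, nothing else
    outputs:
      migration_files: {}
```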
Use YAML Job Definitions, Not Ad-Hoc Prompts
Ad-hoc Claude Code sessions are not workflows. They're conversations. Every run is a fresh negotiation between you and the model about what "done" means, and it produces slightly different results every time. That's fine for exploration. It's a problem for anything you run more than once.
When we ran the same task as a structured YAML job definition versus a reconstructed ad-hoc prompt, the YAML version produced consistent output on 100% of runs. The ad-hoc version matched the expected structure on roughly 40% of runs — not because the model was wrong, but because the definition of "expected" was baked into the person running it, not the job file.
A structured job definition forces you to be explicit about inputs, outputs, and step boundaries. That explicitness is what makes it reproducible.
Anatomy of a reproducible job
```yaml
# job.yml — the contract, not just the instructions
name: api-audit
summary: "Audit REST API for consistency and completeness"

step_arguments:
  - name: openapi_spec
    description: "Path to OpenAPI specification file"
    type: string

workflows:
  audit:
    steps:
      - name: parse-endpoints
        instructions: |
          Parse the OpenAPI spec and extract:
          - All endpoints with HTTP methods and paths
          - Required vs optional parameters per endpoint
          - Response codes and their schemas
          Output a structured endpoint inventory.
        inputs:
          openapi_spec: {}
        outputs:
          endpoint_inventory: {}

      - name: check-consistency
        instructions: |
          Analyze the endpoint inventory for:
          - Inconsistent naming conventions (camelCase vs snake_case)
          - Missing error response codes (400, 401, 403, 404, 500)
          - Undocumented response schemas
          - Endpoints with no parameter descriptions
          Output specific findings per endpoint.
        inputs:
          endpoint_inventory: {}
        outputs:
          findings: {}
        quality_gate:
          criteria:
            - "Every finding references a specific endpoint path"
            - "Missing error codes listed per endpoint, not globally"
            - "Naming inconsistencies cite both formats found"
          threshold: 88
          max_retries: 2
```
The job file is the source of truth for what the workflow does. Any team member, any machine, any day — same inputs produce the same type of output. That's not achievable with a prompt living in someone's clipboard history.
Implement Learn Loops — Quality Compounds, Not Decays
Most teams using Claude Code are stuck in a loop: run the task, get mediocre output, re-prompt, get slightly better output, repeat. Each run starts from zero. Nothing improves permanently.
Learn loops break this pattern. After each workflow run, DeepWork analyzes which criteria triggered retries, what failure reasoning was injected, and which corrections worked. It updates the SKILL.md files automatically — tightening criteria, adding missed edge cases, improving step instructions based on actual failure evidence.
The effect is real and measurable. Workflows running their first learn cycle had an average first-run pass rate of ~85%. After two learn cycles on the same workflow, that climbed to 93%+. After four cycles, it plateaued near 96% — near the practical ceiling for the task type.
```
$ deepwork learn api-audit --last-n 5

Analyzing 5 runs (23 gate evaluations)...

Common retry triggers:
  → "Every finding references a specific endpoint path"
    Failed 3/5 runs — findings cited resource types, not paths
    Updating step instructions to be more explicit...
  → "Naming inconsistencies cite both formats found"
    Failed 2/5 runs — only cited the wrong format
    Adding example format to criteria description...

SKILL.md updates:
  ✓ check-consistency: 2 instruction clarifications
  ✓ check-consistency: 1 criteria description expanded
  ✓ parse-endpoints: 1 output format made explicit

First-run pass rate (last 5 runs): 81%
Projected next-run pass rate: 91%+

Learn complete.
```
The asymmetry matters: without learn loops, quality is stuck at wherever you started. With learn loops, each run makes the next run cheaper and more reliable. The workflow pays for itself after a handful of runs.
Run learn cycles on any workflow you use more than twice. The marginal cost is low (one extra command); the compounding effect over 10–20 runs is significant.
Validate Outputs at Every Step Boundary, Not Just the End
The obvious place to check quality is at the end of a workflow. You look at the final output and decide if it's good. The problem: by the time you're evaluating the final output, three intermediate steps have built on top of whatever was wrong in step 2. A bad schema analysis leads to a bad transformation leads to a bad validation report. Fixing the final output requires understanding what broke upstream.
In our 919 gate evaluations, 31% of failures were caught at intermediate step boundaries — before they had a chance to cascade. In those cases, the retry happened where the error originated, with specific failure context injected directly into the retry prompt. The correction was cheap: usually one retry, targeted fix, move on.
The same errors caught at the final output required either full reruns or manual correction of intermediate artifacts. On average, late-caught failures consumed 3.4x more retries than step-boundary failures.
Where to put gates vs. where not to
| Step type | Gate? | Reasoning |
|---|---|---|
| Analysis/scope | Optional | Low-risk; errors are shallow and easy to catch later |
| Core generation (transform, write code, draft content) | Yes | Errors here compound into everything downstream |
| Verification/synthesis | Yes | Final quality check; catches gaps in coverage or accuracy |
| Formatting/output | Optional | Surface-level; rarely introduces correctness issues |
The rule of thumb: put gates on steps that feed into other steps and where errors are hard to spot in the final output. Skip gates on steps where a failure is obvious or where the output is purely additive.
On retry injection: When a gate fails, DeepWork feeds the specific failure reasoning into the retry prompt — not just "try again." The retry knows exactly which criteria failed and why. This is why step-boundary gates have a 91% single-retry resolution rate in our data. Blind retries don't converge. Informed retries do.
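Applied to a concrete workflow, that placement logic looks roughly like the sketch below, which reuses the quality_gate fields from the api-audit job above; the step names, criteria, and threshold are illustrative.

```yaml
# Gate the core generation step; its errors compound downstream.
- name: draft-migration-plan
  instructions: |
    Produce a migration plan for every endpoint flagged in the findings.
  inputs:
    findings: {}
  outputs:
    migration_plan: {}
  quality_gate:
    criteria:
      - "Every plan entry names the endpoint path it migrates"
      - "Breaking changes are flagged explicitly per endpoint"
    threshold: 88
    max_retries: 2   # informed retries: failure reasoning is fed into the retry prompt

# No gate on the formatting step; a bad report is obvious at a glance.
- name: format-report
  instructions: |
    Render the migration plan as a markdown report.
  inputs:
    migration_plan: {}
  outputs:
    report: {}
```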
What the Data Says About What Doesn't Work
Three patterns showed up consistently in the low-performing runs:
- Overloading the gate with criteria. Gates with more than 6 criteria per step had lower discrimination power. The model evaluating the gate was effectively averaging across too many dimensions, and borderline outputs clustered near the threshold. Keep gates focused: 3–5 sharp criteria beat 8 vague ones.
- Setting the threshold too low to avoid retries. Teams that set thresholds at 70–75 to "reduce friction" saw lower end-to-end quality. A gate that doesn't fail isn't a gate — it's theater. Set thresholds where failures actually mean something (85–90 for most workflows).
- Treating SKILL.md files as static documentation. Skills that were written once and never updated gradually fell behind actual workflow behavior. The teams whose workflows improved over time ran learn cycles. The ones that didn't update skills saw quality plateau or drift.
Summary
| Practice | What it fixes | Data signal |
|---|---|---|
| Define gates before prompts | Silent quality failures | 83% → 95%+ first-run pass rate |
| Batch over marathon sessions | Context compaction degradation | 23% quality drop above ~60k tokens |
| YAML job definitions | Irreproducible ad-hoc outputs | 100% vs ~40% reproducibility |
| Learn loops | Quality stuck at baseline | 85% → 93%+ after 2 cycles |
| Step-boundary validation | Cascading failures | 31% of failures caught before propagating |
None of these are exotic. They're engineering practices applied to AI workflows: define your acceptance criteria, keep units of work small, externalize configuration, measure and improve, validate at boundaries. The data confirms they work.
Try These Practices on Your Own Workflows
DeepWork implements all five of these patterns: quality gates defined up front, discrete jobs that keep context fresh, YAML job definitions, learn cycles, and step-boundary validation. Install it in 30 seconds.
```
brew tap unsupervisedcom/deepwork
brew install deepwork
```
Join Early Access
Get early access to hosted workflows, scheduled runs, and team skill sharing.
Questions or feedback? Open an issue on GitHub.