Getting Started with DeepWork — Your First Quality-Gated AI Workflow
DeepWork is a CLI tool that adds quality gates to multi-step Claude Code workflows — so every step is validated before the next one runs, failed outputs trigger automatic retries, and your workflow definitions persist as plain SKILL.md files you can version and share.
This is a hands-on walkthrough: install, create your first workflow, add quality gates, and run a real example. You'll have a working quality-gated workflow in under 15 minutes.
Install
DeepWork ships as a Homebrew package. Two commands:
brew tap unsupervisedcom/deepwork
brew install deepwork
Verify it's working:
$ deepwork --version
deepwork 0.4.2
$ deepwork --help
Usage: deepwork <command> [options]
Commands:
  init <workflow-name>    Scaffold a new workflow
  run <workflow-name>     Execute a workflow
  learn <workflow-name>   Run a learn cycle on recent outputs
  list                    List all local workflows
That's it. No Docker, no API keys to configure, no cloud accounts. DeepWork runs locally and calls Claude Code on your machine. The only requirement is an active Claude Code session (or ANTHROPIC_API_KEY set in your environment).
Your First Workflow
DeepWork workflows are defined by two files: a job.yml that describes the steps and wires them together, and one or more SKILL.md files that define what each step should do and what "good output" looks like.
Let's build a three-step workflow: Write a function → Add tests → Review for edge cases. It's minimal but complete, and it demonstrates every core concept.
Scaffold the workflow
$ deepwork init write-and-test
✓ Created .deepwork/write-and-test/
✓ Created .deepwork/write-and-test/job.yml
✓ Created .deepwork/write-and-test/skills/
✓ Created .deepwork/write-and-test/skills/implement/SKILL.md
✓ Created .deepwork/write-and-test/skills/write-tests/SKILL.md
✓ Created .deepwork/write-and-test/skills/edge-case-review/SKILL.md
Edit job.yml to define your steps, then:
deepwork run write-and-test
The scaffolded job.yml looks like this:
# .deepwork/write-and-test/job.yml
name: write-and-test
description: Write a function, add tests, review for edge cases
steps:
  - name: implement
    skill: implement
    input:
      task: "{{ inputs.task }}"
  - name: write-tests
    skill: write-tests
    input:
      implementation: "{{ steps.implement.output }}"
  - name: edge-case-review
    skill: edge-case-review
    input:
      implementation: "{{ steps.implement.output }}"
      tests: "{{ steps.write-tests.output }}"
The {{ inputs.task }} syntax is a template variable — you'll pass the actual task description at runtime. The {{ steps.implement.output }} syntax wires the output from the implement step directly into the write-tests step's input. No copy-pasting, no re-prompting.
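To make the substitution concrete: if you run the workflow with a task like "Parse a semver string" (an invented example, not from an actual run), the implement step would receive a resolved input along these lines:

```yaml
# Hypothetical resolved input for the implement step after
# template substitution (task text invented for illustration)
input:
  task: "Parse a semver string"
```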
Define the SKILL.md files
Each skill has a YAML frontmatter section that defines quality criteria, and a Markdown body that gives Claude its instructions. Here's the implement skill:
# .deepwork/write-and-test/skills/implement/SKILL.md
---
name: implement
description: Write a clean, production-ready function
quality_criteria:
- Function handles null and undefined inputs explicitly
- TypeScript types are present on all parameters and return values
- No bare throws — errors return Result types or throw typed errors
- Implementation is under 50 lines
---
You are implementing a function based on the task description.
Write a complete, production-ready implementation:
1. Define the function signature with full TypeScript types
2. Handle all edge cases including null, undefined, and out-of-range inputs
3. Use descriptive variable names — no single-letter variables except loop indices
4. Add a JSDoc comment explaining what the function does and its parameters
Output the implementation as a TypeScript code block only. No explanation,
no preamble — just the code.
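To make the criteria concrete, here is the kind of output that would clear this gate. It's a hypothetical example for an invented "apply a discount" task, not output from an actual run:

```typescript
// A Result type so errors are returned as values rather than bare throws
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

/**
 * Applies a percentage discount to a price.
 * @param price - Original price; must be a non-negative, finite number.
 * @param rate - Discount rate between 0 and 1 inclusive.
 * @returns A Result holding the discounted price, or an error message.
 */
function applyDiscount(
  price: number | null | undefined,
  rate: number,
): Result<number> {
  // Null and undefined are rejected explicitly rather than producing NaN
  if (price === null || price === undefined) {
    return { ok: false, error: "price must not be null or undefined" };
  }
  if (!Number.isFinite(price) || price < 0) {
    return { ok: false, error: "price must be a non-negative finite number" };
  }
  if (rate < 0 || rate > 1) {
    return { ok: false, error: "rate must be between 0 and 1" };
  }
  return { ok: true, value: price * (1 - rate) };
}
```

Every parameter and the return value carry explicit types, null and undefined are handled up front, and errors come back as a Result instead of a bare throw, matching the frontmatter criteria above.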
The write-tests skill:
# .deepwork/write-and-test/skills/write-tests/SKILL.md
---
name: write-tests
description: Write comprehensive unit tests for the implementation
quality_criteria:
- Happy path test present
- Null/undefined input tests present
- At least one edge case test (empty array, zero, boundary value)
- Test descriptions are specific, not generic ("should work")
---
You are writing unit tests for the implementation provided.
Write tests using Vitest syntax (describe/it/expect).
Cover:
1. The primary happy-path case
2. Null and undefined inputs
3. At least two edge cases relevant to the function's logic
4. Any error paths if the function can throw or return an error
Output only the test code as a TypeScript code block.
And the edge-case-review skill:
# .deepwork/write-and-test/skills/edge-case-review/SKILL.md
---
name: edge-case-review
description: Review implementation and tests for missed edge cases
quality_criteria:
- Review explicitly addresses concurrency if relevant
- Review explicitly addresses large input handling
- All uncovered edge cases are flagged with specific examples
- A final score (0-100) is provided with reasoning
---
You are reviewing a function implementation and its test suite for edge cases.
Analyze both the implementation and tests together:
1. Identify any inputs that would cause incorrect behavior not covered by tests
2. Check for integer overflow, off-by-one errors, or empty collection handling
3. Note any assumptions the implementation makes that aren't enforced or tested
4. List uncovered edge cases with concrete example inputs
End your review with: SCORE: [0-100] — [one-sentence reasoning]
Quality Gates in Action
Without quality gates, step 2 runs regardless of whether step 1 produced good output. Add a quality_gate block to any step and DeepWork will evaluate the output before proceeding — retrying automatically if it doesn't pass.
Here's the job.yml with gates added:
name: write-and-test
description: Write a function, add tests, review for edge cases
steps:
  - name: implement
    skill: implement
    input:
      task: "{{ inputs.task }}"
    quality_gate:
      criteria:
        - "TypeScript types present on all parameters and return value"
        - "Null and undefined inputs handled explicitly"
        - "No bare throw statements"
      threshold: 85
      max_retries: 2
  - name: write-tests
    skill: write-tests
    input:
      implementation: "{{ steps.implement.output }}"
    quality_gate:
      criteria:
        - "Happy path test present"
        - "Null/undefined input tests present"
        - "At least one edge case test"
      threshold: 85
      max_retries: 2
  - name: edge-case-review
    skill: edge-case-review
    input:
      implementation: "{{ steps.implement.output }}"
      tests: "{{ steps.write-tests.output }}"
The gate evaluates the step's output against each criterion and produces a score from 0–100. If the score is below threshold, DeepWork retries the step up to max_retries times — passing the failure reasoning back into the next attempt so Claude can correct specifically what was wrong.
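The evaluate-and-retry loop described above can be sketched in a few lines. This is a simplified model of the behavior, not DeepWork's actual implementation; the function names and types are invented for illustration:

```typescript
interface GateResult {
  score: number;      // 0-100, from evaluating the output against the criteria
  failures: string[]; // reasoning for each criterion that did not pass
}

// Hypothetical sketch: runStep produces a step's output (optionally seeded
// with failure feedback), evaluate scores it against the gate criteria.
function runGatedStep(
  runStep: (feedback?: string) => string,
  evaluate: (output: string) => GateResult,
  threshold: number,
  maxRetries: number,
): string {
  let feedback: string | undefined;
  // One initial attempt plus up to maxRetries retries
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const output = runStep(feedback);
    const gate = evaluate(output);
    if (gate.score >= threshold) return output; // gate passed, move on
    // Failure reasoning is carried into the next attempt's prompt
    feedback = gate.failures.join("\n");
  }
  throw new Error(`gate failed after ${maxRetries + 1} attempts`);
}
```

The important design choice is that feedback flows forward: each retry sees the previous attempt's failure reasoning instead of starting from the same prompt.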
What a gate failure looks like
Step: implement
Score: 61/100
Failed criteria:
  ✗ TypeScript types present on all parameters and return value
    → found: (items) => { ... }
    → missing return type annotation
  ✗ Null and undefined inputs handled
    → no null check on `items` param

Retrying (attempt 2/3)...

Step: implement
Score: 91/100
  ✓ TypeScript types present on all parameters and return value
  ✓ Null and undefined inputs handled
  ✓ No bare throw statements

Passing output to: write-tests
The key detail: when a gate fails, DeepWork passes the failure reasoning back into the retry prompt. Claude doesn't just re-run the same prompt — it sees exactly what was wrong and corrects it. First-run pass rates on well-defined workflows land around 83–87%. With retry logic, effective pass rates climb to 92–96%.
You're reviewing outputs that already cleared a quality bar — not outputs that might be fine or might have missed half your criteria. For more on why this matters, see why Claude Code output quality degrades without enforcement.
Real Example: Refactor a Legacy Function with Test Coverage
Here's a practical workflow you'd actually use. The task: take a legacy JavaScript function with no types and no tests, refactor it to TypeScript, and add test coverage.
The job.yml definition
# .deepwork/refactor-legacy/job.yml
name: refactor-legacy
description: Refactor a legacy JS function to TypeScript with tests
steps:
  - name: analyze
    skill: analyze-legacy
    input:
      code: "{{ inputs.code }}"
  - name: refactor
    skill: ts-refactor
    input:
      code: "{{ inputs.code }}"
      analysis: "{{ steps.analyze.output }}"
    quality_gate:
      criteria:
        - "All parameters have TypeScript type annotations"
        - "Return type is explicitly declared"
        - "Original logic is preserved — no behavior changes"
        - "Magic numbers replaced with named constants"
      threshold: 88
      max_retries: 3
  - name: add-tests
    skill: write-tests
    input:
      implementation: "{{ steps.refactor.output }}"
    quality_gate:
      criteria:
        - "Tests cover the happy path"
        - "Tests cover at least 2 edge cases from the analysis"
        - "Tests use the TypeScript types from the refactored version"
      threshold: 85
      max_retries: 2
Run it
$ deepwork run refactor-legacy \
--code "$(cat src/utils/calculate-discount.js)"
Running: refactor-legacy
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[1/3] analyze
✓ Completed — 3 edge cases identified, 2 magic numbers flagged
[2/3] refactor
Gate evaluation... score 74/100
✗ Magic numbers replaced with named constants → found: 0.15, 0.20 inline
Retry 1/3 with failure context...
Gate evaluation... score 92/100
✓ Passed
[3/3] add-tests
Gate evaluation... score 89/100
✓ Passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Workflow complete
Steps: 3/3 passed
Retries: 1 (refactor step, magic numbers)
Output: .deepwork/refactor-legacy/output/latest/
The output lands in .deepwork/refactor-legacy/output/latest/ — one file per step, plus a combined summary. The retry on the refactor step caught the inline magic numbers that the first attempt missed. You didn't have to re-run anything manually.
The scored output
The final output directory includes each step's result and a summary.md with the gate scores and what each retry corrected:
# summary.md
Workflow: refactor-legacy
Run: 2026-03-29T14:31:07Z
## Steps
### analyze — ✓ PASSED (no gate)
- 3 edge cases identified: negative inputs, zero quantity, discount > 1.0
- 2 magic numbers flagged: 0.15 (standard rate), 0.20 (premium rate)
### refactor — ✓ PASSED (attempt 2/4, score 92)
- Attempt 1 score: 74 — magic numbers not replaced
- Attempt 2 score: 92 — DISCOUNT_RATE_STANDARD, DISCOUNT_RATE_PREMIUM extracted
### add-tests — ✓ PASSED (score 89)
- 5 tests written: happy path, negative input, zero quantity, discount > 1.0, discount = 0
The learn loop: After a few runs, execute deepwork learn refactor-legacy --last-n 5. DeepWork reviews what triggered retries and updates the SKILL.md files automatically, so the magic-number check gets baked into the quality criteria and the first-attempt pass rate climbs. The result is a workflow that self-improves without manual skill-file maintenance.
What to Build First
The fastest way to see value: convert the Claude Code workflow you've been running manually most often. The one where you find yourself re-prompting the same corrections, re-explaining the same standards, catching the same missed edge cases.
- PR review workflows — Define your team's standards once in SKILL.md; every review hits the same bar
- Code generation — Generate + test + review as a single gated sequence instead of three separate sessions
- Documentation generation — Ensure generated docs always cover parameters, return values, and error conditions
- Data transformation scripts — Validate that transformation logic handles nulls, empty arrays, and type coercions before you ship it
If you've run the same Claude Code workflow manually more than twice, it's worth converting. Setup time pays back on the third run. For the full case on why the manual approach breaks down under repetition, read DeepWork vs. Manual Claude Code Workflows.
Install DeepWork and Run Your First Workflow
Two commands to install. Your first quality-gated workflow in under 15 minutes.
brew tap unsupervisedcom/deepwork
brew install deepwork
View on GitHub
Join Early Access
Or get early access for hosted workflows, scheduled runs, and team skill sharing.
Questions or feedback? Open an issue on GitHub.