
5 DeepWork Workflows That Replace Manual Claude Code Babysitting

14 min read · Apr 5, 2026 · DeepWork

You know the pattern. Start a Claude Code session, give it a task, watch it go off the rails, re-prompt, correct, re-prompt again. For any non-trivial workflow, you're spending more time supervising the AI than you would doing the work yourself.

DeepWork eliminates the babysitting. You define the workflow once — steps, quality criteria, retry logic — and it runs autonomously. Each step is gated: output gets evaluated against your criteria before the next step starts. Failures trigger automatic retries with the failure reasoning fed back in.

Here are five production workflows you can copy, adapt, and run in under 10 minutes each. Each one replaces a manual process that currently eats your time.

Prerequisites: Install DeepWork (brew tap unsupervisedcom/deepwork && brew install deepwork) and have an active Claude Code session. That's it. If you haven't installed yet, the Getting Started tutorial walks through setup in 2 minutes.


1. Automated Code Review

The problem

You ask Claude to review a PR. It produces generic feedback — "looks good," "consider adding error handling" — missing your team's actual standards every time.

DeepWork solution

Codify your review standards in a job.yml and SKILL.md. Every review hits the same bar, every time, with specific line-level findings.

The job.yml

# .deepwork/code-review/job.yml
name: code-review
summary: "Review code changes against team standards with actionable findings"

step_arguments:
  - name: diff
    description: "Git diff or file path to review"
    type: string

workflows:
  review:
    summary: "Full code review with standards enforcement"
    steps:
      - name: scan-changes
        instructions: |
          Parse the diff. Identify:
          - Files changed and their types (source, test, config)
          - Functions added or modified
          - Lines added vs removed
          Output a structured change summary.
        inputs:
          diff: {}
        outputs:
          change_summary: {}

      - name: check-standards
        instructions: |
          Review each changed file against team standards.
          Check for:
          1. Missing error handling on async operations
          2. Untyped function parameters or return values
          3. Console.log statements left in production code
          4. Magic numbers without named constants
          5. Functions exceeding 40 lines
          Output specific findings with file:line references.
        inputs:
          change_summary: {}
          diff: {}
        outputs:
          findings: {}

      - name: synthesize-review
        instructions: |
          Produce a review summary:
          - Severity per finding (critical/warning/suggestion)
          - Grouped by file
          - Overall assessment: approve, request-changes, or needs-discussion
          - Max 3 action items for the author
          Format as markdown suitable for a PR comment.
        inputs:
          findings: {}
          change_summary: {}
        outputs:
          review: {}

The SKILL.md for standards enforcement

# .deepwork/code-review/skills/check-standards/SKILL.md
---
name: check-standards
description: Enforce team coding standards on changed files
quality_criteria:
  - Every finding includes a file path and line number
  - No generic feedback — each finding cites specific code
  - False positive rate below 10% (findings must be real issues)
  - Critical findings are actual bugs, not style preferences
---

You are reviewing code changes against team standards.

For each changed file, check:
1. **Error handling**: Every async/await has try-catch or .catch()
2. **Type safety**: All function params and returns have TypeScript types
3. **No debug artifacts**: No console.log, debugger, or TODO comments
4. **Constants**: No magic numbers — extract to named constants
5. **Function length**: Flag any function over 40 lines

For each finding, output:
- File path and line number
- The problematic code snippet (max 3 lines)
- Why it's a problem
- Suggested fix (1 sentence)

Skip files that have zero findings. Do not pad output.
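
To make those criteria concrete, here is a hypothetical snippet that would trip three of them at once. The file name and code are invented for illustration:

// Hypothetical src/services/checkout.ts: this snippet would produce three
// findings (untyped parameter, debug artifact, magic number without a constant).
export async function applyFee(amount) {
  console.log('applying fee', amount); // debug artifact left in production code
  return amount * 1.0299;              // magic number, should be a named constant
}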

What it looks like

terminal
$ deepwork run code-review --diff "$(git diff main..HEAD)"

Running: code-review
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[1/3] scan-changes
  ✓ 4 files changed, 3 source, 1 test — 127 lines added

[2/3] check-standards
  Gate evaluation... score 94/100
  ✓ Passed — 6 findings across 2 files

[3/3] synthesize-review
  ✓ Assessment: request-changes (2 critical, 3 warnings, 1 suggestion)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Review complete — output: .deepwork/code-review/output/latest/

Expected outcome: A markdown review with line-level findings, grouped by file, with severity tags. Paste it directly into your PR as a comment. No more "generally looks good" reviews — every review is grounded in your defined standards.


2. Research Report Generation

The problem

You ask Claude to research a topic. It gives you a wall of text with no sources, no structure, and you can't tell what's hallucinated vs. verified.

DeepWork solution

A multi-step pipeline that scopes the research, gathers sources with credibility assessments, then synthesizes a structured report with citations.

The job.yml

# .deepwork/research/job.yml
name: research
summary: "Multi-step research with sourced findings and structured report"

step_arguments:
  - name: topic
    description: "Research topic or question"
    type: string

workflows:
  research:
    summary: "Full research pipeline with source verification"
    steps:
      - name: scope
        instructions: |
          Define research scope:
          - Primary question to answer
          - 3-5 sub-questions that support the primary
          - Search strategy (keywords, platforms, date range)
          - What would make the research "complete"
        inputs:
          topic: {}
        outputs:
          scope: {}

      - name: gather
        instructions: |
          Collect at least 8 diverse sources:
          - Mix of official docs, blog posts, GitHub repos, forums
          - For each source: URL, key finding, credibility (high/medium/low)
          - Flag any conflicting information between sources
          - Note gaps — questions with no good sources found
        inputs:
          scope: {}
        outputs:
          sources: {}
        quality_gate:
          criteria:
            - "At least 8 sources collected"
            - "Each source has a credibility assessment"
            - "No two sources from the same domain"
            - "At least one primary source (official docs or repo)"
          threshold: 85
          max_retries: 2

      - name: synthesize
        instructions: |
          Analyze sources and produce findings:
          - Answer each sub-question with evidence
          - Note confidence level per finding (high/medium/low)
          - Highlight contradictions between sources
          - List open questions that couldn't be resolved
        inputs:
          sources: {}
          scope: {}
        outputs:
          analysis: {}

      - name: report
        instructions: |
          Produce final research report in markdown:
          - Executive summary (3-5 sentences)
          - Findings organized by sub-question
          - Inline citations [1], [2] linking to sources
          - Recommendations section
          - Full bibliography at the end
        inputs:
          analysis: {}
          sources: {}
        outputs:
          report: {}
        quality_gate:
          criteria:
            - "Every factual claim has an inline citation"
            - "Bibliography matches all inline citations"
            - "Executive summary is under 5 sentences"
            - "Recommendations are actionable, not generic"
          threshold: 88
          max_retries: 2

Run it

terminal
$ deepwork run research \
    --topic "State of AI code generation tools — Q2 2026"

Running: research
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[1/4] scope
  ✓ 5 sub-questions defined, search strategy set

[2/4] gather
  Gate evaluation... score 82/100
  ✗ No two sources from same domain → 2 sources from github.com
  Retry 1/2 with failure context...
  Gate evaluation... score 91/100
  ✓ Passed — 11 sources, 4 high-credibility

[3/4] synthesize
  ✓ 5 findings with confidence levels, 2 contradictions noted

[4/4] report
  Gate evaluation... score 93/100
  ✓ Passed — 2,400 word report with 11 citations

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Report complete — output: .deepwork/research/output/latest/

Expected outcome: A structured markdown report with executive summary, findings per sub-question, inline citations, and a complete bibliography. The gather step's quality gate caught duplicate domains on the first pass and corrected automatically — you never had to intervene.


3. Data Validation Pipeline

The problem

You ask Claude to write a data transformation script. It handles the happy path but misses nulls, type coercions, empty arrays, and edge cases that break in production.

DeepWork solution

A workflow that analyzes the schema first, writes the transformation with explicit edge case handling, then validates it against a generated test dataset.

The job.yml

# .deepwork/data-validation/job.yml
name: data-validation
summary: "Write and validate data transformations with edge case coverage"

step_arguments:
  - name: schema
    description: "Input data schema or sample JSON"
    type: string
  - name: target
    description: "Desired output format or transformation rules"
    type: string

workflows:
  validate:
    summary: "Analyze → transform → test pipeline"
    steps:
      - name: analyze-schema
        instructions: |
          Analyze the input schema:
          - List every field with its type and nullability
          - Identify fields that could be empty strings, null, 0, or []
          - Note type coercion risks (string "123" vs number 123)
          - Flag any nested objects or arrays that need flattening
          Output a structured schema analysis with edge case inventory.
        inputs:
          schema: {}
        outputs:
          schema_analysis: {}

      - name: write-transform
        instructions: |
          Write a TypeScript transformation function:
          - Handle every edge case from the schema analysis
          - Explicit null checks — no optional chaining hiding failures
          - Type guards for union types
          - Throw descriptive errors for invalid data (not silent defaults)
          - Include JSDoc with @param and @returns
        inputs:
          schema_analysis: {}
          target: {}
        outputs:
          transform: {}
        quality_gate:
          criteria:
            - "Every nullable field has an explicit null check"
            - "Empty string and empty array handled separately from null"
            - "No optional chaining (?.) used as a null guard"
            - "Error messages include the field name and actual value"
          threshold: 90
          max_retries: 3

      - name: generate-test-data
        instructions: |
          Generate test dataset covering:
          - 1 valid happy-path record
          - 1 record with all nullable fields set to null
          - 1 record with empty strings where strings expected
          - 1 record with empty arrays where arrays expected
          - 1 record with type mismatches (string where number expected)
          - 1 record with boundary values (MAX_SAFE_INTEGER, very long strings)
          Output as a JSON array of test cases with expected outcomes.
        inputs:
          schema_analysis: {}
        outputs:
          test_data: {}

      - name: validate
        instructions: |
          Run the transform function against each test case.
          For each case, report:
          - Input summary
          - Expected outcome vs actual outcome
          - PASS or FAIL with reason
          Produce a validation summary with pass rate.
        inputs:
          transform: {}
          test_data: {}
        outputs:
          validation_report: {}
        quality_gate:
          criteria:
            - "All 6 test case types executed"
            - "Pass rate is 100% for valid inputs"
            - "Invalid inputs throw descriptive errors, not undefined behavior"
          threshold: 90
          max_retries: 2
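
Before seeing the gate in action, here is a trimmed, hypothetical sketch of what the generate-test-data step's output could look like, one entry per case type (field names invented):

// Hypothetical test_data covering the six case types listed above.
const testData = [
  { name: 'happy path',      input: { qty: 2, label: 'ok', tags: ['a'] },   expect: 'transforms cleanly' },
  { name: 'nullables null',  input: { qty: null, label: null, tags: null }, expect: 'throws, names each field' },
  { name: 'empty string',    input: { qty: 2, label: '', tags: ['a'] },     expect: 'handled separately from null' },
  { name: 'empty array',     input: { qty: 2, label: 'ok', tags: [] },      expect: 'handled separately from null' },
  { name: 'type mismatch',   input: { qty: '2', label: 'ok', tags: ['a'] }, expect: 'throws, includes actual value' },
  { name: 'boundary values', input: { qty: Number.MAX_SAFE_INTEGER, label: 'x'.repeat(10000), tags: ['a'] }, expect: 'no precision loss' },
];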

What the gate catches

Gate failed — retrying
Step: write-transform
Score: 71/100

Failed criteria:
✗ No optional chaining used as null guard
  → found: user?.address?.city
  → should be explicit null check with descriptive error
✗ Error messages include field name
  → found: throw new Error('Invalid')
  → missing field context

Retrying (attempt 2/3)...

Gate passed — proceeding
Step: write-transform
Score: 94/100

✓ Every nullable field has explicit null check
✓ Empty string / empty array handled separately from null
✓ No optional chaining as null guard
✓ Error messages include field name and actual value

Passing output to: generate-test-data
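
The gate output doesn't show the attempt-2 code, but the pattern it is pushing toward looks something like this (types and field names are hypothetical):

interface RawUser {
  address: { city: string | null } | null;
}

// Explicit null checks with field-level errors, instead of letting
// user?.address?.city silently yield undefined.
function getCity(user: RawUser): string {
  if (user.address === null) {
    throw new Error(`getCity: field "address" is null (value: ${JSON.stringify(user)})`);
  }
  if (user.address.city === null) {
    throw new Error('getCity: field "address.city" is null');
  }
  return user.address.city;
}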

Expected outcome: A transformation function that explicitly handles every edge case in your schema, plus a test dataset and validation report proving it works. The quality gate caught the common ?. anti-pattern and vague error messages on the first attempt. Second attempt fixed both without you typing a word.


4. Documentation Generation

The problem

Claude-generated docs are either too generic ("this function does things") or bloated with obvious information. They never match your team's docs format.

DeepWork solution

A workflow that reads your existing docs for tone and format, analyzes the source code, and generates docs that match your conventions — validated by quality gates.

The job.yml

# .deepwork/docs-gen/job.yml
name: docs-gen
summary: "Generate docs that match existing conventions with complete API coverage"

step_arguments:
  - name: source_files
    description: "Paths to source files to document"
    type: string
  - name: existing_docs
    description: "Path to existing docs for style reference"
    type: string

workflows:
  generate:
    summary: "Analyze conventions → generate → validate completeness"
    steps:
      - name: analyze-conventions
        instructions: |
          Read the existing docs and extract:
          - Heading hierarchy pattern (h1 = module, h2 = function, etc.)
          - Whether examples use TypeScript or JavaScript
          - Tone (formal, conversational, terse)
          - Sections present per function (description, params, returns, example, errors)
          - Any custom formatting (admonitions, callouts, badges)
          Output a style guide summary.
        inputs:
          existing_docs: {}
        outputs:
          style_guide: {}

      - name: extract-api
        instructions: |
          Parse the source files and extract:
          - Every exported function/class/type
          - Parameters with types and defaults
          - Return types
          - Thrown errors or error returns
          - Dependencies and side effects
          Output a structured API inventory.
        inputs:
          source_files: {}
        outputs:
          api_inventory: {}

      - name: generate-docs
        instructions: |
          Generate documentation following the style guide exactly.
          For each API entry:
          - Description (1-2 sentences, no filler)
          - Parameters table with types, defaults, and constraints
          - Return value description
          - At least one usage example
          - Error conditions
          Match the heading hierarchy and tone from the style guide.
        inputs:
          style_guide: {}
          api_inventory: {}
        outputs:
          documentation: {}
        quality_gate:
          criteria:
            - "Every exported function has documentation"
            - "Every function has at least one usage example"
            - "Parameter types match the source code exactly"
            - "No filler phrases: 'This function is used to', 'This is a'"
          threshold: 88
          max_retries: 2

      - name: verify-completeness
        instructions: |
          Cross-reference generated docs against the API inventory.
          Report:
          - Functions documented vs total exported
          - Missing parameters or return types
          - Examples that wouldn't compile
          - Coverage percentage
        inputs:
          documentation: {}
          api_inventory: {}
        outputs:
          coverage_report: {}
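
To make extract-api concrete, here is a hypothetical export and the details the inventory would record for it: the name, each parameter's type and default, the return type, and the error it throws.

export class SessionExpiredError extends Error {}

/**
 * Refreshes the access token for a user session.
 * @param userId - The user whose session is being refreshed
 * @param force - Refresh even if the current token is still valid (default: false)
 * @returns The new access token
 * @throws SessionExpiredError if the refresh token has expired
 */
export async function refreshToken(userId: string, force = false): Promise<string> {
  if (userId === '') {
    throw new SessionExpiredError('refreshToken: userId is empty');
  }
  return `token:${userId}${force ? ':forced' : ''}`;
}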

Run it

terminal
$ deepwork run docs-gen \
    --source_files "src/services/*.ts" \
    --existing_docs "docs/api/auth.md"

Running: docs-gen
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[1/4] analyze-conventions
  ✓ Style guide extracted — terse tone, TypeScript examples, h2 per function

[2/4] extract-api
  ✓ 23 exports found across 4 files

[3/4] generate-docs
  Gate evaluation... score 84/100
  ✗ No filler phrases → found "This function is used to" in 3 entries
  Retry 1/2 with failure context...
  Gate evaluation... score 92/100
  ✓ Passed — 23 functions documented

[4/4] verify-completeness
  ✓ 23/23 functions covered, 0 missing params, 100% coverage

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Docs complete — output: .deepwork/docs-gen/output/latest/

Expected outcome: API documentation that matches your existing docs' tone, formatting, and structure — with every exported function covered. The filler-phrase gate caught "This function is used to" on the first pass. After retry, every description leads with what the function does, not what it "is used to do."


5. Test-Driven Feature Development

The problem

You ask Claude to build a feature. It writes the implementation first, then generates tests that just confirm what the code already does — never catching real bugs.

DeepWork solution

Enforce TDD discipline: write failing tests from the spec first, then implement to make them pass. Quality gates ensure tests are red before implementation starts.

The job.yml

# .deepwork/tdd-feature/job.yml
name: tdd-feature
summary: "Build features test-first with enforced red-green-refactor discipline"

step_arguments:
  - name: spec
    description: "Feature specification or user story"
    type: string

workflows:
  implement:
    summary: "Spec → red tests → green implementation → refactor"
    steps:
      - name: translate-spec
        instructions: |
          Transform the feature spec into an engineering spec:
          - List each requirement as a testable assertion
          - Identify edge cases implied by the requirements
          - Define the public API surface (function signatures, types)
          - Note any dependencies or external integrations
          Do NOT write any implementation. Spec only.
        inputs:
          spec: {}
        outputs:
          engineering_spec: {}

      - name: red-tests
        instructions: |
          Write failing tests for every requirement in the engineering spec.
          Rules:
          - Each test references a specific requirement by number
          - Tests MUST fail if run now (implementation doesn't exist)
          - Include edge case tests from the spec
          - Use descriptive test names: "should reject negative quantity"
          - Vitest syntax (describe/it/expect)
          Output test file only. No implementation stubs.
        inputs:
          engineering_spec: {}
        outputs:
          test_file: {}
        quality_gate:
          criteria:
            - "Every requirement has at least one test"
            - "Test names describe behavior, not implementation"
            - "No implementation code — only test assertions"
            - "Edge cases from spec are covered"
          threshold: 88
          max_retries: 2

      - name: green-implementation
        instructions: |
          Write the implementation to make ALL tests pass.
          Rules:
          - Implement only what the tests require — no extras
          - Every function has TypeScript types
          - Handle errors explicitly — no silent failures
          After implementation, list which tests now pass.
        inputs:
          test_file: {}
          engineering_spec: {}
        outputs:
          implementation: {}
        quality_gate:
          criteria:
            - "All test assertions are satisfiable by the implementation"
            - "No functionality beyond what tests require"
            - "TypeScript types on all public functions"
            - "Error handling is explicit, not swallowed"
          threshold: 85
          max_retries: 2

      - name: refactor
        instructions: |
          Review the implementation for cleanup:
          - Extract magic numbers to constants
          - Simplify complex conditionals
          - Remove dead code
          - Ensure all tests still pass after changes
          Output the refactored implementation with a changelog.
        inputs:
          implementation: {}
          test_file: {}
        outputs:
          refactored: {}
          changelog: {}

What it looks like

terminal
$ deepwork run tdd-feature \
    --spec "Add quantity discount: 10+ items get 15% off, 50+ get 25% off.
            Discount stacks with member discount but caps at 40% total."

Running: tdd-feature
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[1/4] translate-spec
  ✓ 6 requirements, 4 edge cases (negative qty, zero, cap boundary, float precision)

[2/4] red-tests
  Gate evaluation... score 91/100
  ✓ Passed — 10 test cases, all requirements covered

[3/4] green-implementation
  Gate evaluation... score 87/100
  ✓ Passed — all 10 assertions satisfiable

[4/4] refactor
  ✓ 2 magic numbers extracted, 1 conditional simplified

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Feature complete — output: .deepwork/tdd-feature/output/latest/
  Tests: 10 written (6 requirements + 4 edge cases)
  Refactors: 3 applied, all tests green

Expected outcome: A feature built test-first, with every requirement traced to a test, implementation written to pass those tests, and a refactoring pass that keeps everything green. The red-tests gate ensures Claude actually writes tests from the spec — not from an implementation it already wrote.
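
To ground that, here is the shape of tests the red-tests step might emit for the discount spec above, in Vitest syntax. The applyDiscount function is hypothetical and does not exist yet, which is exactly why these tests start red:

import { describe, it, expect } from 'vitest';
// Hypothetical module under test; nothing is implemented yet, so every test fails.
import { applyDiscount } from './discount';

describe('quantity discount', () => {
  it('should apply 15% off at 10+ items (req 1)', () => {
    expect(applyDiscount({ quantity: 10, unitPrice: 100 })).toBe(850);
  });

  it('should apply 25% off at 50+ items (req 2)', () => {
    expect(applyDiscount({ quantity: 50, unitPrice: 100 })).toBe(3750);
  });

  it('should cap stacked member discount at 40% total (req 3)', () => {
    // 25% quantity + 20% member would be 45%; the cap holds it at 40%.
    expect(applyDiscount({ quantity: 50, unitPrice: 100, memberDiscount: 0.2 })).toBe(3000);
  });

  it('should reject negative quantity (edge case)', () => {
    expect(() => applyDiscount({ quantity: -1, unitPrice: 100 })).toThrow();
  });
});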

Why TDD matters more with AI: When a human writes tests after implementation, they at least have context about what they intended. When Claude writes tests after implementation, it tests what the code does — not what it should do. TDD-first with quality gates is the only way to get tests that actually catch bugs instead of confirming existing behavior.


Patterns Across All Five Workflows

If you look at these workflows side by side, the same structural patterns emerge. They are what separate "Claude did something" from "Claude did the right thing":

- Small steps with explicit inputs and outputs, so each step works from structured context instead of a sprawling chat history
- Quality gates with concrete, checkable criteria and a numeric threshold, not vibes
- Automatic retries that feed the gate's failure reasoning back into the next attempt
- Standards codified once in SKILL.md files, so the bar is identical on every run

The Learn Loop

After running any workflow a few times, run deepwork learn <job-name> --last-n 5. DeepWork analyzes what triggered retries across your last 5 runs and updates the SKILL.md files automatically — tightening criteria, adding missed edge cases, improving prompt clarity.

The result is workflows that self-improve: first-run pass rates climb from ~85% to 93%+ after two learn cycles. You spend less time per run, not more.

Run These Workflows in 10 Minutes

Install DeepWork, copy any workflow above, and run it. No accounts, no cloud setup, no API keys beyond Claude.

terminal
$ brew tap unsupervisedcom/deepwork
$ brew install deepwork

Get early access for hosted workflows, scheduled runs, and team skill sharing.



Further reading

Getting Started with DeepWork — Your First Quality-Gated AI Workflow →
How to Build Repeatable Claude Code Workflows with Quality Gates →
Why Claude Code Output Quality Degrades (And How to Fix It) →
DeepWork vs Manual Claude Code Workflows: What Changes When You Add Quality Gates →

Questions or feedback? Open an issue on GitHub.