Tutorial

How to Build Repeatable Claude Code Workflows with Quality Gates

14 min read · March 14, 2026

You've automated something once with Claude Code. It worked great. Then you tried to run it again next week—different context, slightly different phrasing—and the output was half as good. You tweaked the prompt, got it back to 80%, and thought "close enough." You ran it a third time and it was worse again.

Ad-hoc Claude Code usage doesn't scale. Every run is a fresh gamble. The quality you got once isn't guaranteed next time. There's no enforcement layer, no memory, no feedback mechanism. What you need is a workflow—a defined, repeatable process with checkpoints that catch drift before it reaches you.

This tutorial walks through building exactly that. We'll use a competitive research workflow as our example—something most development teams run regularly. By the end, you'll have a template you can adapt to any repeated Claude Code task.

Why Ad-Hoc Usage Fails at Scale

Before building the solution, let's be precise about the problem. Ad-hoc Claude Code usage fails for three reasons:

  1. No enforcement layer. Quality is whatever Claude happens to produce that run; nothing checks the output against a standard before it reaches you.
  2. No memory. Every session starts from zero, so the standards you explained last time have to be re-explained, and rarely are in exactly the same words.
  3. No feedback mechanism. Nothing captures what made a good run good, so the next run can't build on it.

The fix isn't better prompts. It's architecture: define the steps, gate the outputs, capture what works.

For a deeper look at the root causes of quality drift (context compaction, session loss, manual steering overhead), see Why Claude Code Output Quality Degrades (And How to Fix It).

The Three-Layer Architecture

Repeatable workflows need three layers:

  1. Define — A job.yml that specifies every step, what skill to use, and what quality threshold passes.
  2. Execute — A runner that processes each step, runs the quality gate, and retries on failure—without you watching.
  3. Learn — An analysis pass after each run that updates the skill files with what worked and what didn't.

Let's build each layer for a competitive research workflow.
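
On disk, the three layers map onto a small set of files. The layout below is assembled from the paths used later in this tutorial (jobs/, .claude/skills/, outputs/); treat it as an illustration, not a required structure.

```
jobs/
  competitive-research.yml            # Define: steps, skills, quality thresholds
.claude/skills/research/
  gather-sources/SKILL.md             # instructions Claude reads for each step;
  extract-signals/SKILL.md            # the learn pass edits these files over time
  synthesize-report/SKILL.md
outputs/
  competitive-research/
    [timestamp].md                    # Execute: one validated report per run
```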

Step 1: Define Your Job

A job file is a YAML declaration of your workflow. It's checked into your repo, versioned, and shared with your team. Every run uses the same definition—no prompt guessing, no remembered context.

Here's a jobs/competitive-research.yml:

name: competitive-research
description: Weekly competitive intelligence report
schedule: weekly

steps:
  - id: gather-sources
    skill: research/gather-sources
    input:
      competitors: ["CompetitorA", "CompetitorB", "CompetitorC"]
      sources: ["product-pages", "changelog", "job-postings", "social"]
    quality:
      threshold: 75
      criteria:
        - "All listed competitors covered"
        - "At least 3 sources per competitor"
        - "Data from last 7 days only"

  - id: extract-signals
    skill: research/extract-signals
    depends_on: gather-sources
    quality:
      threshold: 80
      criteria:
        - "Product changes identified with dates"
        - "Pricing changes flagged if present"
        - "Hiring trends noted (eng, sales, ml)"
        - "No speculation—only verifiable facts"

  - id: synthesize-report
    skill: research/synthesize-report
    depends_on: extract-signals
    output_format: markdown
    quality:
      threshold: 85
      criteria:
        - "Executive summary under 200 words"
        - "Key signals sorted by impact (high/medium/low)"
        - "Recommended actions included"
        - "No duplicate information from gather step"

Notice what this file does: it breaks the workflow into three discrete steps, each with its own skill and quality threshold. Step 2 only runs if step 1 passes. Step 3 only runs if step 2 passes. Failure is local—it doesn't contaminate downstream steps.
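
To make that gating behavior concrete, here is a minimal sketch of a runner loop. It assumes PyYAML for parsing the job file and uses two placeholder functions, execute_step and gate_passes, standing in for the skill execution and the quality gate covered in Step 3. It illustrates the control flow, not DeepWork's actual implementation.

```python
import yaml  # PyYAML; parses the job.yml shown above

def execute_step(step, upstream=None):
    """Placeholder: run Claude Code with the step's skill file and input."""
    raise NotImplementedError

def gate_passes(output, quality):
    """Placeholder: score output against quality["criteria"] vs quality["threshold"]."""
    raise NotImplementedError

def run_job(path):
    with open(path) as f:
        job = yaml.safe_load(f)
    results = {}  # step id -> output, only for steps that passed their gate
    for step in job["steps"]:
        dep = step.get("depends_on")
        if dep and dep not in results:
            # The dependency never passed, so this step never runs:
            # failure stays local instead of contaminating downstream steps.
            continue
        output = execute_step(step, upstream=results.get(dep))
        if gate_passes(output, step["quality"]):
            results[step["id"]] = output
    return results
```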

Step 2: Write SKILL.md Files

Each step references a skill—a persistent instruction file that Claude reads before executing that step. Skills are the antidote to session loss. Instead of re-explaining your standards in every prompt, you write them once and reference them forever.

A skill file has two parts: a YAML frontmatter block and a Markdown body.

Here's .claude/skills/research/extract-signals/SKILL.md:

---
name: extract-signals
version: 3
description: Extract competitive signals from raw research data
quality_criteria:
  - Product changes identified with release dates
  - Pricing changes flagged with before/after values
  - Hiring trends quantified (headcount delta if visible)
  - Only verifiable facts, no inference
last_updated: 2026-03-07
pass_rate: 0.87
---

# Extract Competitive Signals

## Context
You receive raw research data from the gather-sources step. Your job is to identify meaningful signals—changes that indicate strategic direction, competitive pressure, or market positioning shifts.

## What Counts as a Signal

**High-impact signals:**
- Pricing changes (any direction)
- New product lines or feature categories
- Major engineering hires (VP-level, large batch)
- Public partnerships or integrations announced

**Medium-impact signals:**
- Feature additions to existing products
- Rebrand or positioning language changes
- Sales team expansion in specific geographies

**Low-impact signals:**
- Minor UI changes
- Blog posts (unless announcing product)
- Hiring for non-strategic roles

## Output Format

Return a JSON array. Each signal:
```json
{
  "competitor": "string",
  "signal_type": "pricing|product|hiring|partnership|positioning",
  "impact": "high|medium|low",
  "description": "one sentence, factual",
  "evidence_url": "source URL",
  "date_observed": "YYYY-MM-DD"
}
```

## Common Mistakes to Avoid
- Don't flag things that haven't changed—only delta from last week
- Don't speculate about intent—describe what you observe
- Don't include signals older than 7 days (check evidence_url dates)

That's a real skill file. Specific output format, concrete examples, explicit list of what not to do. This is what makes results repeatable: Claude reads the same instructions every run.

Skill file best practices: Keep skills focused on one thing. "Research" is too broad. "Extract competitive signals from raw HTML" is right-sized. The more specific the skill, the tighter the quality gate can be, and the more reliably it improves over learn cycles.

Step 3: Add Quality Gates

A quality gate evaluates each step's output before allowing the workflow to proceed. It runs Claude against your criteria and produces a score. Below threshold: automatic retry with the failure reason as additional context. Above threshold: proceed to next step.

The gate prompt looks something like this (you don't write this manually—DeepWork generates it from your job.yml):

Review this output against the quality criteria below.
Score 0-100. Return JSON: {"score": N, "passed": bool, "failures": ["..."]}.

Output to evaluate:
[step output here]

Quality criteria:
- Product changes identified with release dates
- Pricing changes flagged with before/after values
- Hiring trends quantified (headcount delta if visible)
- Only verifiable facts, no inference

Threshold: 80

If the score is 79, the runner feeds the failure reasons back to Claude and retries. If it fails three times, it escalates—flags the run as needing human review rather than silently producing bad output.
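
Sketched in the same hedged style as the earlier snippet, that retry-and-escalate loop might look like this. The names run_step and score_output are assumed callables (the second returning the gate's score/passed/failures JSON), and the three-attempt limit mirrors the description above rather than DeepWork's internals.

```python
MAX_ATTEMPTS = 3

def run_with_gate(step, run_step, score_output):
    """Retry a step with the gate's failure reasons as feedback; escalate after three misses."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        output = run_step(step, feedback=feedback)
        verdict = score_output(output, step["quality"]["criteria"])
        if verdict["score"] >= step["quality"]["threshold"]:
            return output  # passed the gate: safe to hand to the next step
        feedback = verdict["failures"]  # e.g. ["Pricing changes not flagged"]
    raise RuntimeError(
        f"{step['id']} failed its quality gate {MAX_ATTEMPTS} times; "
        "flagging for human review instead of passing bad output downstream"
    )
```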

This is the part that changes everything. You stop being the quality gate. The system enforces standards. You review outputs that already passed—not hunt for things that went wrong.

Step 4: Run It

With DeepWork, running the workflow is one command:

deepwork run competitive-research

The CLI reads your job.yml, loads the skill files for each step, executes them in order, runs quality gates, retries failures, and writes the final output to outputs/competitive-research/[timestamp].md. You come back in 20 minutes to a validated report.

Because quality gates handle enforcement, you can fire and forget. Background execution without babysitting. This is the leverage AI coding tools promise but rarely deliver.

Step 5: Build the Learn Loop

This is the step most people skip. It's also what makes quality compound over time instead of staying flat.

After a run completes, run the learn command:

deepwork learn competitive-research

The learn pass does three things:

  1. Analyzes gate scores. Which steps had the highest retry rates? Which criteria failed most often? What patterns show up in successful outputs that aren't in the skill file?
  2. Proposes skill improvements. Based on the analysis, it suggests specific edits to the relevant SKILL.md files—new examples, clarified criteria, additional "mistakes to avoid."
  3. Updates version and pass rate. The skill's frontmatter gets bumped with the new version number and updated pass rate. You have a full history of improvement.

Here's what a learn pass might add to extract-signals/SKILL.md:

# Added by learn pass on 2026-03-14 (run #7, pass rate: 0.87 → 0.91)

## Common Mistakes to Avoid (updated)
- Don't flag things that haven't changed—only delta from last week
- Don't speculate about intent—describe what you observe
- Don't include signals older than 7 days (check evidence_url dates)
- **NEW:** CompetitorA's changelog is at /changelog not /releases—check both
- **NEW:** Job posting counts fluctuate daily; use 7-day rolling average
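
The frontmatter gets the matching bump described in point 3. Based on the values in this example (version 3, pass rate 0.87 → 0.91, learn pass dated 2026-03-14), the updated fields might look roughly like this, with unchanged fields omitted:

```yaml
version: 4            # was 3
last_updated: 2026-03-14
pass_rate: 0.91       # was 0.87
```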

Run seven became run eight became run fifteen. Each learn pass bakes in a discovered pattern. The workflow gets smarter without you doing anything.

What the Numbers Look Like

Here's what this architecture produces in practice, measured across real workflow runs:

The improvement is mechanical, not magical. Each learn pass adds 1-3 specific improvements to the skill files. Accumulated over 10-15 runs, the skill definitions become extremely precise. Claude isn't getting smarter—your instructions are.

The compounding effect: A workflow you've run 15 times with learn cycles is categorically different from one you've run 15 times ad-hoc. Ad-hoc gives you 15 independent rolls of the dice. Structured workflows give you 15 iterations of progressive refinement. One improves nothing. The other builds a reusable, high-quality asset.

Adapting This to Other Workflows

The competitive research example is just a template. The same architecture applies to any repeated Claude Code task.

The pattern is the same: decompose into steps, gate each step, run learn cycles. The specific criteria change. The architecture doesn't. For copy-paste job.yml configs across five production workflows, see 5 DeepWork Workflows That Replace Manual Claude Code Babysitting.
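
If you want to start from a blank page instead, here is a stripped-down skeleton that reuses the fields from the competitive-research example above; the step ids, skill paths, and criteria are placeholders to swap for your own.

```yaml
name: my-workflow
description: One sentence describing the output this produces
schedule: weekly  # optional, as in the example above

steps:
  - id: first-step
    skill: my-skills/first-step
    quality:
      threshold: 75
      criteria:
        - "A concrete, checkable statement about the output"

  - id: second-step
    skill: my-skills/second-step
    depends_on: first-step
    quality:
      threshold: 80
      criteria:
        - "Another verifiable criterion"
```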

Getting Started with DeepWork

DeepWork packages this entire architecture into a CLI. Install it, define your first job, run it, and see validated output in minutes. No infrastructure setup, no custom tooling.

brew tap unsupervisedcom/deepwork
brew install deepwork

Initialize your first workflow:

deepwork init my-workflow

This generates the job.yml scaffold and empty skill files. Fill in your steps and criteria, run it, and let the learn loop do the rest.

Build Your First Repeatable Workflow

Install DeepWork and turn your best Claude Code session into a workflow that runs reliably every time—with quality gates and learn loops built in.

brew tap unsupervisedcom/deepwork
brew install deepwork

Or get early access for hosted workflows, scheduled runs, and team skill sharing.


Summary

Ad-hoc Claude Code usage doesn't compound. Every run is a fresh start, quality varies, and you end up being the quality gate yourself. That's not leverage.

Repeatable workflows fix this with three pieces:

  1. A job.yml that defines every step, its skill, and its quality threshold.
  2. A runner that executes each step, scores the output against your criteria, retries on failure, and escalates when retries run out.
  3. A learn loop that feeds what each run revealed back into the skill files, so the next run starts from a better baseline.

The result: Claude Code workflows that run in the background, enforce quality automatically, and get reliably better with each iteration. That's the difference between a one-time automation and a production-grade workflow.


Why Claude Code Output Quality Degrades (And How to Fix It) →
DeepWork vs Manual Claude Code Workflows: What Changes When You Add Quality Gates →
Getting Started with DeepWork — Your First Quality-Gated AI Workflow →
5 DeepWork Workflows That Replace Manual Claude Code Babysitting →

Questions or feedback? Open an issue on GitHub.