# How to Build Repeatable Claude Code Workflows with Quality Gates
You've automated something once with Claude Code. It worked great. Then you tried to run it again next week—different context, slightly different phrasing—and the output was half as good. You tweaked the prompt, got it back to 80%, and thought "close enough." You ran it a third time and it was worse again.
Ad-hoc Claude Code usage doesn't scale. Every run is a fresh gamble. The quality you got once isn't guaranteed next time. There's no enforcement layer, no memory, no feedback mechanism. What you need is a workflow—a defined, repeatable process with checkpoints that catch drift before it reaches you.
This tutorial walks through building exactly that. We'll use a competitive research workflow as our example—something most development teams run regularly. By the end, you'll have a template you can adapt to any repeated Claude Code task.
## Why Ad-Hoc Usage Fails at Scale
Before building the solution, let's be precise about the problem. Ad-hoc Claude Code usage fails for three reasons:
- No step definition. "Research our competitors" is not a workflow. Without defined steps, Claude makes different choices each run—which sources to check, what to look for, how deep to go. Results vary wildly.
- No quality enforcement. There's nothing checking whether each step's output actually meets your standards before moving to the next step. One weak step poisons everything downstream.
- No learning mechanism. You do this task weekly. But each run starts from zero. Improvements you discovered last week don't carry forward. You're perpetually debugging the same patterns.
The fix isn't better prompts. It's architecture: define the steps, gate the outputs, capture what works.
## The Three-Layer Architecture
Repeatable workflows need three layers:
- Define — A job.yml that specifies every step, what skill to use, and what quality threshold passes.
- Execute — A runner that processes each step, runs the quality gate, and retries on failure—without you watching.
- Learn — An analysis pass after each run that updates the skill files with what worked and what didn't.
Let's build each layer for a competitive research workflow.
## Step 1: Define Your Job
A job file is a YAML declaration of your workflow. It's checked into your repo, versioned, and shared with your team. Every run uses the same definition—no prompt guessing, no remembered context.
Here's an example jobs/competitive-research.yml:
```yaml
name: competitive-research
description: Weekly competitive intelligence report
schedule: weekly

steps:
  - id: gather-sources
    skill: research/gather-sources
    input:
      competitors: ["CompetitorA", "CompetitorB", "CompetitorC"]
      sources: ["product-pages", "changelog", "job-postings", "social"]
    quality:
      threshold: 75
      criteria:
        - "All listed competitors covered"
        - "At least 3 sources per competitor"
        - "Data from last 7 days only"

  - id: extract-signals
    skill: research/extract-signals
    depends_on: gather-sources
    quality:
      threshold: 80
      criteria:
        - "Product changes identified with dates"
        - "Pricing changes flagged if present"
        - "Hiring trends noted (eng, sales, ml)"
        - "No speculation—only verifiable facts"

  - id: synthesize-report
    skill: research/synthesize-report
    depends_on: extract-signals
    output_format: markdown
    quality:
      threshold: 85
      criteria:
        - "Executive summary under 200 words"
        - "Key signals sorted by impact (high/medium/low)"
        - "Recommended actions included"
        - "No duplicate information from gather step"
```
Notice what this file does: it breaks the workflow into three discrete steps, each with its own skill and quality threshold. Step 2 only runs if step 1 passes. Step 3 only runs if step 2 passes. Failure is local—it doesn't contaminate downstream steps.
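A runner consumes this file by resolving each step's depends_on before executing anything. Here's a rough sketch of that ordering logic—`order_steps` is an illustrative name, not DeepWork's internals, and it assumes the YAML has already been parsed into dicts:

```python
# Illustrative sketch, not DeepWork's actual runner: resolve execution
# order from each step's depends_on field. Assumes the job.yml has
# already been parsed (e.g. with PyYAML) into a list of dicts.
# No cycle detection, for brevity.
steps = [
    {"id": "synthesize-report", "depends_on": "extract-signals"},
    {"id": "gather-sources"},
    {"id": "extract-signals", "depends_on": "gather-sources"},
]

def order_steps(steps):
    """Return step ids in dependency order; fail fast on unknown deps."""
    by_id = {s["id"]: s for s in steps}
    ordered, seen = [], set()

    def visit(step_id):
        if step_id in seen:
            return
        dep = by_id[step_id].get("depends_on")
        if dep is not None:
            if dep not in by_id:
                raise ValueError(f"unknown dependency: {dep}")
            visit(dep)  # run the dependency's subtree first
        seen.add(step_id)
        ordered.append(step_id)

    for step in steps:
        visit(step["id"])
    return ordered

print(order_steps(steps))
# ['gather-sources', 'extract-signals', 'synthesize-report']
```

Even with the input deliberately shuffled, the steps come out gather → extract → synthesize. That ordering is what keeps failure local: a step never runs before its dependency has passed its gate.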
## Step 2: Write SKILL.md Files
Each step references a skill—a persistent instruction file that Claude reads before executing that step. Skills are the antidote to session-loss. Instead of re-explaining your standards in every prompt, you write them once and reference them forever.
A skill file has two parts: a YAML frontmatter block and a Markdown body.
Here's .claude/skills/research/extract-signals/SKILL.md:
````markdown
---
name: extract-signals
version: 3
description: Extract competitive signals from raw research data
quality_criteria:
  - Product changes identified with release dates
  - Pricing changes flagged with before/after values
  - Hiring trends quantified (headcount delta if visible)
  - Only verifiable facts, no inference
last_updated: 2026-03-07
pass_rate: 0.87
---

# Extract Competitive Signals

## Context

You receive raw research data from the gather-sources step. Your job is to identify meaningful signals—changes that indicate strategic direction, competitive pressure, or market positioning shifts.

## What Counts as a Signal

**High-impact signals:**

- Pricing changes (any direction)
- New product lines or feature categories
- Major engineering hires (VP-level, large batch)
- Public partnerships or integrations announced

**Medium-impact signals:**

- Feature additions to existing products
- Rebrand or positioning language changes
- Sales team expansion in specific geographies

**Low-impact signals:**

- Minor UI changes
- Blog posts (unless announcing product)
- Hiring for non-strategic roles

## Output Format

Return a JSON array. Each signal:

```json
{
  "competitor": "string",
  "signal_type": "pricing|product|hiring|partnership|positioning",
  "impact": "high|medium|low",
  "description": "one sentence, factual",
  "evidence_url": "source URL",
  "date_observed": "YYYY-MM-DD"
}
```

## Common Mistakes to Avoid

- Don't flag things that haven't changed—only delta from last week
- Don't speculate about intent—describe what you observe
- Don't include signals older than 7 days (check evidence_url dates)
````
That's a real skill file. Specific output format, concrete examples, explicit list of what not to do. This is what makes results repeatable: Claude reads the same instructions every run.
Skill file best practices: Keep skills focused on one thing. "Research" is too broad. "Extract competitive signals from raw HTML" is right-sized. The more specific the skill, the tighter the quality gate can be, and the more reliably it improves over learn cycles.
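One payoff of the frontmatter block: tooling can read a skill's metadata (version, pass rate) without parsing the body. Here's a minimal hand-rolled sketch, assuming only the flat key: value fields shown above—`read_frontmatter` is an illustrative name, and real tooling would use a YAML parser:

```python
# Hypothetical helper: pull the YAML frontmatter out of a SKILL.md.
# Handles only flat "key: value" fields; list fields like
# quality_criteria are skipped for brevity.
def read_frontmatter(text):
    # Frontmatter sits between the first two "---" lines.
    _, fm, _ = text.split("---\n", 2)
    meta = {}
    for line in fm.splitlines():
        if ":" in line and not line.startswith((" ", "-")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

skill = """---
name: extract-signals
version: 3
pass_rate: 0.87
---
# Extract Competitive Signals
"""
meta = read_frontmatter(skill)
# meta == {"name": "extract-signals", "version": "3", "pass_rate": "0.87"}
```

Keeping the metadata machine-readable is what lets the learn loop (Step 5) bump versions and pass rates automatically.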
## Step 3: Add Quality Gates
A quality gate evaluates each step's output before allowing the workflow to proceed. It runs Claude against your criteria and produces a score. Below threshold: automatic retry with the failure reason as additional context. Above threshold: proceed to next step.
The gate prompt looks something like this (you don't write this manually—DeepWork generates it from your job.yml):
```
Review this output against the quality criteria below.
Score 0-100. Return JSON: {"score": N, "passed": bool, "failures": ["..."]}.

Output to evaluate:
[step output here]

Quality criteria:
- Product changes identified with release dates
- Pricing changes flagged with before/after values
- Hiring trends quantified (headcount delta if visible)
- Only verifiable facts, no inference

Threshold: 80
```
If the score is 79, the runner feeds the failure reasons back to Claude and retries. If it fails three times, it escalates—flags the run as needing human review rather than silently producing bad output.
This is the part that changes everything. You stop being the quality gate. The system enforces standards. You review outputs that already passed—not hunt for things that went wrong.
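The retry behavior is simple to picture in code. Here's a sketch of the loop, where `run_step` and `score_output` stand in for the real Claude calls—all names are illustrative, not DeepWork's API:

```python
# Sketch of a gate-and-retry loop. The max_attempts=3 default matches
# the escalation behavior described above: three failures means the run
# gets flagged for human review instead of silently shipping bad output.
def run_with_gate(run_step, score_output, threshold, max_attempts=3):
    """Run a step, re-prompting with failure reasons until it passes
    the gate or exhausts its attempts."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = run_step(feedback)       # feedback is None on the first try
        result = score_output(output)     # {"score": N, "failures": [...]}
        if result["score"] >= threshold:
            return {"passed": True, "output": output, "attempts": attempt}
        feedback = result["failures"]     # fed back into the next attempt
    return {"passed": False, "output": output, "attempts": max_attempts}

# Toy demonstration: the step "improves" once it sees gate feedback.
def fake_step(feedback):
    return "draft v2" if feedback else "draft v1"

def fake_gate(output):
    if output == "draft v2":
        return {"score": 85, "failures": []}
    return {"score": 79, "failures": ["Pricing changes not flagged"]}

result = run_with_gate(fake_step, fake_gate, threshold=80)
# result == {"passed": True, "output": "draft v2", "attempts": 2}
```

The key detail is that failure reasons from the gate become input to the next attempt, so retries converge instead of re-rolling blind.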
## Step 4: Run It
With DeepWork, running the workflow is one command:
```bash
deepwork run competitive-research
```
The CLI reads your job.yml, loads the skill files for each step, executes them in order, runs quality gates, retries failures, and writes the final output to outputs/competitive-research/[timestamp].md. You come back in 20 minutes to a validated report.
Because quality gates handle enforcement, you can fire and forget. Background execution without babysitting. This is the leverage AI coding tools promise but rarely deliver.
## Step 5: Build the Learn Loop
This is the step most people skip. It's also what makes quality compound over time instead of staying flat.
After a run completes, run the learn command:
```bash
deepwork learn competitive-research
```
The learn pass does three things:
- Analyzes gate scores. Which steps had the highest retry rates? Which criteria were most commonly failed? What patterns show up in successful outputs that aren't in the skill file?
- Proposes skill improvements. Based on the analysis, it suggests specific edits to the relevant SKILL.md files—new examples, clarified criteria, additional "mistakes to avoid."
- Updates version and pass rate. The skill's frontmatter gets bumped with the new version number and updated pass rate. You have a full history of improvement.
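The version-and-pass-rate bump in point 3 is mechanical. A sketch, using the frontmatter fields shown earlier—`bump_frontmatter` is an illustrative name, not DeepWork's internals:

```python
# Hypothetical learn-pass step: recompute pass rate from this cycle's
# gate results and bump the skill version. Field names match the
# SKILL.md frontmatter shown earlier.
def bump_frontmatter(meta, gate_results):
    """meta: current frontmatter fields as a dict.
    gate_results: one boolean per gated attempt (True = passed)."""
    passes = sum(1 for passed in gate_results if passed)
    return {
        **meta,
        "version": meta["version"] + 1,
        "pass_rate": round(passes / len(gate_results), 2),
    }

meta = {"name": "extract-signals", "version": 3, "pass_rate": 0.87}
new = bump_frontmatter(meta, [True] * 91 + [False] * 9)
# new == {"name": "extract-signals", "version": 4, "pass_rate": 0.91}
```

Dividing passes by total gated attempts is what produces the 0.87 → 0.91 kind of movement you see recorded in a learn pass.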
Here's what a learn pass might add to extract-signals/SKILL.md:
```markdown
# Added by learn pass on 2026-03-14 (run #7, pass rate: 0.87 → 0.91)

## Common Mistakes to Avoid (updated)

- Don't flag things that haven't changed—only delta from last week
- Don't speculate about intent—describe what you observe
- Don't include signals older than 7 days (check evidence_url dates)
- **NEW:** CompetitorA's changelog is at /changelog not /releases—check both
- **NEW:** Job posting counts fluctuate daily; use 7-day rolling average
```
Run seven became run eight became run fifteen. Each learn pass bakes in a discovered pattern. The workflow gets smarter without you doing anything.
## What the Numbers Look Like
Here's what this architecture produces in practice, measured across real workflow runs:
- Run 1 (no learn cycles): ~83% average gate score
- Run 3 (2 learn cycles): ~88% average gate score
- Run 7 (6 learn cycles): ~92% average gate score
- Run 15 (14 learn cycles): 95%+ consistently
The improvement is mechanical, not magical. Each learn pass adds 1-3 specific improvements to the skill files. Accumulated over 10-15 runs, the skill definitions become extremely precise. Claude isn't getting smarter—your instructions are.
The compounding effect: A workflow you've run 15 times with learn cycles is categorically different from one you've run 15 times ad-hoc. Ad-hoc gives you 15 independent rolls of the dice. Structured workflows give you 15 iterations of progressive refinement. One improves nothing. The other builds a reusable, high-quality asset.
## Adapting This to Other Workflows
The competitive research example is just a template. The same architecture applies to any repeated Claude Code task:
- Code review: read PR diff → apply coding standards → check edge cases → generate review comments. Gate criteria: "All changed functions reviewed," "Security implications noted," "No false positives on existing patterns."
- Content generation: research topic → draft outline → write sections → quality-check for accuracy → format for publishing. Gate criteria: "All claims sourced," "Reading grade level appropriate," "No passive voice in headlines."
- Incident analysis: gather logs → identify root cause → draft postmortem → extract action items. Gate criteria: "Timeline accurate," "Root cause specific, not vague," "Action items assigned and time-bounded."
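For instance, the code-review variant could be declared with the same job.yml shape—the step ids, skill paths, and thresholds here are invented for illustration:

```yaml
name: code-review
description: Gated review pass over a PR diff
steps:
  - id: read-diff
    skill: review/read-diff
    quality:
      threshold: 80
      criteria:
        - "All changed functions reviewed"
  - id: check-standards
    skill: review/check-standards
    depends_on: read-diff
    quality:
      threshold: 85
      criteria:
        - "Security implications noted"
        - "No false positives on existing patterns"
```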
The pattern is the same: decompose into steps, gate each step, run learn cycles. The specific criteria change. The architecture doesn't. For copy-paste job.yml configs across five production workflows, see 5 DeepWork Workflows That Replace Manual Claude Code Babysitting.
## Getting Started with DeepWork
DeepWork packages this entire architecture into a CLI. Install it, define your first job, run it, and see validated output in minutes. No infrastructure setup, no custom tooling.
```bash
brew tap unsupervisedcom/deepwork
brew install deepwork
```
Initialize your first workflow:
```bash
deepwork init my-workflow
```
This generates the job.yml scaffold and empty skill files. Fill in your steps and criteria, run it, and let the learn loop do the rest.
## Build Your First Repeatable Workflow
Install DeepWork and turn your best Claude Code session into a workflow that runs reliably every time—with quality gates and learn loops built in.
```bash
brew tap unsupervisedcom/deepwork
brew install deepwork
```
Or get early access for hosted workflows, scheduled runs, and team skill sharing.
## Summary
Ad-hoc Claude Code usage doesn't compound. Every run is a fresh start, quality varies, and you end up being the quality gate yourself. That's not leverage.
Repeatable workflows fix this with three pieces:
- job.yml — Defines every step, what skill to use, and what passes quality review. Versioned, shared, consistent.
- SKILL.md files — Persistent instructions that Claude reads on every run. No more re-teaching patterns. Skills are composable and improve over time.
- Learn cycles — After each run, analyze what worked and update the skill files. Quality compounds automatically—no manual intervention.
The result: Claude Code workflows that run in the background, enforce quality automatically, and get reliably better with each iteration. That's the difference between a one-time automation and a production-grade workflow.
Questions or feedback? Open an issue on GitHub.