# Implement Task with Verification
Execute task implementation steps with automated quality verification using LLM-as-Judge for critical artifacts.
## User Input
```text
Task File: $ARGUMENTS
```
---
## CRITICAL: You Are an ORCHESTRATOR ONLY
**Your role is DISPATCH and AGGREGATE. You do NOT do the work.**
Build each sub-agent's context properly!
CRITICAL: For each sub-agent (implementation and evaluation), you must provide:
- Task file path
- Step number
- Item number (if applicable)
- Artifact path (if applicable)
### What You DO
- Read the task file ONCE (Phase 1 only)
- Launch sub-agents via Task tool
- Receive reports from sub-agents
- Mark stages complete after judge confirmation
- Aggregate results and report to user
### What You NEVER Do
| Prohibited Action | Why | What To Do Instead |
|-------------------|-----|-------------------|
| Read implementation outputs | Context bloat → command loss | Sub-agent reports what it created |
| Read reference files | Sub-agent's job to understand patterns | Include path in sub-agent prompt |
| Read artifacts to "check" them | Context bloat → forget verifications | Launch judge agent |
| Evaluate code quality yourself | Not your job, causes forgetting | Launch judge agent |
| Skip verification "because simple" | ALL verifications are mandatory | Launch judge agent anyway |
### Anti-Rationalization Rules
**If you think:** "I should read this file to understand what was created"
**→ STOP.** The sub-agent's report tells you what was created. Use that information.
**If you think:** "I'll quickly verify this looks correct"
**→ STOP.** Launch a judge agent. That's not your job.
**If you think:** "This is too simple to need verification"
**→ STOP.** If the task specifies verification, launch the judge. No exceptions.
**If you think:** "I need to read the reference file to write a good prompt"
**→ STOP.** Put the reference file PATH in the sub-agent prompt. Sub-agent reads it.
### Why This Matters
Orchestrators who read files themselves = context overflow = command loss = forgotten steps. Every time.
Orchestrators who "quickly verify" = skip judge agents = quality collapse = failed artifacts.
**Your context window is precious. Protect it. Delegate everything.**
---
## Overview
This command orchestrates multi-step task implementation with:
1. **Sequential execution** respecting step dependencies
2. **Parallel execution** where dependencies allow (see the batching sketch below)
3. **Automated verification** using judge agents for critical steps
4. **Panel of LLMs (PoLL)** for high-stakes artifacts
5. **Aggregated voting** with position bias mitigation
6. **Stage tracking** with confirmation after each judge passes
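Respecting dependencies while still running independent steps in parallel reduces, mechanically, to grouping steps into batches whose prerequisites have all completed. A minimal sketch of that layering (the step names and dependency map are illustrative, not part of the task-file format):
```python
def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; all steps in a batch can run in
    parallel once every earlier batch has completed (Kahn-style layering)."""
    remaining = {step: set(d) for step, d in deps.items()}
    batches: list[list[str]] = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if not d)
        if not ready:
            raise ValueError("Cyclic dependency among steps")
        batches.append(ready)
        for step in ready:
            del remaining[step]
        for d in remaining.values():
            d.difference_update(ready)
    return batches

# Example: Steps 2 and 3 both depend on Step 1; Step 4 depends on 2 and 3.
print(parallel_batches({"1": set(), "2": {"1"}, "3": {"1"}, "4": {"2", "3"}}))
# → [['1'], ['2', '3'], ['4']]
```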
---
## Complete Workflow Overview
```
Phase 1: Load Task
│
▼
Phase 2: Execute Steps
│
├─── For each step in dependency order:
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Launch developer agent │
│ │ (implementation) │
│ └─────────────────┬───────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Launch judge agent(s) │
│ │ (verification per #### Verification section) │
│ └─────────────────┬───────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Judge PASS? → Mark step complete in task file │
│ │ Judge FAIL? → Fix and re-verify (max 2 retries) │
│ └─────────────────────────────────────────────────┘
│
▼
Phase 4: Final Report
```
---
## Phase 1: Load and Analyze Task
**This is the ONLY phase where you read a file.**
### Step 1.1: Load Task Details
Read the task file ONCE:
```bash
Read .specs/tasks/$TASK_FILE
```
**After this read, you MUST NOT read any other files for the rest of execution.**
### Step 1.2: Identify Implementation Steps
Parse the `## Implementation Process` section:
- List all steps with dependencies
- Identify which steps have `Parallel with:` annotations
- Classify each step's verification needs from `#### Verification` sections:
| Verification Level | When to Use | Judge Configuration |
|--------------------|-------------|---------------------|
| None | Simple operations (mkdir, delete) | Skip verification |
| Single Judge | Non-critical artifacts | 1 judge, threshold 4.0/5.0 |
| Panel of 2 Judges | Critical artifacts | 2 judges, median voting, threshold 4.5/5.0 |
| Per-Item Judges | Multiple similar items | 1 judge per item, parallel |
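One way to picture the result of this classification: each step ends up with a small verification config. The sketch below is illustrative only; the field names are hypothetical, and the thresholds are the typical values from the table:
```python
from dataclasses import dataclass

@dataclass
class VerificationPlan:
    """Hypothetical parsed form of one step's #### Verification section."""
    level: str        # "none" | "single" | "panel" | "per_item"
    judges: int       # judges per artifact (0 when level == "none")
    threshold: float  # passing score; 0.0 means no verification

# Typical defaults implied by the table above (override per task file):
LEVEL_DEFAULTS = {
    "none":     VerificationPlan("none", 0, 0.0),
    "single":   VerificationPlan("single", 1, 4.0),
    "panel":    VerificationPlan("panel", 2, 4.5),
    "per_item": VerificationPlan("per_item", 1, 4.0),  # threshold assumed, not fixed by the table
}
```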
### Step 1.3: Create Todo List
Create TodoWrite with all implementation steps, marking verification requirements:
```json
{
"todos": [
{"content": "Step 1: [Title] - [Verification Level]", "status": "pending", "activeForm": "Implementing Step 1"},
{"content": "Step 2: [Title] - [Verification Level]", "status": "pending", "activeForm": "Implementing Step 2"}
]
}
```
---
## Phase 2: Execute Implementation Steps
For each step in dependency order:
### Pattern A: Simple Step (No Verification)
**1. Launch Developer Agent:**
Use Task tool with:
- **Agent Type**: `developer`
- **Model**: As specified in the step, or `opus` by default
- **Description**: "Implement Step [N]: [Title]"
- **Prompt**:
```
Implement Step [N]: [Step Title]
Task File: .specs/tasks/$TASK_FILE
Step Number: [N]
Your task:
- Execute ONLY Step [N]: [Step Title]
- Do NOT execute any other steps
- Follow the Expected Output and Success Criteria exactly
When complete, report:
1. What files were created/modified (paths)
2. Confirmation that success criteria are met
3. Any issues encountered
```
**2. Use Agent's Report (No Verification)**
- Agent reports what was created → Use this information
- **DO NOT read the created files yourself**
- This pattern has NO verification (simple operations)
**3. Mark Step Complete**
- Update task file:
- Mark step title with `[DONE]` (e.g., `### Step 1: Setup [DONE]`)
- Mark step's subtasks as `[X]` complete
- Update todo to `completed`
---
### Pattern B: Critical Step (Panel of 2 Evaluations)
**1. Launch Developer Agent:**
Use Task tool with:
- **Agent Type**: `developer`
- **Model**: As specified in the step, or `opus` by default
- **Description**: "Implement Step [N]: [Title]"
- **Prompt**:
```
Implement Step [N]: [Step Title]
Task File: .specs/tasks/$TASK_FILE
Step Number: [N]
Your task:
- Execute ONLY Step [N]: [Step Title]
- Do NOT execute any other steps
- Follow the Expected Output and Success Criteria exactly
When complete, report:
1. What files were created/modified (paths)
2. Confirmation of completion
3. Self-critique summary
```
**2. Wait for Completion**
- Receive the agent's report
- Note the artifact path(s) from the report
- **DO NOT read the artifact yourself**
**3. Launch 2 Evaluation Agents in Parallel (MANDATORY):**
**⚠️ MANDATORY: This pattern requires launching evaluation agents. You MUST launch these evaluations. Do NOT skip. Do NOT verify yourself.**
**Use `developer` agent type for evaluations**
**Evaluation 1 & 2** (launch both in parallel with the same prompt structure):
```
Read ./plugins/sadd/tasks/judge.md for evaluation methodology.
Evaluate artifact at: [artifact_path from implementation agent report]
**Chain-of-Thought Requirement:** Justification MUST be provided BEFORE score for each criterion.
Rubric:
[paste rubric table from #### Verification section]
Context:
- Read .specs/tasks/$TASK_FILE
- Verify Step [N] ONLY: [Step Title]
- Threshold: [from #### Verification section]
- Reference pattern: [if specified in #### Verification section]
You can verify the artifact works: run tests, check imports, validate syntax.
Return: scores per criterion with evidence, overall weighted score, PASS/FAIL, improvements if FAIL.
```
**4. Aggregate Results** (see the Panel Voting Algorithm in Phase 4):
- Calculate the median score per criterion
- Flag high-variance criteria (std > 1.0; for two scores this equals a difference > 2.0)
- PASS if the weighted overall score ≥ threshold
**5. On PASS: Mark Step Complete**
- Update task file:
- Mark step title with `[DONE]` (e.g., `### Step 2: Create Service [DONE]`)
- Mark step's subtasks as `[X]` complete
- Update todo to `completed`
- Record judge scores in tracking
**6. On FAIL:**
- Present issues to user
- Ask: "Should I attempt to fix these issues?"
- If yes, re-implement and re-verify (max 2 retries)
---
### Pattern C: Multi-Item Step (Per-Item Evaluations)
For steps that create multiple similar items:
**1. Launch Developer Agents in Parallel (one per item):**
Use Task tool for EACH item (launch all in parallel):
- **Agent Type**: `developer`
- **Model**: As specified or `opus` by default
- **Description**: "Implement Step [N], Item: [Name]"
- **Prompt**:
```
Implement Step [N], Item: [Item Name]
Task File: .specs/tasks/$TASK_FILE
Step Number: [N]
Item: [Item Name]
Your task:
- Create ONLY [item_name] from Step [N]
- Do NOT create other items or steps
- Follow the Expected Output and Success Criteria exactly
When complete, report:
1. File path created
2. Confirmation of completion
3. Self-critique summary
```
**2. Wait for All Completions**
- Collect all agent reports
- Note all artifact paths
- **DO NOT read any of the created files yourself**
**3. Launch Evaluation Agents in Parallel (one per item)**
**⚠️ MANDATORY: Launch evaluation agents. Do NOT skip. Do NOT verify yourself.**
**Use `developer` agent type for evaluations**
For each item:
```
Read ./plugins/sadd/tasks/judge.md for evaluation methodology.
Evaluate artifact at: [item_path from implementation agent report]
**Chain-of-Thought Requirement:** Justification MUST be provided BEFORE score for each criterion.
Rubric:
[paste rubric from #### Verification section]
Context:
- Read .specs/tasks/$TASK_FILE
- Verify Step [N]: [Step Title]
- Verify ONLY this Item: [Item Name]
- Threshold: [from #### Verification section]
You can verify the artifact works: run tests, check syntax, confirm dependencies.
Return: scores with evidence, overall score, PASS/FAIL, improvements if FAIL.
```
**4. Collect All Results**
**5. Report Aggregate:**
- Items passed: X/Y
- Items needing revision: [list with specific issues]
**6. On ALL PASS: Mark Step Complete**
- Update task file:
- Mark step title with `[DONE]` (e.g., `### Step 3: Create Items [DONE]`)
- Mark step's subtasks as `[X]` complete
- Update todo to `completed`
- Record pass rate in tracking
**7. If Any FAIL:**
- Present failing items with judge feedback
- Ask: "Should I fix failing items?"
- If yes, re-implement only failing items and re-verify
---
## ⚠️ CHECKPOINT: Before Proceeding to Phase 4
Before moving to the final report, verify you followed the rules:
- [ ] Did you launch developer agents for ALL implementations?
- [ ] Did you launch evaluation agents for ALL verifications?
- [ ] Did you mark steps complete ONLY after judge PASS?
- [ ] Did you avoid reading ANY artifact files yourself?
**If you read files other than the task file, you are doing it wrong. STOP and restart.**
---
## Phase 3: Verification Specifications (Reference)
Task files define verification in `#### Verification` sections with:
### Required Elements
1. **Level**: None / Single / Panel (2) / Per-Item
2. **Artifact(s)**: Path(s) to file(s) being verified
3. **Threshold**: Passing score (typically 4.0/5.0 or 4.5/5.0)
4. **Rubric**: Table with criteria, weights, and descriptions
5. **Reference Pattern**: (Optional) Path to example of good implementation
### Rubric Format
```markdown
| Criterion | Weight | Description |
|-----------|--------|-------------|
| [Name 1] | 0.XX | [What to evaluate] |
| [Name 2] | 0.XX | [What to evaluate] |
| ... | ... | ... |
```
Weights MUST sum to 1.0.
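Since a malformed rubric silently skews every downstream score, the weights are worth checking mechanically. A minimal sketch, assuming the criteria have already been parsed out of the markdown table:
```python
def validate_rubric(weights: list[float], tol: float = 1e-6) -> None:
    """Check that rubric weights are positive and sum to 1.0."""
    if any(w <= 0 for w in weights):
        raise ValueError(f"All weights must be positive, got {weights}")
    total = sum(weights)
    if abs(total - 1.0) > tol:
        raise ValueError(f"Weights must sum to 1.0, got {total:.3f}")

# Example: a three-criterion rubric weighted 0.5 / 0.3 / 0.2 passes silently.
validate_rubric([0.5, 0.3, 0.2])
```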
### Scoring Scale
For each criterion:
- **1 (Poor)**: Does not meet requirements
- **2 (Below Average)**: Multiple issues, partially meets requirements
- **3 (Adequate)**: Meets basic requirements
- **4 (Good)**: Meets all requirements, few minor issues
- **5 (Excellent)**: Exceeds requirements
---
## Phase 4: Aggregation and Reporting
### Panel Voting Algorithm
When using 2+ evaluations, follow these manual computation steps (a code sketch follows after Step 5):
- Think in steps and output each step's result separately!
- Do not skip steps!
#### Step 1: Collect Scores per Criterion
Create a table with each criterion and scores from all evaluations:
| Criterion | Eval 1 | Eval 2 | Median | Difference |
|-----------|--------|--------|--------|------------|
| [Name 1] | X.X | X.X | ? | ? |
| [Name 2] | X.X | X.X | ? | ? |
#### Step 2: Calculate Median for Each Criterion
For 2 evaluations: **Median = (Score1 + Score2) / 2**
For 3+ evaluations: Sort scores, take middle value (or average of two middle values if even count)
#### Step 3: Check for High Variance
**High variance** = evaluators disagree significantly (difference > 2.0 points)
Formula: `|Eval1 - Eval2| > 2.0` → Flag as high variance
#### Step 4: Calculate Weighted Overall Score
Multiply each criterion's median by its weight and sum:
```
Overall = (Criterion1_Median × Weight1) + (Criterion2_Median × Weight2) + ...
```
#### Step 5: Determine Pass/Fail
Compare overall score to threshold:
- `Overall ≥ Threshold` → **PASS** ✅
- `Overall < Threshold` → **FAIL** ❌
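Put together, Steps 1 through 5 amount to the computation below. This is a sketch under one assumption: each evaluation arrives as a dict of scores keyed by criterion name (the structure and example values are illustrative):
```python
from statistics import median

def panel_vote(evals: list[dict[str, float]],
               weights: dict[str, float],
               threshold: float,
               variance_limit: float = 2.0):
    """Aggregate 2+ evaluations: median per criterion, variance flags,
    weighted overall score, and the PASS/FAIL decision (Steps 1-5)."""
    criteria = list(weights)
    # Steps 1-2: median per criterion (for two scores this is their mean)
    medians = {c: median(e[c] for e in evals) for c in criteria}
    # Step 3: flag criteria where evaluators disagree by more than the limit
    flagged = [c for c in criteria
               if max(e[c] for e in evals) - min(e[c] for e in evals) > variance_limit]
    # Step 4: weighted overall score
    overall = sum(medians[c] * weights[c] for c in criteria)
    # Step 5: compare against the threshold
    return overall, overall >= threshold, flagged

# Worked example with illustrative scores and a 4.5 panel threshold:
evals = [{"coverage": 4.0, "patterns": 4.5}, {"coverage": 4.5, "patterns": 4.5}]
overall, passed, flagged = panel_vote(evals, {"coverage": 0.6, "patterns": 0.4}, 4.5)
# medians: coverage 4.25, patterns 4.5 → overall = 4.25*0.6 + 4.5*0.4 = 4.35 → FAIL
```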
---
### Handling Disagreement
If evaluations significantly disagree (difference > 2.0 on any criterion):
1. Flag the criterion
2. Present both evaluators' reasoning
3. Ask user: "Evaluators disagree on [criterion]. Review manually?"
4. If yes: present evidence, get user decision
5. If no: use median (conservative approach)
### Final Report
After all steps complete:
```markdown
## Implementation Summary
### Steps Completed
| Step | Title | Status | Verification | Score | Judge Confirmed |
|------|-------|--------|--------------|-------|-----------------|
| 1 | [Title] | ✅ | Skipped | N/A | - |
| 2 | [Title] | ✅ | Panel (2) | 4.5/5 | ✅ |
| 3 | [Title] | ✅ | Per-Item | 5/5 passed | ✅ |
### Verification Summary
- Total steps: X
- Steps with verification: Y
- Passed on first try: Z
- Required revision: W
- Final pass rate: [percentage]
### High-Variance Criteria (Evaluators Disagreed)
- [Criterion] in [Step]: Eval 1 scored X, Eval 2 scored Y
### Task File Updated
- All step titles marked `[DONE]`
- All step subtasks marked `[X]`
- Definition of Done items verified
### Recommendations
1. [Any follow-up actions]
2. [Suggested improvements]
```
---
## Execution Flow Diagram
```
┌──────────────────────────────────────────────────────────────┐
│ IMPLEMENT TASK WITH VERIFICATION │
├──────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: Load Task │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Read .specs/tasks/$TASK_FILE → Parse steps │ │
│ │ → Extract #### Verification specs → Create TodoWrite │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 2: Execute Steps (Respecting Dependencies) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ For each step: │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌───────────┐ │ │
│ │ │ developer │───▶│ Judge Agent │───▶│ PASS? │ │ │
│ │ │ Agent │ │ (verify) │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └───────────┘ │ │
│ │ │ │ │ │
│ │ Yes No │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌────────┐ Fix & │ │ │
│ │ │ Mark │ Retry │ │ │
│ │ │Complete│ ↺ │ │ │
│ │ └────────┘ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 4: Aggregate & Report │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Collect all verification results │ │
│ │ → Calculate aggregate metrics │ │
│ │ → Generate final report │ │
│ │ → Present to user │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
```
---
## Usage Examples
### Example 1: Implementing a Feature
```
User: /implement-task task-add-validation.md
Phase 1: Loading task...
Task: "Add form validation service"
Steps identified: 4 steps
Verification plan (from #### Verification sections):
- Step 1: No verification (directory creation)
- Step 2: Panel of 2 evaluations (ValidationService)
- Step 3: Per-item evaluations (3 validators)
- Step 4: Single evaluation (integration)
Phase 2: Executing...
Step 1: Launching developer agent...
Agent: "Implement Step 1: Create Directory Structure..."
Result: ✅ Directories created
Verification: Skipped (simple operation)
Status: ✅ COMPLETE
Step 2: Launching developer agent...
Agent: "Implement Step 2: Create ValidationService..."
Result: Files created, tests passing
Launching 2 judge agents in parallel...
Judge 1: 4.3/5.0 - PASS
Judge 2: 4.5/5.0 - PASS
Panel Result: 4.4/5.0 ✅
Status: ✅ COMPLETE (Judge Confirmed)
[Continue for all steps...]
Phase 4: Final Report
Implementation complete.
- 4/4 steps completed
- 6 artifacts verified
- All passed first try
- Final pass rate: 100%
```
### Example 2: Handling Verification Failure
```
Step 3 Implementation complete.
Launching judge agents...
Judge 1: 3.5/5.0 - FAIL (threshold 4.0)
Judge 2: 3.2/5.0 - FAIL
Issues found:
- Test Coverage: 2.5/5
Evidence: "Missing edge case tests for empty input"
Justification: "Success criteria requires edge case coverage"
- Pattern Adherence: 3.0/5
Evidence: "Uses custom Result type instead of project standard"
Justification: "Should use existing Result<T, E> from src/types"
Should I attempt to fix these issues? [Y/n]
User: Y
Launching developer agent with feedback...
Agent: "Fix Step 3: Address judge feedback..."
Result: Issues fixed, tests added
Re-launching judge agents...
Judge 1: 4.2/5.0 - PASS
Judge 2: 4.4/5.0 - PASS
Panel Result: 4.3/5.0 ✅
Status: ✅ COMPLETE (Judge Confirmed)
```
---
## Error Handling
### Implementation Failure
If developer agent reports failure:
1. Present the failure details to the user
2. Ask clarifying questions that could help resolve the issue
3. Launch developer agent again with clarifications
### Judge Disagreement
If judges disagree significantly (difference > 2.0):
1. Present both perspectives with evidence
2. Ask user to resolve: "Judges disagree. Your decision?"
3. Proceed based on user decision
### Max Retries Exceeded
If a step fails verification after 2 retries:
1. Report all attempts and feedback
2. Ask: "Step [N] failed verification 2 times. Options:"
- Continue anyway (with noted quality concerns)
- Stop implementation
- Provide additional guidance and retry
---
## Checklist
Before completing implementation:
### Context Protection (CRITICAL)
- [ ] Read ONLY the task file (`.specs/tasks/$TASK_FILE`) - no other files
- [ ] Did NOT read implementation outputs, reference files, or artifacts
- [ ] Used sub-agent reports for status - did NOT read files to "check"
### Delegation
- [ ] ALL implementations done by `developer` agents via Task tool
- [ ] ALL evaluations done by `developer` agents via Task tool
- [ ] Did NOT perform any verification yourself
- [ ] Did NOT skip any verification steps
### Stage Tracking
- [ ] Each step marked complete ONLY after judge PASS
- [ ] Task file updated after each step completion:
- Step title marked with `[DONE]`
- Subtasks marked with `[X]`
- [ ] Todo list updated after each step completion
### Execution Quality
- [ ] All steps executed in dependency order
- [ ] Parallel steps launched simultaneously (not sequentially)
- [ ] Each developer agent received focused prompt with exact step
- [ ] All critical artifacts evaluated by judges
- [ ] Panel voting used for high-stakes artifacts
- [ ] Chain-of-thought requirement included in all evaluation prompts
- [ ] Failed evaluations addressed (max 2 retries per step)
- [ ] Final report generated with judge confirmation status
- [ ] User informed of any evaluator disagreements