TDD + Mutation Testing Workflow

Five phases: Spec → Test (RED) → Implement (GREEN) → Mutate → Review.

AI-written tests tend to mirror implementation. Human-written tests should capture intent. This workflow keeps the source of truth in requirements rather than in code.

spec-writer → isolated-test-writer → Implementation → Mutation → slop-test-reviewer → COMMIT
(requirements)   (failing tests)      (pass tests)     (verify)     (quality)

This workflow describes the full TDD approach with optional specialized agents. The core phases (spec, test, implement, mutate, review) apply regardless of whether you use specialized agents or write tests directly.

Workflow Overview

This workflow enforces TDD discipline through three specialized agents with mechanical path restrictions. Each agent operates in its own context with hard boundaries that prevent contamination.

flowchart TD
    A[Requirements] --> B[spec-writer]
    B --> C[Spec]
    C --> D[isolated-test-writer]
    D --> E[Tests RED]
    E --> F[Implementation]
    F --> G[Tests GREEN]
    G --> H[Mutation]
    H --> I[slop-test-reviewer]
    I --> J[COMMIT]

    style B fill:#3498db,color:#fff
    style D fill:#3498db,color:#fff
    style I fill:#3498db,color:#fff
    style E fill:#e74c3c,color:#fff
    style G fill:#27ae60,color:#fff
    style J fill:#2ecc71,color:#000

The workflow guarantees that tests are written from requirements before any implementation exists, maintaining true RED → GREEN progression.

Note: The specialized agents (spec-writer, isolated-test-writer) are optional. The workflow works with:

  • Full enforcement: use all three agents for maximum isolation
  • Partial enforcement: use spec-writer for requirements, write tests directly
  • Minimal enforcement: write specs and tests directly, use slop-test-reviewer for quality

Use a fast test runner, a mutation testing tool, and a formatter/linter to keep feedback tight. Commands vary by repo.

Phase 1: Specification (spec-writer)

Spec-writer reads docs/, specs/, requirements/. FORBIDDEN: src/, lib/, app/. Reading the implementation before writing specs turns existing bugs into documented features.

EARS notation helps structure requirements:

  • Ubiquitous: "The system shall..."
  • Event-driven: "When [event], the system shall..."
  • State-driven: "While [state], the system shall..."
  • Unwanted behavior: "If [condition], the system shall..."
  • Optional feature: "Where [feature enabled], the system shall..."

Example scoring requirements:

SCORE-REQ-1: Score Calculation
  SCORE-REQ-1.1: When rocket lands, the system shall calculate score based on altitude, stability, and speed
  SCORE-REQ-1.2: The score shall be a number between 0 and 100
  SCORE-REQ-1.3: If altitude is higher, the score shall be higher
  SCORE-REQ-1.4: If stability is worse, the score shall be lower
  SCORE-REQ-1.5: If landing speed exceeds safe threshold, the score shall be reduced

Given/When/Then acceptance criteria make requirements testable:

SCORE-REQ-1.1:
  Given: A rocket with altitude=500, stability="stable", speed=5
  When: Score is calculated
  Then: Score is between 0 and 100

SCORE-REQ-1.3:
  Given: Two rockets with identical stability and speed
  When: One has altitude=1000 and one has altitude=100
  Then: Higher altitude rocket has higher score

Skip specs for typos, constants, and obvious bugs. Use them for new features, complex logic, security-sensitive code, or team alignment.

Hooks enforce this mechanically: a PreToolUse hook blocks implementation reads before they happen. Instructions are best-effort; hooks are deterministic.
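
A minimal sketch of such a hook in TypeScript. It assumes the hook runner pipes the tool call to stdin as JSON and treats a non-zero exit as a block; check your agent framework's actual hook contract before copying:

// block-impl-reads.ts: hypothetical PreToolUse hook for the spec-writer agent
import { exit, stderr, stdin } from "node:process";

const FORBIDDEN = ["src/", "lib/", "app/"];

let raw = "";
stdin.on("data", (chunk) => (raw += chunk));
stdin.on("end", () => {
  // Assumed payload shape: { tool_name: "Read", tool_input: { file_path: "src/scoring.ts" } }
  const call = JSON.parse(raw);
  const path: string = call.tool_input?.file_path ?? "";
  if (FORBIDDEN.some((dir) => path.startsWith(dir) || path.includes(`/${dir}`))) {
    stderr.write(`Blocked: spec-writer may not read implementation (${path})\n`);
    exit(2); // non-zero exit = block the tool call
  }
  exit(0); // allow everything else
});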

Phase 2: Test Writing - RED Phase

Write tests from requirements ONLY. Avoid reading implementation code before tests exist.

Option A - Direct Test Writing (Recommended): Apply TDD principles directly from specs using Given/When/Then structure. This approach produces better results for most workflows.

Option B - Enforced Isolation (Optional): Use the isolated-test-writer agent for strict path enforcement. Note: current models often produce slop when using strict isolation on autopilot - this option works best with active human guidance and exceptionally clear requirements:

ALLOWED: tests/, specs/, docs/, requirements/, *.md, *.spec.*, *.test.*, test utilities, fixtures

FORBIDDEN: src/, lib/, app/, packages/, internal/, core/, services/, models/, handlers/, implementation modules

Exception: Reading public API signatures from .d.ts type definition files is acceptable.
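
For example, a declaration file exposes the public contract with no behavior to copy (the signature matches the scoring implementation later in this workflow):

// scoring.d.ts: public API surface only; there are no bugs here to encode as features
export declare function calculateScore(
  altitude: number,
  stability: "perfect" | "stable" | "unstable",
  landingSpeed: number
): number;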

This is the RED phase of TDD. Tests should fail because no implementation exists yet. If a test passes immediately, one of three things is true:

  1. Implementation already exists (not greenfield development)
  2. The test is checking the wrong thing (tautological assertion)
  3. The test has a bug

Given/When/Then structure keeps tests readable:

describe("Requirement: SCORE-REQ-1.3 - Higher altitude produces higher score", () => {
  test("should score higher altitude better than lower altitude", () => {
    // Given: Two scenarios with identical stability and speed, different altitudes
    const lowAltitude = 100;
    const highAltitude = 1000;
    const stability = "stable";
    const speed = 5;

    // When: Scores are calculated
    const lowScore = calculateScore(lowAltitude, stability, speed);
    const highScore = calculateScore(highAltitude, stability, speed);

    // Then: Higher altitude produces higher score
    expect(highScore).toBeGreaterThan(lowScore);
  });
});

Example test structure (boundary coverage):

describe("Score Calculation Requirements", () => {
  describe("SCORE-REQ-1.2: Score bounds", () => {
    test("minimum score is 0", () => {
      const score = calculateScore(-1000, "unstable", 100);
      expect(score).toBeGreaterThanOrEqual(0);
    });

    test("maximum score is 100", () => {
      const score = calculateScore(10000, "perfect", 0);
      expect(score).toBeLessThanOrEqual(100);
    });
  });

  describe("SCORE-REQ-1.3: Altitude affects score", () => {
    test.each([
      [0, 50],
      [100, 500],
      [500, 1000],
      [1000, 2000]
    ])("altitude %i should score lower than altitude %i", (low, high) => {
      const lowScore = calculateScore(low, "stable", 5);
      const highScore = calculateScore(high, "stable", 5);
      expect(highScore).toBeGreaterThan(lowScore);
    });
  });

  describe("SCORE-REQ-1.5: Landing speed penalty", () => {
    test("safe speed (5 m/s) has no penalty", () => {
      const safeScore = calculateScore(500, "stable", 5);
      expect(safeScore).toBeGreaterThan(0);
    });

    test("excessive speed (50 m/s) reduces score", () => {
      const safeScore = calculateScore(500, "stable", 5);
      const fastScore = calculateScore(500, "stable", 50);
      expect(fastScore).toBeLessThan(safeScore);
    });
  });
});

Tests organized by requirement ID. Parameterized for boundaries. Behavioral assertions (higher/lower), not exact values.

Integration test example (zero mocks):

/**
 * IMPORTANT: These tests verify BEHAVIOR from REQUIREMENTS.
 *
 * NO MOCKS. This is an integration test suite for the useFlight hook.
 * We test the hook's contract (what it returns and how it behaves),
 * not its implementation details.
 */

Test structure follows requirements:

describe("Requirement: FLIGHT-REQ-1 - Start idle", () => {
  test("initial status is idle", () => {
    const { result } = renderHook(() => useFlight(defaultRocket));
    expect(result.current.status).toBe("idle");
  });

  test("initial position is at ground level", () => {
    const { result } = renderHook(() => useFlight(defaultRocket));
    expect(result.current.position).toEqual({ altitude: 0, velocity: 0 });
  });
});

describe("Requirement: FLIGHT-REQ-2 - Countdown before launch", () => {
  test("countdown starts at 3", () => {
    const { result } = renderHook(() => useFlight(defaultRocket));
    act(() => result.current.launch());
    expect(result.current.countdown).toBe(3);
  });

  test("countdown decrements each second", () => {
    const { result } = renderHook(() => useFlight(defaultRocket));
    act(() => result.current.launch());

    act(() => vi.advanceTimersByTime(1000));
    expect(result.current.countdown).toBe(2);

    act(() => vi.advanceTimersByTime(1000));
    expect(result.current.countdown).toBe(1);
  });
});

The tests use an advanceUntil() helper that abstracts timing implementation:

const advanceUntil = (condition: () => boolean, maxIterations = 100) => {
  let iterations = 0;
  while (!condition() && iterations < maxIterations) {
    act(() => vi.advanceTimersByTime(100));
    iterations++;
  }
  // Re-check the condition itself: it may have become true on the final iteration
  if (!condition()) {
    throw new Error("Condition never met");
  }
};

// Usage:
test("flight eventually lands", () => {
  const { result } = renderHook(() => useFlight(defaultRocket));
  act(() => result.current.launch());

  advanceUntil(() => result.current.status === "landed");
  expect(result.current.status).toBe("landed");
});

This abstraction prevents tests from breaking when timing implementation changes. Tests care about "eventually lands," not "lands after exactly 8.3 seconds."

When to use isolated-test-writer vs a general agent:

Use isolated-test-writer for new features, legacy test retrofits where you want to avoid reading code, and security-critical work where requirements must drive tests.

Use a general agent for bug fixes (where you need to read the failing code), characterization tests, and refactoring where the goal is to lock in existing behavior.

Common pitfall - contamination example:

// CONTAMINATED: AI read implementation and encoded its bugs
test("should return null when user not found", () => {
  // Implementation has a bug: returns null instead of throwing error
  // AI read this and wrote a test that expects null
  const user = findUser("unknown");
  expect(user).toBeNull(); // Encodes the bug as expected behavior
});

// ISOLATED: From requirements alone
test("should throw UserNotFoundError when user not found", () => {
  // Requirements say: "throw UserNotFoundError if user doesn't exist"
  // Test written before implementation, from requirements
  expect(() => findUser("unknown")).toThrow(UserNotFoundError);
});

The contaminated test makes the bug permanent. The isolated test catches it.

Phase 3: Implementation - GREEN Phase

Write minimum code to make tests pass. The general agent now has access to tests and implements functionality.

This is the GREEN phase: tests go from failing (red) to passing (green). The goal is the simplest implementation that satisfies requirements as verified by tests.

The general agent reads the tests and implements. The specialized agents (spec-writer, isolated-test-writer) have finished their work; isolation is no longer needed.

Warning signs that AI development is drifting:

| Warning sign | What it looks like | Why it happens | Recovery |
| --- | --- | --- | --- |
| Loop | Same test fails repeatedly; AI tries variations | AI doesn't understand the root cause | Stop. Read the failing test and implementation. Explain the gap to the AI. |
| Scope creep | Features appear that weren't in requirements | AI "helpfully" adds nice-to-haves | Revert. Point the AI to requirements: "Only implement what's specified." |
| Test deletion | AI removes failing tests instead of fixing code | Taking the easiest path to green | Never allow. Tests define correctness. |
| Excessive mocking | More mocks than assertions | AI avoiding the real implementation | Reduce mocks and use real dependencies; otherwise the test is too integrated. |
| Context pollution | AI references implementation details in new tests | Contamination spreading | Stop. Use isolated-test-writer for new tests. |

Commit-on-green:

# After tests pass
git add src/scoring.ts tests/unit/scoring.test.ts
git commit -m "feat(scoring): implement altitude-based scoring

Implements SCORE-REQ-1.1 through SCORE-REQ-1.5.
All 41 tests passing. Ready for mutation testing."

# Clean context for next feature

Small commits keep context focused. Each commit = working state. Reverting to last green is safe.

Implementation example (scoring.ts):

export function calculateScore(
  altitude: number,
  stability: "perfect" | "stable" | "unstable",
  landingSpeed: number
): number {
  // Altitude contribution (0-50 points)
  const altitudeScore = Math.min(50, altitude / 40);

  // Stability contribution (10-30 points)
  const stabilityScore = stability === "perfect" ? 30
    : stability === "stable" ? 20
    : 10;

  // Speed penalty (2 points per m/s above the 5 m/s safe threshold; uncapped)
  const speedPenalty = Math.max(0, (landingSpeed - 5) * 2);

  const rawScore = altitudeScore + stabilityScore - speedPenalty;

  // Clamp to 0-100
  return Math.max(0, Math.min(100, rawScore));
}
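
A worked example, reading the formula straight through:

// calculateScore(500, "stable", 5)
//   altitudeScore  = Math.min(50, 500 / 40)   = 12.5
//   stabilityScore = 20 ("stable")
//   speedPenalty   = Math.max(0, (5 - 5) * 2) = 0
//   rawScore       = 12.5 + 20 - 0            = 32.5 (within 0-100, no clamping)
calculateScore(500, "stable", 5); // 32.5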

Keep the implementation small and direct; avoid premature abstraction. 41 tests verify correctness.

Phase 4: Mutation Testing - Verification

After tests pass, verify they catch faults. Mutation testing introduces deliberate bugs. If tests still pass, they're weak.

Target specific files: new code, complex logic, security-sensitive paths. Mutating everything is expensive.

Mutation reports show which mutants survived and which tests ran.

Survivor example:

// Original
if (altitude < 0) return 0;

// Mutant (survived)
if (altitude <= 0) return 0;  // Changed < to <=

// Why it survived
// No test checks calculateScore(0, "stable", 5)
// Tests only check negative values and positive values

Iterative test improvement: survivors reveal missing tests.

// Mutation reveals gap: no test for altitude=0
describe("SCORE-REQ-1.2: Score bounds - boundary cases", () => {
  test("altitude of 0 is valid and scores above minimum", () => {
    const score = calculateScore(0, "stable", 5);
    expect(score).toBeGreaterThan(0);
  });
});

Run mutation testing again. If the mutant dies, test quality improved.

Missing test vs equivalent mutant: a survivor from altitude * 0.5 → altitude * 0.6 indicates a missing test. A survivor from i++ → ++i in a standalone statement is an equivalent mutant (no observable behavior change) and can be accepted. Expect some equivalents.

Phase 5: Quality Review (slop-test-reviewer)

Slop-test-reviewer applies a seven-pattern checklist: mock abuse, tautological assertions, existence-only checks, implementation mirroring, happy-path-only coverage, copy-paste variations, and variable amnesia.

Detection signals: mock count > assertions, only toHaveProperty checks, no toThrow tests, or 10+ identical test structures.
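
Some of these signals can be approximated mechanically. A crude counting sketch (a hypothetical reviewer heuristic, not a linter; the regexes assume Vitest-style vi.fn/vi.mock and ts-mockito-style when() as used in the examples below):

// mock-signal.ts: flag test files with more mock setups than assertions
import { readFileSync } from "node:fs";

const source = readFileSync(process.argv[2], "utf8"); // pass a test file path
const mockSetups = (source.match(/vi\.(fn|mock)\(|when\(/g) ?? []).length;
const assertions = (source.match(/expect\(/g) ?? []).length;

if (mockSetups > assertions) {
  console.warn(`Possible mock abuse: ${mockSetups} mock setups vs ${assertions} assertions`);
}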

Remediation examples:

Mock abuse - before/after:

// BEFORE (slop)
const mockUser = { id: 123, name: "Alice", role: "admin" };
when(userService.getUser(123)).thenReturn(mockUser);

const result = permissions.check(123, "delete");

expect(result.user.role).toBe("admin"); // Verifying mocked data

// AFTER (remediation)
const mockUser = { id: 123, name: "Alice", role: "admin" };
when(userService.getUser(123)).thenReturn(mockUser);

const result = permissions.check(123, "delete");

expect(result.allowed).toBe(true); // Verifying behavior
expect(result.reason).toBe("admin role has delete permission");

Copy-paste variations - before/after:

// BEFORE (slop - 10 nearly identical tests)
test("scores altitude 100", () => {
  expect(calculateScore(100, "stable", 5)).toBe(22.5);
});
test("scores altitude 200", () => {
  expect(calculateScore(200, "stable", 5)).toBe(25);
});
test("scores altitude 300", () => {
  expect(calculateScore(300, "stable", 5)).toBe(27.5);
});
// ... 7 more identical tests

// AFTER (remediation - parameterized)
test.each([
  [100, 22.5],
  [200, 25],
  [300, 27.5],
  [400, 30],
  [500, 32.5],
  [1000, 45],
  [2000, 70]  // altitude contribution caps at 50 points
])("altitude %d scores %d points", (altitude, expectedScore) => {
  const score = calculateScore(altitude, "stable", 5);
  expect(score).toBe(expectedScore);
});
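
Tautological assertions follow the same before/after pattern; a hypothetical sketch reusing the scoring example:

// BEFORE (slop - assertions that cannot fail)
test("calculates a score", () => {
  const score = calculateScore(500, "stable", 5);
  expect(score).toBe(score);            // always true
  expect(typeof score).toBe("number");  // existence-only check
});

// AFTER (remediation - asserts required behavior from SCORE-REQ-1.2)
test("score stays within 0-100 bounds", () => {
  const score = calculateScore(500, "stable", 5);
  expect(score).toBeGreaterThanOrEqual(0);
  expect(score).toBeLessThanOrEqual(100);
});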

When to use: AI-generated tests, tests written quickly without review, or a mutation score that looks suspiciously high (tautological tests can produce false confidence). The review is short and often catches weak tests early.

Real examples

Example 1: scoring.ts + scoring.test.ts

41 tests with boundary coverage, mutation-focused.

Test organization:

describe("Score Calculation Requirements", () => {
  describe("SCORE-REQ-1.1: Score calculation basics", () => {
    // 3 tests
  });

  describe("SCORE-REQ-1.2: Score bounds", () => {
    // 8 tests covering min/max/clamp behavior
  });

  describe("SCORE-REQ-1.3: Altitude affects score", () => {
    // 12 tests with parameterized altitude ranges
  });

  describe("SCORE-REQ-1.4: Stability rating affects score", () => {
    // 6 tests for perfect/stable/unstable
  });

  describe("SCORE-REQ-1.5: Landing speed affects score", () => {
    // 7 tests for safe/moderate/excessive speed
  });

  describe("Edge cases", () => {
    // 5 tests for NaN, Infinity, negative, zero
  });
});

Why 41 tests for a simple scoring function? Mutation testing exposed the missing cases.

Mutation testing revealed gaps:

The initial version had fewer tests. Two mutants survived:

// Survivor 1: Boundary mutation
if (altitude < 0) return 0;
// Changed to: if (altitude <= 0) return 0;
// No test checked altitude=0

Added test:

test("altitude of exactly 0 is valid", () => {
  const score = calculateScore(0, "stable", 5);
  expect(score).toBeGreaterThanOrEqual(0);
});
// Survivor 2: Arithmetic mutation
const altitudeScore = Math.min(50, altitude / 40);
// Changed to: const altitudeScore = Math.min(50, altitude / 41);
// No test verified the exact contribution ratio

Added parameterized test:

test.each([
  [0, 0],      // 0 altitude = 0 points
  [40, 1],     // 40 altitude = 1 point
  [400, 10],   // 400 altitude = 10 points
  [2000, 50]   // 2000 altitude = max 50 points
])("altitude %i contributes %i to score", (altitude, expectedContribution) => {
  // Test with neutral stability and speed to isolate altitude contribution
  const score = calculateScore(altitude, "stable", 5);
  const baseScore = 20; // Stable contribution
  expect(score).toBeCloseTo(expectedContribution + baseScore, 0);
});

After iterating, the suite grew to its full 41 tests and both survivors were killed.

What makes these tests mutation-resistant:

  • Boundary value testing (0, 1, max, max+1)
  • Parameterized tests catch arithmetic mutations
  • Edge case coverage (NaN, Infinity, negative)
  • Behavior verification (higher altitude = higher score) catches operator mutations

Example 2: useFlight.ts + useFlight.test.ts

470-line integration test, zero mocks.

No mocks of the hook's internals. The tests render the real hook and drive time through Vitest's fake-timer clock; mocking the hook's state or timing logic would test the mocks, not hook behavior.
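
A sketch of the test setup this implies, using Vitest's fake timers (the hook itself stays unmocked):

import { afterEach, beforeEach, vi } from "vitest";

beforeEach(() => {
  // Deterministic clock for the advanceTimersByTime calls above; the hook is untouched
  vi.useFakeTimers();
});

afterEach(() => {
  vi.useRealTimers();
});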

Structure: every describe starts with "Requirement: [ID] - [description]". 8 requirements × 3-8 tests each.

Timing abstraction: the advanceUntil helper.

advanceUntil(() => result.current.status === "landed");
expect(result.current.status).toBe("landed");

The test passes even if timing details change. Timer intervals and physics step size can change without breaking tests.

Example 3: Infrastructure test

Meta-testing: a config test validates mutation testing configuration. Tests verify the config exists, targets correct files, and excludes test files.

Why? Configuration drift silently breaks mutation testing; a test makes the drift visible early and documents the required structure. The pattern extends to ESLint, TypeScript, and build configs.
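
A sketch of such a test, assuming a Stryker-style stryker.config.json at the repo root with a mutate glob array (file name and fields vary by tool):

// mutation-config.test.ts: guard the mutation setup itself
import { readFileSync } from "node:fs";
import { describe, expect, test } from "vitest";

describe("mutation testing configuration", () => {
  const config = JSON.parse(readFileSync("stryker.config.json", "utf8"));

  test("targets implementation files", () => {
    expect(config.mutate).toContain("src/**/*.ts");
  });

  test("excludes test files from mutation", () => {
    expect(config.mutate).toContain("!src/**/*.test.ts");
  });
});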

When this workflow works

Use it for new features with clear requirements, complex logic (scoring, parsers, state machines), security-critical code (auth, crypto, validation), team collaboration, and refactoring legacy code.

Skip it for trivial changes (typos, constants), obvious fixes, exploratory spikes, rapidly changing requirements, or prototype code.

Adapt it for different tools (pytest + mutmut, JUnit + pitest, go test + go-mutesting), team preferences (mutation-first, property-based), domains (medical devices need higher rigor), and project phases (early development skips mutation, mature products need 90%+ scores).

Principles transfer: context isolation, forcing functions, mutation verification.

Next Steps

Try the workflow on a small feature. Notice where it helps and where it creates friction. Combine with other approaches (mutation-first, property-based) as needed.