
Testing Concepts for AI-Assisted Development

AI can generate tests faster than humans can review them. In practice, many AI-generated tests execute code without checking correctness.

The practical risk is simple: AI tends to document what code does, while humans need tests that specify what code should do. If AI reads implementation before writing tests, existing bugs can become test expectations.

Approaches when using AI

Several approaches work, depending on what you are trying to validate.

  • TDD + mutation (this workshop): write failing tests from requirements, implement to pass, then verify with mutation testing. Works well for new features with clear requirements and strong isolation rules.
  • Mutation-first: implement code, generate mutants, and add tests to kill survivors. Useful for characterizing existing code.
  • Property-based: define invariants and generate randomized inputs. Best for algorithms and data structures.
  • Approval/snapshot: capture current behavior and review diffs. Useful for refactoring and legacy code.
  • REPL-driven: explore behavior interactively, then capture cases as tests. Useful when learning a new domain.

You can mix these; for example, TDD for a feature and property-based tests inside its core algorithm.
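
As a sketch of the property-based style, assuming the fast-check library and the workshop's calculateScore function (the input ranges here are illustrative assumptions, not the actual scoring spec):

import fc from "fast-check";

// Invariant assumed from the workshop's scoring examples: the score
// stays in [0, 100] for any altitude, stability, and speed.
test("score is always between 0 and 100", () => {
  fc.assert(
    fc.property(
      fc.integer({ min: 0, max: 100_000 }),             // altitude
      fc.constantFrom("stable", "unstable", "crashed"), // stability
      fc.integer({ min: 0, max: 50 }),                  // speed
      (altitude, stability, speed) => {
        const score = calculateScore(altitude, stability, speed);
        return score >= 0 && score <= 100;
      },
    ),
  );
});

A single property like this exercises input combinations a hand-written unit suite would miss.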

Why isolation matters

Without checkpoints, AI can add unrequested behavior, miss edge cases, or remove tests that fail.

Context isolation is the main guardrail. Write tests from requirements, then implement after the tests exist. Avoid reading the implementation before writing tests.

Mutation testing then exposes weak verification by revealing survivors that tests did not catch.

Role separation and enforcement guardrails help keep tests from drifting toward implementation details.

Fast feedback keeps the loop tolerable. The examples later in this section show the style in practice.

Forcing Functions

Telling an AI "don't read implementation code" is advisory, and compliance is probabilistic. The model may misinterpret the instruction, overlook it buried in context, or prioritize being "helpful" by reading code to write better tests.

Hard enforcement moves the boundary from the prompt layer to the execution layer. Instead of asking the AI to respect isolation, we mechanically block the actions that would violate it.

flowchart LR
    A[Advisory<br/>Instruction] --> B{Model<br/>Interprets}
    B -->|Understands| C[Follows Rule]
    B -->|Misinterprets| D[Violates Rule]
    B -->|Prioritizes<br/>Helpfulness| D

    E[Mechanical<br/>Enforcement] --> F{Tool Call}
    F --> G[Runtime<br/>Checks Path]
    G -->|Forbidden| H[Access<br/>Denied]
    G -->|Allowed| I[Access<br/>Granted]

    style A fill:#c0392b,color:#fff
    style E fill:#27ae60,color:#fff
    style D fill:#e74c3c,color:#fff
    style H fill:#2ecc71,color:#000

Three-Agent Workflow (Optional)

For situations requiring strict context isolation, the TDD workflow can use three specialized agents with hard path boundaries:

flowchart TD
    A[Requirements] --> B[spec-writer]
    B --> C[Spec]
    C --> D[isolated-test-writer]
    D --> E[Tests RED]
    E --> F[Implementation]
    F --> G[Tests GREEN]
    G --> H[Mutation Testing]
    H --> I{Survivors?}
    I -->|Yes| D
    I -->|No| J[slop-test-reviewer]
    J --> K{Issues?}
    K -->|Yes| D
    K -->|No| L[COMMIT]

    style B fill:#3498db,color:#fff
    style D fill:#3498db,color:#fff
    style J fill:#3498db,color:#fff
    style E fill:#e74c3c,color:#fff
    style G fill:#27ae60,color:#fff
    style L fill:#2ecc71,color:#000

spec-writer

Transforms user stories into testable requirements without reading implementation code.

Path boundaries:

  • ALLOWED: docs/, specs/, requirements/, *.md, domain documentation
  • FORBIDDEN: src/, lib/, app/, packages/, implementation code

Output: Structured specifications using EARS notation (Ubiquitous, Event-driven, State-driven, Unwanted behavior, Optional feature) with Given/When/Then acceptance criteria. Requirements use hierarchical IDs like SCORE-REQ-1.1 for traceability.

Example:

SCORE-REQ-1.3:
  Given: Two rockets with identical stability and speed
  When: One has altitude=1000 and one has altitude=100
  Then: Higher altitude rocket has higher score

isolated-test-writer

Writes tests from requirements alone, maintaining the RED phase of TDD. Tests must fail initially because no implementation exists yet.

Path boundaries:

  • ALLOWED: tests/, specs/, *.test.*, *.spec.*, test utilities, *.d.ts type definitions
  • FORBIDDEN: src/, lib/, app/, core/, services/, models/, handlers/, implementation modules

Output: Failing tests (RED phase) structured by requirement ID with behavioral assertions. Tests verify what code should do, not what it currently does.

Example:

describe("Requirement: SCORE-REQ-1.3 - Higher altitude produces higher score", () => {
  test("should score higher altitude better than lower altitude", () => {
    const lowScore = calculateScore(100, "stable", 5);
    const highScore = calculateScore(1000, "stable", 5);
    expect(highScore).toBeGreaterThan(lowScore);
  });
});

Contamination risk: If the agent reads implementation before writing tests, existing bugs become encoded as test expectations. The path restrictions prevent this mechanically.

Note: This agent is optional. Current models often produce slop (generic or tautological tests) when using strict isolation on autopilot. The agent works best with active human guidance and exceptionally clear requirements. For most scenarios, writing tests directly from specs using TDD principles achieves better quality with less overhead.

slop-test-reviewer

Identifies AI-generated test anti-patterns after tests are written. Detects tests that execute code but don't verify correctness.

Focus: Recently modified test files (unless explicitly instructed to review a broader scope)

Seven detection patterns:

  1. Mock abuse - assertions verify mocked values
  2. Tautologies - tests assert what they just constructed
  3. Existence-only checks - only verifying fields exist without checking values
  4. Implementation mirroring - test logic recapitulates production logic
  5. Happy path only - missing error cases and boundary tests
  6. Copy-paste variations - near-identical tests with different data
  7. Variable amnesia - undefined or inconsistent variable names

Output: Remediation guidance with before/after code examples.

How Path Restrictions Work

Each agent configuration defines ALLOWED and FORBIDDEN path patterns. When the agent attempts to read a file:

  1. Agent requests file access through Read tool
  2. Runtime checks path against FORBIDDEN patterns
  3. If match → access denied with explanation
  4. If no match → check ALLOWED patterns
  5. Access granted only if explicitly allowed

This creates a "pit of success"—the agent cannot accidentally violate isolation even if it wants to be helpful. The boundary is mechanical, not advisory.
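
A minimal TypeScript sketch of this check, assuming simple directory-prefix and extension patterns (the actual runtime uses its own pattern matcher; checkAccess and matches are illustrative names):

type Verdict = { allowed: boolean; reason: string };

// Hypothetical matcher: directory prefixes like "src/" and
// extension globs like "*.md"; a real runtime would use full globs.
function matches(path: string, pattern: string): boolean {
  if (pattern.endsWith("/")) return path.startsWith(pattern);
  if (pattern.startsWith("*")) return path.endsWith(pattern.slice(1));
  return path === pattern;
}

function checkAccess(path: string, forbidden: string[], allowed: string[]): Verdict {
  // FORBIDDEN patterns are checked first and always win.
  for (const p of forbidden) {
    if (matches(path, p)) return { allowed: false, reason: `matches FORBIDDEN ${p}` };
  }
  // Default deny: access is granted only on an explicit ALLOWED match.
  for (const p of allowed) {
    if (matches(path, p)) return { allowed: true, reason: `matches ALLOWED ${p}` };
  }
  return { allowed: false, reason: "no ALLOWED pattern matched" };
}

Note the default-deny at the end: a path that matches neither list is still refused, which is what makes the boundary mechanical rather than advisory.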

graph TB
    subgraph spec["spec-writer"]
        SA["✓ ALLOWED<br/>───────<br/>docs/<br/>specs/<br/>requirements/<br/>*.md"]
        SF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>app/<br/>packages/"]
    end

    subgraph test["isolated-test-writer"]
        TA["✓ ALLOWED<br/>───────<br/>tests/<br/>*.test.*<br/>*.spec.*<br/>*.d.ts"]
        TF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>core/<br/>services/"]
    end

    subgraph review["slop-test-reviewer"]
        RA["✓ SCOPE<br/>───────<br/>Recently modified<br/>test files"]
        RF["✓ DETECTS<br/>───────<br/>7 slop patterns"]
    end

    style spec fill:#2980b9,color:#fff
    style test fill:#8e44ad,color:#fff
    style review fill:#16a085,color:#fff
    style SA fill:#27ae60,color:#fff
    style SF fill:#c0392b,color:#fff
    style TA fill:#27ae60,color:#fff
    style TF fill:#c0392b,color:#fff
    style RA fill:#27ae60,color:#fff
    style RF fill:#f39c12,color:#000

For detailed phase-by-phase implementation, see TDD + Mutation Example Workflow.

Choosing test types

| Type        | Purpose                              | Speed  | Example                            | AI Weakness                       |
| ----------- | ------------------------------------ | ------ | ---------------------------------- | --------------------------------- |
| Unit        | Single function behavior             | Fast   | calculateScore(100, "stable", 5)   | Over-mocks, misses invariants     |
| Integration | Component contracts                  | Medium | API + database + validation        | Weak error cases                  |
| E2E         | User workflows                       | Slow   | Launch rocket, see score, reset    | Brittle selectors                 |
| Property    | Invariants across randomized inputs  | Medium | Score always 0-100                 | Identifying meaningful properties |

Test pyramid guidance: most tests should be unit and integration, with a small number of E2E tests for critical paths. E2E tests provide realism but are slower and more fragile than unit tests, so keep them focused on critical user workflows.

Decision triggers:

  • Verifying business logic? → Unit + property-based
  • Verifying component interactions? → Integration with real dependencies
  • Verifying user experience? → E2E with browser automation
  • Verifying invariants? → Property-based with generators

Common failure modes

Patterns that often show up in AI-written tests:

Mock abuse: Assertions verify mocked values.

const mockUser = { id: 123 };
when(service.get()).thenReturn(mockUser);
const result = service.get();
expect(result.id).toBe(123); // Always passes: asserts the stub, not the code under test

Tautological: Values compared to themselves.

expect(user.name).toBe(user.name); // Useless

Existence-only: Checks fields exist without verifying values.

expect(response).toHaveProperty("userId"); // Weak
expect(response.userId).toBe(expectedId);   // Strong

Implementation mirroring: Test recapitulates code logic.

expect(calc(a, b)).toBe(a * 0.5 + bonus(b)); // Duplicates implementation

Happy path only: Missing error cases, boundary tests.

Copy-paste variations: 50 near-identical tests. Use test.each() for parameterization.
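
A sketch of the parameterized form; the inputs are illustrative, and the 0-100 range invariant is assumed from the workshop's scoring requirements:

// One table-driven test replaces dozens of near-identical copies.
test.each([
  [0, "stable", 0],
  [100, "unstable", 5],
  [1000, "crashed", 10],
])("altitude=%i, %s, speed=%i keeps score in range", (altitude, stability, speed) => {
  const score = calculateScore(altitude, stability, speed);
  expect(score).toBeGreaterThanOrEqual(0);
  expect(score).toBeLessThanOrEqual(100);
});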

Variable amnesia: Undefined variables signal hallucination.

Detection signals:

  • Mock count > assertion count
  • Only toHaveProperty checks
  • No toThrow tests
  • 10+ identical test structures

Remediation details in tdd-example-workflow.md.

Mutation testing basics

Mutation testing introduces deliberate faults. If tests still pass, verification is weak.

Mechanics:

  1. Change < to <= in source
  2. Run the test suite
  3. Tests fail → mutant killed (good)
  4. Tests pass → mutant survived (weak test)
  5. Score = killed / total

Example:

// Original
if (day < 1) return false;

// Mutant
if (day <= 1) return false;

// Weak test (survives)
expect(isValid(0)).toBe(false); // Passes both

// Strong test (kills mutant)
expect(isValid(1)).toBe(true); // Fails mutant

Integration: after tests pass, run mutation testing. Survivors point to missing tests.
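
The mechanics above reduce to a small loop. A sketch with hypothetical applyMutant, revertMutant, and runTests helpers (a real mutation framework handles all of this internally):

type Mutant = { file: string; description: string };

// Hypothetical helpers standing in for the mutation tool's internals.
declare function applyMutant(m: Mutant): void;   // e.g. rewrite < to <=
declare function revertMutant(m: Mutant): void;  // restore the original source
declare function runTests(): Promise<boolean>;   // true if the suite passes

async function mutationScore(mutants: Mutant[]): Promise<number> {
  let killed = 0;
  for (const m of mutants) {
    applyMutant(m);
    const passed = await runTests();
    if (!passed) killed++;          // tests failed → mutant killed (good)
    revertMutant(m);
  }
  return killed / mutants.length;   // score = killed / total
}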

How this fits the workshop

The workflow relies on fast feedback tools (validation concepts) and hard enforcement via hooks to keep tests isolated from implementation.

Workflow skills and specialized subagents keep responsibilities separated and reduce drift. The rules concepts section explains why hard enforcement beats best-effort instructions.

Next Steps

Read TDD + Mutation Example Workflow for phase-by-phase integration.

Examine the raccoon-rocket-lab sandbox for working examples.

Adapt principles (isolation, forcing functions, mutation verification) to your tools and context.