
Testing Concepts for AI-Assisted Development

AI can generate tests faster than humans can review them. In practice, many AI-generated tests execute code without checking correctness.

The practical risk is simple: AI tends to document what code does, while humans need tests that specify what code should do. If AI reads implementation before writing tests, existing bugs can become test expectations.

Approaches when using AI

Several approaches work, depending on what you are trying to validate.

  • TDD + mutation (this workshop): write failing tests from requirements, implement to pass, then verify with mutation testing. Works well for new features with clear requirements and strong isolation rules.
  • Mutation-first: implement code, generate mutants, and add tests to kill survivors. Useful for characterizing existing code.
  • Property-based: define invariants and generate randomized inputs. Best for algorithms and data structures.
  • Approval/snapshot: capture current behavior and review diffs. Useful for refactoring and legacy code.
  • REPL-driven: explore behavior interactively, then capture cases as tests. Useful when learning a new domain.

You can mix these; for example, TDD for a feature and property-based tests inside its core algorithm.
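
As a sketch of the property-based style, assuming the fast-check library and the workshop's calculateScore function (the input ranges here are illustrative assumptions, not the actual scoring spec):

import fc from "fast-check";

// Invariant assumed from the workshop's scoring examples: the score
// stays in [0, 100] for any altitude, stability, and speed.
test("score is always between 0 and 100", () => {
  fc.assert(
    fc.property(
      fc.integer({ min: 0, max: 100_000 }),             // altitude
      fc.constantFrom("stable", "unstable", "crashed"), // stability
      fc.integer({ min: 0, max: 50 }),                  // speed
      (altitude, stability, speed) => {
        const score = calculateScore(altitude, stability, speed);
        return score >= 0 && score <= 100;
      },
    ),
  );
});

A single property like this exercises input combinations a hand-written unit suite would miss.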

Why isolation matters

Without checkpoints, AI can add unrequested behavior, miss edge cases, or remove tests that fail.

Context isolation is the main guardrail. Write tests from requirements, then implement after the tests exist. Avoid reading the implementation before writing tests.

Mutation testing then exposes weak verification by revealing survivors that tests did not catch.

Role separation and enforcement guardrails help keep tests from drifting toward implementation details.

Fast feedback keeps the loop tolerable. The examples later in this section show the style in practice.

Forcing Functions

Telling an AI "don't read implementation code" is advisory, and compliance is probabilistic. The model may misinterpret the instruction, overlook it buried in context, or prioritize being "helpful" by reading code to write better tests.

Hard enforcement moves the boundary from the prompt layer to the execution layer. Instead of asking the AI to respect isolation, we mechanically block the actions that would violate it.

flowchart LR
    A[Advisory<br/>Instruction] --> B{Model<br/>Interprets}
    B -->|Understands| C[Follows Rule]
    B -->|Misinterprets| D[Violates Rule]
    B -->|Prioritizes<br/>Helpfulness| D

    E[Mechanical<br/>Enforcement] --> F{Tool Call}
    F --> G[Runtime<br/>Checks Path]
    G -->|Forbidden| H[Access<br/>Denied]
    G -->|Allowed| I[Access<br/>Granted]

    style A fill:#c0392b,color:#fff
    style E fill:#27ae60,color:#fff
    style D fill:#e74c3c,color:#fff
    style H fill:#2ecc71,color:#000

Three-Agent Workflow (Optional)

For situations requiring strict context isolation, the TDD workflow can use three specialized agents with hard path boundaries:

flowchart TD
    A[Requirements] --> B[spec-writer]
    B --> C[Spec]
    C --> D[isolated-test-writer]
    D --> E[Tests RED]
    E --> F[Implementation]
    F --> G[Tests GREEN]
    G --> H[Mutation Testing]
    H --> I{Survivors?}
    I -->|Yes| D
    I -->|No| J[slop-test-reviewer]
    J --> K{Issues?}
    K -->|Yes| D
    K -->|No| L[COMMIT]

    style B fill:#3498db,color:#fff
    style D fill:#3498db,color:#fff
    style J fill:#3498db,color:#fff
    style E fill:#e74c3c,color:#fff
    style G fill:#27ae60,color:#fff
    style L fill:#2ecc71,color:#000

spec-writer

Transforms user stories into testable requirements without reading implementation code.

Path boundaries:

  • ALLOWED: docs/, specs/, requirements/, *.md, domain documentation
  • FORBIDDEN: src/, lib/, app/, packages/, implementation code

Output: Structured specifications using EARS notation (Ubiquitous, Event-driven, State-driven, Unwanted behavior, Optional feature) with Given/When/Then acceptance criteria. Requirements use hierarchical IDs like SCORE-REQ-1.1 for traceability.

Example:

SCORE-REQ-1.3:
  Given: Two rockets with identical stability and speed
  When: One has altitude=1000 and one has altitude=100
  Then: Higher altitude rocket has higher score

isolated-test-writer

Writes tests from requirements alone, maintaining the RED phase of TDD. Tests must fail initially because no implementation exists yet.

Path boundaries:

  • ALLOWED: tests/, specs/, *.test.*, *.spec.*, test utilities, *.d.ts type definitions
  • FORBIDDEN: src/, lib/, app/, core/, services/, models/, handlers/, implementation modules

Output: Failing tests (RED phase) structured by requirement ID with behavioral assertions. Tests verify what code should do, not what it currently does.

Example:

describe("Requirement: SCORE-REQ-1.3 - Higher altitude produces higher score", () => {
  test("should score higher altitude better than lower altitude", () => {
    const lowScore = calculateScore(100, "stable", 5);
    const highScore = calculateScore(1000, "stable", 5);
    expect(highScore).toBeGreaterThan(lowScore);
  });
});

Contamination risk: If the agent reads implementation before writing tests, existing bugs become encoded as test expectations. The path restrictions prevent this mechanically.

Note: This agent is optional. Current models often produce slop (generic or tautological tests) when using strict isolation on autopilot. The agent works best with active human guidance and exceptionally clear requirements. For most scenarios, writing tests directly from specs using TDD principles achieves better quality with less overhead.

slop-test-reviewer

Identifies AI-generated test anti-patterns after tests are written. Detects tests that execute code but don't verify correctness.

Focus: Recently modified test files (unless explicitly instructed to review a broader scope)

Seven detection patterns:

  1. Mock abuse - assertions verify mocked values
  2. Tautologies - tests assert what they just constructed
  3. Existence-only checks - only verifying fields exist without checking values
  4. Implementation mirroring - test logic recapitulates production logic
  5. Happy path only - missing error cases and boundary tests
  6. Copy-paste variations - near-identical tests with different data
  7. Variable amnesia - undefined or inconsistent variable names

Output: Remediation guidance with before/after code examples.

How Path Restrictions Work

Each agent configuration defines ALLOWED and FORBIDDEN path patterns. When the agent attempts to read a file:

  1. Agent requests file access through Read tool
  2. Runtime checks path against FORBIDDEN patterns
  3. If match → access denied with explanation
  4. If no match → check ALLOWED patterns
  5. Access granted only if explicitly allowed

This creates a "pit of success"—the agent cannot accidentally violate isolation even if it wants to be helpful. The boundary is mechanical, not advisory.
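
A minimal TypeScript sketch of this check, assuming simple directory-prefix and extension patterns (the actual runtime uses its own pattern matcher; checkAccess and matches are illustrative names):

type Verdict = { allowed: boolean; reason: string };

// Hypothetical matcher: directory prefixes like "src/" and
// extension globs like "*.md"; a real runtime would use full globs.
function matches(path: string, pattern: string): boolean {
  if (pattern.endsWith("/")) return path.startsWith(pattern);
  if (pattern.startsWith("*")) return path.endsWith(pattern.slice(1));
  return path === pattern;
}

function checkAccess(path: string, forbidden: string[], allowed: string[]): Verdict {
  // FORBIDDEN patterns are checked first and always win.
  for (const p of forbidden) {
    if (matches(path, p)) return { allowed: false, reason: `matches FORBIDDEN ${p}` };
  }
  // Default deny: access is granted only on an explicit ALLOWED match.
  for (const p of allowed) {
    if (matches(path, p)) return { allowed: true, reason: `matches ALLOWED ${p}` };
  }
  return { allowed: false, reason: "no ALLOWED pattern matched" };
}

Note the default-deny at the end: a path that matches neither list is still refused, which is what makes the boundary mechanical rather than advisory.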

graph TB
    subgraph spec["spec-writer"]
        SA["✓ ALLOWED<br/>───────<br/>docs/<br/>specs/<br/>requirements/<br/>*.md"]
        SF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>app/<br/>packages/"]
    end

    subgraph test["isolated-test-writer"]
        TA["✓ ALLOWED<br/>───────<br/>tests/<br/>*.test.*<br/>*.spec.*<br/>*.d.ts"]
        TF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>core/<br/>services/"]
    end

    subgraph review["slop-test-reviewer"]
        RA["✓ SCOPE<br/>───────<br/>Recently modified<br/>test files"]
        RF["✓ DETECTS<br/>───────<br/>7 slop patterns"]
    end

    style spec fill:#2980b9,color:#fff
    style test fill:#8e44ad,color:#fff
    style review fill:#16a085,color:#fff
    style SA fill:#27ae60,color:#fff
    style SF fill:#c0392b,color:#fff
    style TA fill:#27ae60,color:#fff
    style TF fill:#c0392b,color:#fff
    style RA fill:#27ae60,color:#fff
    style RF fill:#f39c12,color:#000

For detailed phase-by-phase implementation, see TDD + Mutation Example Workflow.

Choosing test types

| Type        | Purpose                              | Speed  | Example                            | AI Weakness                       |
| ----------- | ------------------------------------ | ------ | ---------------------------------- | --------------------------------- |
| Unit        | Single function behavior             | Fast   | calculateScore(100, "stable", 5)   | Over-mocks, misses invariants     |
| Integration | Component contracts                  | Medium | API + database + validation        | Weak error cases                  |
| E2E         | User workflows                       | Slow   | Launch rocket, see score, reset    | Brittle selectors                 |
| Property    | Invariants across randomized inputs  | Medium | Score always 0-100                 | Identifying meaningful properties |

Test pyramid guidance: most tests should be unit and integration, with a small number of E2E tests for critical paths. E2E tests provide realism but are slower and more fragile than unit tests, so keep them focused on critical user workflows.

Decision triggers:

  • Verifying business logic? → Unit + property-based
  • Verifying component interactions? → Integration with real dependencies
  • Verifying user experience? → E2E with browser automation
  • Verifying invariants? → Property-based with generators

Common failure modes

Patterns that often show up in AI-written tests:

Mock abuse: Assertions verify mocked values.

const mockUser = { id: 123 };
when(service.get()).thenReturn(mockUser);
const result = service.get();
expect(result.id).toBe(123); // Always passes: asserts the stub, not the code under test

Tautological: Values compared to themselves.

expect(user.name).toBe(user.name); // Useless

Existence-only: Checks fields exist without verifying values.

expect(response).toHaveProperty("userId"); // Weak
expect(response.userId).toBe(expectedId);   // Strong

Implementation mirroring: Test recapitulates code logic.

expect(calc(a, b)).toBe(a * 0.5 + bonus(b)); // Duplicates implementation

Happy path only: Missing error cases, boundary tests.

Copy-paste variations: 50 near-identical tests. Use test.each() for parameterization.
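
A sketch of the parameterized form; the inputs are illustrative, and the 0-100 range invariant is assumed from the workshop's scoring requirements:

// One table-driven test replaces dozens of near-identical copies.
test.each([
  [0, "stable", 0],
  [100, "unstable", 5],
  [1000, "crashed", 10],
])("altitude=%i, %s, speed=%i keeps score in range", (altitude, stability, speed) => {
  const score = calculateScore(altitude, stability, speed);
  expect(score).toBeGreaterThanOrEqual(0);
  expect(score).toBeLessThanOrEqual(100);
});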

Variable amnesia: Undefined variables signal hallucination.

Detection signals:

  • Mock count > assertion count
  • Only toHaveProperty checks
  • No toThrow tests
  • 10+ identical test structures

Remediation details in tdd-example-workflow.md.

Mutation testing basics

Mutation testing introduces deliberate faults. If tests still pass, verification is weak.

Mechanics:

  1. Change < to <= in source
  2. Run the test suite
  3. Tests fail → mutant killed (good)
  4. Tests pass → mutant survived (weak test)
  5. Score = killed / total

Example:

// Original
if (day < 1) return false;

// Mutant
if (day <= 1) return false;

// Weak test (survives)
expect(isValid(0)).toBe(false); // Passes both

// Strong test (kills mutant)
expect(isValid(1)).toBe(true); // Fails mutant

Integration: after tests pass, run mutation testing. Survivors point to missing tests.
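
The mechanics above reduce to a small loop. A sketch with hypothetical applyMutant, revertMutant, and runTests helpers (a real mutation framework handles all of this internally):

type Mutant = { file: string; description: string };

// Hypothetical helpers standing in for the mutation tool's internals.
declare function applyMutant(m: Mutant): void;   // e.g. rewrite < to <=
declare function revertMutant(m: Mutant): void;  // restore the original source
declare function runTests(): Promise<boolean>;   // true if the suite passes

async function mutationScore(mutants: Mutant[]): Promise<number> {
  let killed = 0;
  for (const m of mutants) {
    applyMutant(m);
    const passed = await runTests();
    if (!passed) killed++;          // tests failed → mutant killed (good)
    revertMutant(m);
  }
  return killed / mutants.length;   // score = killed / total
}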

How this fits the workshop

The workflow relies on fast feedback tools (validation concepts) and hard enforcement via hooks to keep tests isolated from implementation.

Workflow skills and specialized subagents keep responsibilities separated and reduce drift. The rules concepts section explains why hard enforcement beats best-effort instructions.

Next Steps

Read TDD + Mutation Example Workflow for phase-by-phase integration.

Examine the raccoon-rocket-lab sandbox for working examples.

Adapt principles (isolation, forcing functions, mutation verification) to your tools and context.