# Testing Concepts for AI-Assisted Development
AI can generate tests faster than humans can review them. In practice, many AI-generated tests execute code without checking correctness.
The practical risk is simple: AI tends to document what code does, while humans need tests that specify what code should do. If AI reads implementation before writing tests, existing bugs can become test expectations.
## Approaches when using AI
Several approaches work, depending on what you are trying to validate.
- TDD + mutation (this workshop): write failing tests from requirements, implement to pass, then verify with mutation testing. Works well for new features with clear requirements and strong isolation rules.
- Mutation-first: implement code, generate mutants, and add tests to kill survivors. Useful for characterizing existing code.
- Property-based: define invariants and generate randomized inputs. Best for algorithms and data structures.
- Approval/snapshot: capture current behavior and review diffs. Useful for refactoring and legacy code.
- REPL-driven: explore behavior interactively, then capture cases as tests. Useful when learning a new domain.
You can mix these; for example, use TDD for a feature and property-based tests for its core algorithm.
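For the property-based style, here is a minimal sketch using the fast-check library (the calculateScore import path is hypothetical; the 0-100 invariant matches the test-type table later in this section):

```typescript
import fc from "fast-check";
import { calculateScore } from "../src/score"; // hypothetical module path

test("property: score stays within 0-100 for any altitude", () => {
  fc.assert(
    fc.property(fc.integer({ min: 0, max: 100_000 }), (altitude) => {
      const score = calculateScore(altitude, "stable", 5);
      return score >= 0 && score <= 100; // the invariant must hold for every generated input
    })
  );
});
```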
## Why isolation matters
Without checkpoints, AI can add unrequested behavior, miss edge cases, or remove tests that fail.
Context isolation is the main guardrail: write tests from requirements, and implement only after the tests exist. Avoid reading the implementation before writing tests.
Mutation testing then exposes weak verification by revealing survivors that tests did not catch.
Role separation and enforcement guardrails help keep tests from drifting toward implementation details.
Fast feedback keeps the loop tolerable. The examples later in this section show the style in practice.
## Forcing Functions
Telling an AI "don't read implementation code" is probabilistic: the model may misinterpret the instruction, overlook it when it is buried in context, or prioritize being "helpful" by reading code to write better tests.
Hard enforcement moves the boundary from the prompt layer to the execution layer. Instead of asking the AI to respect isolation, we mechanically block the actions that would violate it.
```mermaid
flowchart LR
    A[Advisory<br/>Instruction] --> B{Model<br/>Interprets}
    B -->|Understands| C[Follows Rule]
    B -->|Misinterprets| D[Violates Rule]
    B -->|Prioritizes<br/>Helpfulness| D

    E[Mechanical<br/>Enforcement] --> F{Tool Call}
    F --> G[Runtime<br/>Checks Path]
    G -->|Forbidden| H[Access<br/>Denied]
    G -->|Allowed| I[Access<br/>Granted]

    style A fill:#c0392b,color:#fff
    style E fill:#27ae60,color:#fff
    style D fill:#e74c3c,color:#fff
    style H fill:#2ecc71,color:#000
```
## Three-Agent Workflow (Optional)
For situations requiring strict context isolation, the TDD workflow can use three specialized agents with hard path boundaries:
```mermaid
flowchart TD
    A[Requirements] --> B[spec-writer]
    B --> C[Spec]
    C --> D[isolated-test-writer]
    D --> E[Tests RED]
    E --> F[Implementation]
    F --> G[Tests GREEN]
    G --> H[Mutation Testing]
    H --> I{Survivors?}
    I -->|Yes| D
    I -->|No| J[slop-test-reviewer]
    J --> K{Issues?}
    K -->|Yes| D
    K -->|No| L[COMMIT]

    style B fill:#3498db,color:#fff
    style D fill:#3498db,color:#fff
    style J fill:#3498db,color:#fff
    style E fill:#e74c3c,color:#fff
    style G fill:#27ae60,color:#fff
    style L fill:#2ecc71,color:#000
```
### spec-writer
Transforms user stories into testable requirements without reading implementation code.
Path boundaries:
- ALLOWED: docs/, specs/, requirements/, *.md, domain documentation
- FORBIDDEN: src/, lib/, app/, packages/, implementation code
Output: Structured specifications using EARS notation (Easy Approach to Requirements Syntax: Ubiquitous, Event-driven, State-driven, Unwanted behavior, Optional feature) with Given/When/Then acceptance criteria. Requirements use hierarchical IDs like SCORE-REQ-1.1 for traceability.
Example:
```text
SCORE-REQ-1.3:
  Given: Two rockets with identical stability and speed
  When: One has altitude=1000 and one has altitude=100
  Then: Higher altitude rocket has higher score
```
### isolated-test-writer
Writes tests from requirements alone, maintaining the RED phase of TDD. Tests must fail initially because no implementation exists yet.
Path boundaries:
- ALLOWED: tests/, specs/, *.test.*, *.spec.*, test utilities, *.d.ts type definitions
- FORBIDDEN: src/, lib/, app/, core/, services/, models/, handlers/, implementation modules
Output: Failing tests (RED phase) structured by requirement ID with behavioral assertions. Tests verify what code should do, not what it currently does.
Example:
describe("Requirement: SCORE-REQ-1.3 - Higher altitude produces higher score", () => {
test("should score higher altitude better than lower altitude", () => {
const lowScore = calculateScore(100, "stable", 5);
const highScore = calculateScore(1000, "stable", 5);
expect(highScore).toBeGreaterThan(lowScore);
});
});
Contamination risk: If the agent reads implementation before writing tests, existing bugs become encoded as test expectations. The path restrictions prevent this mechanically.
Note: This agent is optional. Current models often produce slop (generic or tautological tests) when using strict isolation on autopilot. The agent works best with active human guidance and exceptionally clear requirements. For most scenarios, writing tests directly from specs using TDD principles achieves better quality with less overhead.
### slop-test-reviewer
Identifies AI-generated test anti-patterns after tests are written. Detects tests that execute code but don't verify correctness.
Focus: recently modified test files (unless explicitly instructed to review a broader scope).
Seven detection patterns:

1. Mock abuse - assertions verify mocked values
2. Tautologies - tests assert what they just constructed
3. Existence-only checks - only verifying fields exist without checking values
4. Implementation mirroring - test logic recapitulates production logic
5. Happy path only - missing error cases and boundary tests
6. Copy-paste variations - near-identical tests with different data
7. Variable amnesia - undefined or inconsistent variable names
Output: Remediation guidance with before/after code examples.
## How Path Restrictions Work
Each agent configuration defines ALLOWED and FORBIDDEN path patterns. When the agent attempts to read a file:
1. Agent requests file access through the Read tool
2. Runtime checks the path against FORBIDDEN patterns
3. If a pattern matches → access denied with explanation
4. If no match → check ALLOWED patterns
5. Access granted only if the path is explicitly allowed
This creates a "pit of success"—the agent cannot accidentally violate isolation even if it wants to be helpful. The boundary is mechanical, not advisory.
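A minimal sketch of the deny-first check in TypeScript, assuming glob patterns are reduced to simple prefix and substring matching (the real hook logic may differ):

```typescript
type AgentPolicy = { allowed: string[]; forbidden: string[] };

// Patterns for the isolated-test-writer agent, simplified from the lists above.
const isolatedTestWriter: AgentPolicy = {
  allowed: ["tests/", "specs/", ".test.", ".spec.", ".d.ts"],
  forbidden: ["src/", "lib/", "core/", "services/"],
};

function matches(path: string, pattern: string): boolean {
  // Directory patterns match path prefixes; everything else matches anywhere.
  return pattern.endsWith("/") ? path.startsWith(pattern) : path.includes(pattern);
}

function checkAccess(path: string, policy: AgentPolicy): "granted" | "denied" {
  if (policy.forbidden.some((p) => matches(path, p))) return "denied"; // deny first
  return policy.allowed.some((p) => matches(path, p)) ? "granted" : "denied"; // default deny
}

checkAccess("src/score.ts", isolatedTestWriter);        // "denied"
checkAccess("tests/score.test.ts", isolatedTestWriter); // "granted"
```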
```mermaid
graph TB
    subgraph spec["spec-writer"]
        SA["✓ ALLOWED<br/>───────<br/>docs/<br/>specs/<br/>requirements/<br/>*.md"]
        SF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>app/<br/>packages/"]
    end

    subgraph test["isolated-test-writer"]
        TA["✓ ALLOWED<br/>───────<br/>tests/<br/>*.test.*<br/>*.spec.*<br/>*.d.ts"]
        TF["✗ FORBIDDEN<br/>───────<br/>src/<br/>lib/<br/>core/<br/>services/"]
    end

    subgraph review["slop-test-reviewer"]
        RA["✓ SCOPE<br/>───────<br/>Recently modified<br/>test files"]
        RF["✓ DETECTS<br/>───────<br/>7 slop patterns"]
    end

    style spec fill:#2980b9,color:#fff
    style test fill:#8e44ad,color:#fff
    style review fill:#16a085,color:#fff
    style SA fill:#27ae60,color:#fff
    style SF fill:#c0392b,color:#fff
    style TA fill:#27ae60,color:#fff
    style TF fill:#c0392b,color:#fff
    style RA fill:#27ae60,color:#fff
    style RF fill:#f39c12,color:#000
```
For detailed phase-by-phase implementation, see TDD + Mutation Example Workflow.
## Choosing test types
| Type | Purpose | Speed | Example | AI Weakness |
|---|---|---|---|---|
| Unit | Single function behavior | Fast | `calculateScore(100, "stable", 5)` | Over-mocks, misses invariants |
| Integration | Component contracts | Medium | API + database + validation | Weak error cases |
| E2E | User workflows | Slow | Launch rocket, see score, reset | Brittle selectors |
| Property | Invariants across randomized inputs | Medium | Score always 0-100 | Identifying meaningful properties |
Test pyramid guidance: most tests should be unit and integration, with a small number of E2E tests for critical paths. E2E tests provide realism but are slower and more fragile than unit tests, so keep them focused on critical user workflows.
Decision triggers:

- Verifying business logic? → Unit + property-based
- Verifying component interactions? → Integration with real dependencies
- Verifying user experience? → E2E with browser automation
- Verifying invariants? → Property-based with generators
## Common failure modes
Patterns that often show up in AI-written tests:
Mock abuse: Assertions verify mocked values.
```typescript
const mockUser = { id: 123 };
when(service.get()).thenReturn(mockUser);
const result = service.get();  // returns the mock configured above
expect(result.id).toBe(123);   // Always passes
```
Tautological: Values compared to themselves.
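A hypothetical instance of the pattern, where the assertion can only restate its own setup:

```typescript
// Tautological: the test constructs the value it then asserts against.
const expected = { altitude: 1000, score: 87 };
const result = { ...expected };            // no code under test actually ran
expect(result.score).toBe(expected.score); // can never fail
```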
Existence-only: Checks fields exist without verifying values.
```typescript
expect(response).toHaveProperty("userId"); // Weak
expect(response.userId).toBe(expectedId); // Strong
```
Implementation mirroring: Test recapitulates code logic.
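For illustration (the formula below is invented, not the workshop's real scoring logic), a mirrored test re-derives its expected value the same way the implementation does:

```typescript
// Implementation mirroring: "expected" is computed with the same (assumed)
// formula as production code, so a shared bug passes in both places.
const altitude = 1000;
const expected = Math.min(100, altitude * 0.08 + 5 * 2); // copied from production
expect(calculateScore(altitude, "stable", 5)).toBe(expected);
```

Prefer asserting a known-good constant or a behavioral relationship (e.g., higher altitude scores higher).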
Happy path only: Missing error cases, boundary tests.
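For contrast, a sketch that adds error and boundary cases; the rejection of negative altitude is an assumed contract here, not confirmed workshop behavior:

```typescript
// Happy path only — says little about the contract's edges:
expect(calculateScore(1000, "stable", 5)).toBeGreaterThan(0);

// Error and boundary cases pin the contract down
// (assumes negative altitude is rejected):
expect(() => calculateScore(-1, "stable", 5)).toThrow();
expect(calculateScore(0, "stable", 5)).toBeGreaterThanOrEqual(0);
```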
Copy-paste variations: 50 near-identical tests. Use test.each() for parameterization.
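For example, a run of near-identical cases collapses into one parameterized Jest test:

```typescript
test.each([
  [100, 5],
  [1_000, 5],
  [5_000, 10],
])("altitude %i at speed %i scores within 0-100", (altitude, speed) => {
  const score = calculateScore(altitude, "stable", speed);
  expect(score).toBeGreaterThanOrEqual(0);
  expect(score).toBeLessThanOrEqual(100);
});
```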
Variable amnesia: Undefined variables signal hallucination.
Detection:

- Mock count > assertion count
- Only `toHaveProperty` checks
- No `toThrow` tests
- 10+ identical test structures
Remediation details in tdd-example-workflow.md.
## Mutation testing basics
Mutation testing introduces deliberate faults (mutants) into the source. If the tests still pass with a fault in place, verification is weak.
Mechanics:
1. Change < to <= in source
2. Run test suite
3. Tests fail → mutant killed (good)
4. Tests pass → mutant survived (weak test)
5. Score = killed / total
Example:
```typescript
// Original
if (day < 1) return false;

// Mutant
if (day <= 1) return false;

// Weak test (survives)
expect(isValid(0)).toBe(false); // Passes both

// Strong test (kills mutant)
expect(isValid(1)).toBe(true); // Fails mutant
```
Integration: after tests pass, run mutation testing. Survivors point to missing tests.
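As one concrete option (an assumption; the workshop does not mandate a specific tool), StrykerJS automates this loop for JS/TS projects. A minimal config sketch:

```javascript
// stryker.config.mjs — minimal StrykerJS sketch, assuming a Jest test suite.
export default {
  mutate: ["src/**/*.ts"],     // files to generate mutants for
  testRunner: "jest",          // kill mutants with the existing Jest suite
  coverageAnalysis: "perTest", // rerun only the tests that cover each mutant
  reporters: ["clear-text", "progress"],
};
```

Run it with `npx stryker run`; surviving mutants in the report mark the tests still to write.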
## How this fits the workshop
The workflow relies on fast feedback tools (validation concepts) and hard enforcement via hooks (hooks) to keep tests isolated from implementation.
Workflow skills (skills) and specialized subagents keep responsibilities separated and reduce drift. The rules concepts section explains why hard enforcement beats best-effort instructions.
## Next Steps
- Read TDD + Mutation Example Workflow for phase-by-phase integration.
- Examine the raccoon-rocket-lab sandbox for working examples.
- Adapt the principles (isolation, forcing functions, mutation verification) to your tools and context.