
Quality Engineering with AI Agents

TL;DR

AI-generated code requires review calibrated to its failure modes, not human failure modes. Agents produce confidently incorrect code, make context-boundary errors when reasoning spans multiple files, introduce security anti-patterns that pass superficial review, and co-generate tests that mask the very gaps they should expose. Traditional code review checklists are necessary but insufficient — quality engineering for AI-generated code demands new trust models, specialized review heuristics, and CI gates designed around agent-specific failure signatures.


The Trust Model

Why Agents Occupy a Novel Trust Tier

Agents are neither junior engineers nor compilers. They exhibit a unique combination of traits that demands a new trust model.

Trust Properties Comparison

| Property | Compiler | AI Agent | Junior Engineer | Senior Engineer |
|---|---|---|---|---|
| Deterministic | Yes | No | No | No |
| Explains reasoning | No | Plausible but unreliable | Honest but incomplete | Accurate |
| Asks for clarification | Never | Rarely (proceeds with assumptions) | Frequently | When needed |
| Fails visibly | Always | Rarely (fails silently) | Usually | Usually |
| Accountability | N/A | None | Personal | Personal + team |
| Context window | Unlimited (scoped) | Limited (token-bound) | Growing | Deep |
| Confidence calibration | Perfect | Poor (overconfident) | Varies | Well-calibrated |

The Accountability Gap

```
Human engineer writes buggy code → learns from review, adjusts behavior
AI agent writes buggy code      → repeats identical mistakes next session

Human engineer skips edge case  → "I didn't think of that" (honest)
AI agent skips edge case        → "Here's the comprehensive solution" (confident)
```

This gap means review of AI-generated code must be more rigorous than review of junior engineer code, not less — despite the agent's output appearing more polished.

Calibrated Trust Rules

| Rule | Rationale |
|---|---|
| Trust structure, verify logic | Agents excel at boilerplate and scaffolding |
| Never trust error handling | Agents pattern-match happy paths |
| Never trust security boundaries | Agents optimize for functionality over defense |
| Verify all external API usage | Hallucinated methods and wrong signatures are common |
| Treat co-generated tests as untrusted | Tests written alongside code inherit the same blind spots |
| Trust refactors more than new features | Refactors have existing tests as safety nets |

Failure Mode Taxonomy

Hallucination Categories

| Category | Description | Example | Detection Method | Mitigation |
|---|---|---|---|---|
| API hallucination | Invents methods that do not exist | `response.json().unwrap_or_default()` on a type without that impl | Compilation / type checking | Lock dependency versions, provide API docs in context |
| Version mismatch | Uses syntax or APIs from wrong version | React class components in a hooks-only codebase | Linter rules, version-aware SAST | Pin framework version in system prompt |
| Confident logic error | Produces plausible but incorrect logic | Off-by-one in pagination that passes basic tests | Mutation testing, property-based tests | Require edge-case tests written by humans |
| Context boundary error | Misunderstands cross-file contracts | Calls function with wrong argument order from another module | Integration tests, contract tests | Provide interface definitions in context |
| Security anti-pattern | Introduces vulnerability while solving functional requirement | SQL string interpolation that "works" | SAST scanners (Semgrep [7], CodeQL [8]) | Security-focused review checklist |
| Stale pattern | Uses deprecated approaches | `componentWillMount` in React 18 | Deprecation linters | Keep context current, include migration guides |
| Test co-generation bias | Tests validate the bug, not the spec | Test asserts wrong output as correct | Mutation testing, spec review | Separate test authoring from implementation |
| Silent assumption | Makes undocumented assumptions about environment | Assumes UTC timezone, assumes Linux paths | Environment-specific CI matrix | Require assumption documentation |
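The first category is worth a concrete illustration. Here is a minimal sketch (the `ApiResponse` class is invented for this example, not a real client library): a Rust idiom like `.unwrap_or_default()` pattern-matched into Python fails only when the line executes, which is why compilation or type checking is the listed detection method.

```python
class ApiResponse:
    """Stand-in for an HTTP client response (illustrative, not a real library)."""
    def json(self) -> dict:
        return {"status": "ok"}

resp = ApiResponse()
try:
    # Agent-hallucinated call: unwrap_or_default comes from Rust's Option,
    # not from a Python dict
    resp.json().unwrap_or_default()
except AttributeError as err:
    print(f"caught hallucinated API: {err}")
```

A static checker such as mypy flags this call before it ever runs; without one, a dynamic language discovers it in production.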

Detection Priority Matrix

Root Cause Analysis

Most agent failures trace to three root causes:

```
1. TRAINING DATA BIAS
   Agent learned from Stack Overflow circa 2021
   → Uses patterns that were popular then, not now
   → Mixes conventions from different ecosystems

2. CONTEXT WINDOW LIMITS
   Agent cannot see entire codebase simultaneously
   → Makes assumptions about modules outside its view
   → Duplicates utilities that already exist elsewhere

3. OPTIMIZATION FOR PLAUSIBILITY
   Agent optimizes for "looks correct" over "is correct"
   → Produces syntactically perfect, semantically wrong code
   → Generates tests that confirm the implementation, not the spec
```

Plan-Implement-Review Pipeline

Why Plan Approval is the Highest-Leverage Review Gate

Reviewing a plan takes 5 minutes. Reviewing an incorrect implementation takes 2 hours [15]. Rejecting a plan costs nothing. Rejecting a PR costs the agent's compute and the reviewer's time.

Plan Review Checklist

| Check | Question | Why It Matters |
|---|---|---|
| Scope | Does the plan match the task, no more, no less? | Agents gold-plate and over-engineer |
| File list | Are the files to be modified correct and complete? | Agents miss files they haven't been shown |
| Dependencies | Are new dependencies justified? | Agents add libraries for trivial tasks |
| Approach | Does the approach match team conventions? | Agents use whatever pattern they learned most |
| Edge cases | Are failure modes addressed in the plan? | Agents plan for happy paths |
| Security | Are auth/authz boundaries respected? | Agents skip security by default |
| Testing strategy | Does the plan include test categories? | Agents default to unit tests only |

Structured Plan Format

```markdown
## Plan: [Task Title]

### Goal
One sentence describing the desired outcome.

### Files to Modify
- `src/auth/handler.ts` — Add rate limiting middleware
- `src/auth/handler.test.ts` — Add rate limit test cases

### Files to Create
- `src/middleware/rate-limit.ts` — Rate limiter implementation

### Approach
1. Step one (with rationale)
2. Step two (with rationale)

### Edge Cases
- What happens when X?
- What happens when Y?

### Out of Scope
- Things explicitly not being done

### Dependencies
- None (or list with justification)
```

Anti-Pattern: Skipping Plan Review

```
❌ "Just implement the feature, I trust you"
   → Agent over-engineers the solution
   → Agent modifies files outside the intended scope
   → Agent introduces a dependency for a 3-line utility
   → 2 hours of review to untangle

✅ "Write a plan first, then wait for approval"
   → 5-minute plan review catches scope creep
   → Implementation aligns with team conventions
   → Review focuses on correctness, not approach
```

Reviewing AI Diffs at Scale

What Agents Get Right

| Category | Reliability | Review Depth Needed |
|---|---|---|
| File structure and boilerplate | High | Skim |
| Type definitions and interfaces | High | Skim |
| Import organization | High | Skip (linter handles) |
| Documentation strings | Medium-High | Verify accuracy |
| Standard CRUD operations | Medium-High | Verify edge cases |
| Test scaffolding | Medium | Verify assertions |
| Configuration files | Medium | Verify values |

What Agents Get Wrong

| Category | Failure Rate | Review Depth Needed |
|---|---|---|
| Error handling and recovery | High | Line-by-line |
| Concurrency and race conditions | High | Line-by-line |
| Security boundaries | High | Line-by-line + threat model |
| Cross-module integration | High | Cross-file review |
| Performance under load | Medium-High | Benchmark |
| Edge cases and boundary conditions | Medium-High | Property-based test review |
| Resource cleanup (connections, files) | Medium | Check all code paths |
| Backward compatibility | Medium | Check all callers |
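Resource cleanup is a representative case from the table above: agent output often releases a handle only on the happy path. A sketch of the pattern line-by-line review should catch (the function names are illustrative):

```python
# ❌ Agent-style: the file handle leaks if read() raises
def load_config_leaky(path: str) -> str:
    f = open(path)
    data = f.read()
    f.close()  # never reached on an exception
    return data

# ✅ Cleanup on every code path via a context manager
def load_config(path: str) -> str:
    with open(path) as f:
        return f.read()
```

The same review question applies to database connections, sockets, and locks: is the release guaranteed on the error path, not just the success path?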

Review Checklist by PR Type

New Feature PR

```markdown
- [ ] Plan was reviewed and approved before implementation
- [ ] Feature flag wraps new functionality
- [ ] Error handling covers all failure modes (not just happy path)
- [ ] Input validation exists at trust boundary
- [ ] No hardcoded credentials, URLs, or environment-specific values
- [ ] Tests cover edge cases, not just golden path
- [ ] Tests were NOT generated by the same agent session (or mutation-tested)
- [ ] No unnecessary new dependencies added
- [ ] API contracts match existing conventions
- [ ] Database migrations are reversible
```

Refactoring PR

```markdown
- [ ] Behavior is unchanged (verified by existing tests passing)
- [ ] No new features smuggled into the refactor
- [ ] Performance characteristics preserved
- [ ] All callers updated consistently
- [ ] No orphaned code left behind
- [ ] Import paths updated everywhere
```

Bug Fix PR

```markdown
- [ ] Root cause identified and documented
- [ ] Fix addresses root cause, not symptom
- [ ] Regression test added BEFORE fix (proves the bug existed)
- [ ] Fix does not introduce new edge cases
- [ ] Related code paths checked for similar bugs
- [ ] Agent did not "fix" by rewriting the entire module
```

The 3-Pass Review Strategy

```
Pass 1: STRUCTURAL (2 minutes)
  - Right files modified?
  - Right approach taken?
  - Any unexpected changes?
  → If structural issues found, reject immediately

Pass 2: SEMANTIC (10 minutes)
  - Logic correct for all inputs?
  - Error handling complete?
  - Security boundaries intact?
  → Focus on what agents get wrong (see table above)

Pass 3: INTEGRATION (5 minutes)
  - Does this work with the rest of the system?
  - Any contract violations?
  - Any performance implications?
  → Cross-reference with modules outside the diff
```

Security Auditing

CWE Categories Agents Commonly Produce

AI agents are trained on vast codebases that include vulnerable code. They reproduce security anti-patterns with the same fluency as correct patterns, often without any indication that the output is dangerous. The following CWE categories [6] are among the most frequently reproduced.

CWE-78: OS Command Injection [1]

```python
# ❌ Agent-generated: functional but vulnerable
import subprocess

def convert_image(filename: str, format: str) -> str:
    """Convert image to specified format."""
    output = f"{filename.rsplit('.', 1)[0]}.{format}"
    subprocess.run(f"convert {filename} {output}", shell=True)  # CWE-78
    return output

# ✅ Secure version
import subprocess
from pathlib import Path

ALLOWED_FORMATS = {"png", "jpg", "webp", "gif"}

def convert_image(filename: str, format: str) -> str:
    """Convert image to specified format."""
    if format not in ALLOWED_FORMATS:
        raise ValueError(f"Format must be one of {ALLOWED_FORMATS}")

    input_path = Path(filename)
    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_path}")

    output = str(input_path.with_suffix(f".{format}"))
    subprocess.run(
        ["convert", str(input_path), output],  # No shell=True, list args
        check=True,
        timeout=30,
    )
    return output
```

CWE-89: SQL Injection [2]

```python
# ❌ Agent-generated: works for demo, vulnerable in production
def get_user(db, username: str):
    query = f"SELECT * FROM users WHERE username = '{username}'"  # CWE-89
    return db.execute(query).fetchone()

# ✅ Secure version
def get_user(db, username: str):
    query = "SELECT * FROM users WHERE username = ?"
    return db.execute(query, (username,)).fetchone()
```

CWE-798: Hardcoded Credentials [3]

```python
# ❌ Agent-generated: "for testing" that ships to production
class DatabaseConfig:
    HOST = "db.internal.company.com"
    PORT = 5432
    USER = "admin"
    PASSWORD = "admin123"  # CWE-798

# ✅ Secure version
import os

class DatabaseConfig:
    HOST = os.environ["DB_HOST"]
    PORT = int(os.environ.get("DB_PORT", "5432"))
    USER = os.environ["DB_USER"]
    PASSWORD = os.environ["DB_PASSWORD"]
```

CWE-20: Improper Input Validation [4]

```python
# ❌ Agent-generated: trusts user input
import requests

def set_profile_picture(user_id: int, url: str):
    """Download and set user's profile picture."""
    response = requests.get(url)  # CWE-20: no URL validation
    save_image(user_id, response.content)

# ✅ Secure version
import requests
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"cdn.example.com", "images.example.com"}
MAX_SIZE = 5 * 1024 * 1024  # 5 MB

def set_profile_picture(user_id: int, url: str):
    """Download and set user's profile picture."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("Only HTTPS URLs are allowed")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError("URL host not in allowlist")

    # Stream and enforce the size cap during download, not after
    response = requests.get(url, timeout=10, stream=True)
    content = b""
    for chunk in response.iter_content(chunk_size=8192):
        content += chunk
        if len(content) > MAX_SIZE:
            raise ValueError(f"Image exceeds maximum size of {MAX_SIZE} bytes")

    validate_image_content(content)  # Check magic bytes
    save_image(user_id, content)
```

CWE-276: Incorrect Default Permissions [5]

```python
# ❌ Agent-generated: overly permissive defaults
import os

def create_config_file(path: str, content: str):
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, 0o777)  # CWE-276: world-readable/writable

# ✅ Secure version
import os

def create_config_file(path: str, content: str):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)
```

Automated SAST as Post-Agent Gate

```yaml
# .github/workflows/agent-security-gate.yml
name: Agent Security Gate
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  sast-scan:
    runs-on: ubuntu-latest
    if: contains(github.event.pull_request.labels.*.name, 'ai-generated')
    strategy:
      matrix:
        language: [javascript, python]
    steps:
      - uses: actions/checkout@v4

      - name: Run Semgrep  # [7]; rulesets cover the OWASP Top 10 [14]
        uses: semgrep/semgrep-action@v1
        with:
          config: >-
            p/owasp-top-ten
            p/cwe-top-25
            p/security-audit
          publishToken: ${{ secrets.SEMGREP_APP_TOKEN }}

      - name: Init CodeQL  # [8]
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}

      - name: Run CodeQL Analysis
        uses: github/codeql-action/analyze@v3

      - name: Secret Scanner  # [9]
        uses: trufflesecurity/trufflehog@v3
        with:
          path: .
          extra_args: --only-verified

      - name: Dependency Audit
        run: |
          npm audit --audit-level=high || true
          pip-audit --strict || true
```

Security Review Decision Table

| Signal | Action |
|---|---|
| PR labeled ai-generated + touches auth code | Mandatory security team review |
| PR labeled ai-generated + adds dependency | Dependency review + license check |
| PR labeled ai-generated + modifies infra | Infrastructure team review |
| SAST findings ≥ 1 high severity | Block merge |
| PR labeled ai-generated + no SAST run | Block merge |
| PR modifies .env or secrets config | Block merge + alert |

The Co-Generation Problem

The Fundamental Issue

When an agent writes both the implementation and the tests, the tests validate what the code does, not what the code should do [10]. The tests and the code share the same blind spots.

Example: The Invisible Off-by-One

```python
# Agent-generated implementation
def paginate(items: list, page: int, per_page: int) -> list:
    """Return items for the given page."""
    start = page * per_page        # Bug: page 1 starts at per_page, not 0
    end = start + per_page
    return items[start:end]

# Agent-generated test (passes! but validates the bug)
def test_paginate():
    items = list(range(20))
    assert paginate(items, 0, 5) == [0, 1, 2, 3, 4]     # "page 0" works
    assert paginate(items, 1, 5) == [5, 6, 7, 8, 9]     # "page 1" works
    assert paginate(items, 3, 5) == [15, 16, 17, 18, 19] # "page 3" works
    # Missing: what does the API contract say? Is page 0-indexed or 1-indexed?
    # Missing: empty page, negative page, page beyond bounds
```

The test passes. The implementation is internally consistent. But if the spec says pages are 1-indexed (as most user-facing APIs are), both the code and the test are wrong.

Detection Strategies

1. Mutation Testing

Mutation testing [11] modifies the source code and checks if tests catch the mutations. Co-generated tests often have low mutation kill rates because they test surface behavior, not invariants.

```bash
# Python: mutmut [12]
mutmut run --paths-to-mutate=src/ --tests-dir=tests/

# JavaScript: Stryker [13]
npx stryker run

# Interpret results
# Mutation score < 70% on agent-generated code → tests are weak
# Surviving mutants reveal specific blind spots
```
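To see why co-generated tests score poorly, here is a hand-rolled sketch of the mechanism (this is not mutmut's or Stryker's API, just the idea): flip one operator, re-run the test, and count the mutant as killed only if the test fails.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutation: >= flipped to >

def cogenerated_test(fn) -> bool:
    # Agent-style test: probes values far from the boundary
    return fn(30) is True and fn(5) is False

assert cogenerated_test(is_adult)               # original passes
survived = cogenerated_test(is_adult_mutant)    # True: no test at age == 18
print("mutant survived:", survived)
```

A test that also asserted `fn(18) is True` would kill this mutant; mutation tools automate exactly this search for untested boundaries.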

2. Human-Written Property Tests

Property-based tests [16] encode invariants that must hold for all inputs, not specific input-output pairs.

```python
from hypothesis import given, strategies as st

@given(
    items=st.lists(st.integers()),
    page=st.integers(min_value=1, max_value=100),
    per_page=st.integers(min_value=1, max_value=50),
)
def test_paginate_properties(items, page, per_page):
    result = paginate(items, page, per_page)

    # Property 1: result length ≤ per_page
    assert len(result) <= per_page

    # Property 2: all returned items exist in original list
    for item in result:
        assert item in items

    # Property 3: consecutive pages cover adjacent, non-overlapping slices
    if page > 1:
        prev_page = paginate(items, page - 1, per_page)
        assert prev_page + result == items[(page - 2) * per_page : page * per_page]

    # Property 4: union of all pages equals original list
    all_pages = []
    for p in range(1, (len(items) // per_page) + 2):
        all_pages.extend(paginate(items, p, per_page))
    assert all_pages == items  # This WILL catch the 0-vs-1 indexing bug
```

3. Spec-Derived Test Oracles

Extract test expectations directly from specifications, not from the implementation.

```python
# Oracle derived from API spec, written by human
class PaginationOracle:
    """
    From API spec v2.3:
    - Pages are 1-indexed
    - Page 0 returns 400
    - Page beyond range returns empty list
    - Total count is included in response
    """

    @staticmethod
    def expected_items(total: int, page: int, per_page: int) -> int:
        if page < 1:
            raise ValueError("Page must be ≥ 1")
        start = (page - 1) * per_page
        return max(0, min(per_page, total - start))
```
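A sketch of how such an oracle pairs with the implementation under test (assuming the 1-indexed `paginate` the spec describes; the oracle is restated as a function for brevity, and the concrete counts are illustrative):

```python
def paginate(items: list, page: int, per_page: int) -> list:
    """1-indexed pagination, per the spec."""
    if page < 1:
        raise ValueError("Page must be >= 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]

def expected_items(total: int, page: int, per_page: int) -> int:
    """Oracle: expected page length, derived from the spec, not the code."""
    if page < 1:
        raise ValueError("Page must be >= 1")
    start = (page - 1) * per_page
    return max(0, min(per_page, total - start))

# Compare implementation against the oracle across all pages
items = list(range(17))
for page in range(1, 6):
    assert len(paginate(items, page, 5)) == expected_items(17, page, 5)
```

Because the expected lengths come from the spec rather than the implementation, a 0-indexed `paginate` fails this loop immediately.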

Co-Generation Detection Checklist

| Signal | Indicates Co-Generation Bias |
|---|---|
| All tests pass on first run | Tests may be validating implementation, not spec |
| No edge case tests | Agent tested happy path only |
| Test values mirror implementation constants | Agent copied its own assumptions |
| No error path tests | Agent didn't consider failures |
| Mutation score < 70% | Tests are shallow |
| No property-based tests | No invariant validation |

Quality Gates in CI/CD

Agent-Aware CI Pipeline

```yaml
# .github/workflows/ai-code-quality.yml
name: AI Code Quality Gates
on:
  pull_request:
    types: [opened, synchronize]

env:
  AI_LABEL: "ai-generated"

jobs:
  classify-pr:
    runs-on: ubuntu-latest
    outputs:
      is_ai: ${{ steps.check.outputs.is_ai }}
    steps:
      - id: check
        run: |
          if echo "${{ toJson(github.event.pull_request.labels) }}" | \
             jq -e '.[] | select(.name == "ai-generated")' > /dev/null 2>&1; then
            echo "is_ai=true" >> "$GITHUB_OUTPUT"
          else
            echo "is_ai=false" >> "$GITHUB_OUTPUT"
          fi

  standard-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: npm run lint
      - name: Type Check
        run: npm run typecheck
      - name: Unit Tests
        run: npm test -- --coverage
      - name: Coverage Gate
        run: |
          COVERAGE=$(jq '.total.lines.pct' coverage/coverage-summary.json)
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage $COVERAGE% is below 80% threshold"
            exit 1
          fi

  ai-specific-checks:
    needs: classify-pr
    if: needs.classify-pr.outputs.is_ai == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Mutation Testing
        run: |
          # Stryker exits non-zero when the score drops below the
          # thresholds.break value configured in stryker.config.json
          npx stryker run --reporters clear-text,json

      - name: SAST Scan
        uses: semgrep/semgrep-action@v1
        with:
          config: p/owasp-top-ten

      - name: Dependency Audit
        run: npm audit --audit-level=high

      - name: Check for Hardcoded Secrets
        uses: trufflesecurity/trufflehog@v3
        with:
          extra_args: --only-verified

      - name: Verify Test Independence
        run: |
          # Check that tests were not generated in the same commit as source
          CHANGED_SRC=$(git diff --name-only origin/main... -- 'src/**' ':!src/**/*.test.*')
          CHANGED_TESTS=$(git diff --name-only origin/main... -- '**/*.test.*' '**/*.spec.*')

          if [ -n "$CHANGED_SRC" ] && [ -n "$CHANGED_TESTS" ]; then
            echo "::warning::Source and tests modified in same PR; verify test independence"
          fi

  security-review-gate:
    needs: [classify-pr, ai-specific-checks]
    if: needs.classify-pr.outputs.is_ai == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Detect security-sensitive paths
        id: paths
        run: |
          # changed_files in the event payload is a count, not a file list,
          # so diff against the base branch instead
          if git diff --name-only "origin/${{ github.base_ref }}..." | grep -Eq '(^|/)(auth|security)'; then
            echo "sensitive=true" >> "$GITHUB_OUTPUT"
          fi

      - name: Require Security Team Review
        if: steps.paths.outputs.sensitive == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.pulls.requestReviewers({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number,
              team_reviewers: ['security-team']
            });
```

Gate Progression


The Human-in-the-Loop Spectrum

Not All Tasks Need the Same Oversight

The appropriate level of human involvement depends on the risk of the task and the quality of existing test coverage.

Decision Matrix

Oversight Levels Defined

| Level | Name | When to Use | Human Time | Examples |
|---|---|---|---|---|
| L0 | Autonomous | Trivial, well-tested, reversible | 0 min | Formatting, import sorting |
| L1 | Spot Check | Low risk, high coverage, conventional | 2 min | Doc updates, type additions |
| L2 | Standard Review | Medium risk, standard patterns | 10 min | New CRUD endpoint, refactor |
| L3 | Deep Review | High risk, complex logic | 30 min | Business logic, data pipeline |
| L4 | Pair Review | Critical path, security, compliance | 60+ min | Auth, payments, PII handling |

Selecting the Right Level

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Attributes consulted by the decision tree."""
    risk: str = "low"                  # "low" | "medium" | "high"
    test_coverage: float = 0.0         # 0.0 to 1.0
    touches_auth: bool = False
    touches_payments: bool = False
    handles_pii: bool = False
    is_new_feature: bool = False
    modifies_api_contract: bool = False
    is_formatting_only: bool = False
    is_documentation_only: bool = False

def determine_oversight_level(task: Task) -> str:
    """Decision tree for oversight level selection."""

    # L4: Always full review
    if task.touches_auth or task.touches_payments or task.handles_pii:
        return "L4_PAIR_REVIEW"

    # L3: Deep review for complex or poorly tested areas
    if task.risk == "high" or task.test_coverage < 0.5:
        return "L3_DEEP_REVIEW"

    # L2: Standard review for most feature work
    if task.is_new_feature or task.modifies_api_contract:
        return "L2_STANDARD_REVIEW"

    # L0: Autonomous for trivial changes (checked before L1, which most
    # formatting-only tasks would otherwise match first)
    if task.is_formatting_only or task.is_documentation_only:
        return "L0_AUTONOMOUS"

    # L1: Spot check for safe, well-covered changes
    if task.test_coverage > 0.8 and task.risk == "low":
        return "L1_SPOT_CHECK"

    return "L2_STANDARD_REVIEW"  # Default to standard
```

Anti-Pattern: Uniform Oversight

```
❌ Review every AI PR with the same intensity
   → Reviewer fatigue on trivial PRs
   → Insufficient attention on critical PRs
   → Bottleneck on the review queue

✅ Calibrate oversight to risk × coverage
   → Trivial changes flow through automated gates
   → Critical changes get proportional attention
   → Reviewer energy spent where it matters
```

Metrics

What to Measure

Effective quality engineering for AI-generated code requires tracking metrics that distinguish between human and AI sources to calibrate trust over time.

Defect Density by Source

| Metric | Formula | Target | Action if Exceeded |
|---|---|---|---|
| Agent defect density | Agent bugs / Agent KLOC | ≤ 2× human baseline | Increase review depth |
| Agent security defect density | Agent security bugs / Agent KLOC | ≤ 1× human baseline | Add SAST rules |
| Co-generation escape rate | Bugs in co-generated test+code / Total co-gen PRs | < 5% | Require mutation testing |
| Context boundary error rate | Cross-module bugs / Multi-file PRs | < 10% | Improve context feeding |
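The first row's arithmetic in a few lines (the bug counts and line totals are made-up illustrative values, not benchmarks):

```python
def defect_density(bugs: int, lines_changed: int) -> float:
    """Bugs per thousand lines changed (bugs/KLOC)."""
    return bugs / (lines_changed / 1000)

agent = defect_density(bugs=12, lines_changed=4000)   # 3.0 bugs/KLOC
human = defect_density(bugs=9, lines_changed=6000)    # 1.5 bugs/KLOC

# Target from the table: agent density at most 2x the human baseline
assert agent <= 2 * human  # exactly at the threshold in this example
```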

Review Efficiency

| Metric | Formula | Target |
|---|---|---|
| Review cycle time (AI PRs) | Median time from PR open to merge | < 4 hours |
| Review cycle time (human PRs) | Median time from PR open to merge | < 8 hours |
| Revision rounds (AI PRs) | Average number of review cycles | < 1.5 |
| Revision rounds (human PRs) | Average number of review cycles | < 2.0 |
| Plan rejection rate | Rejected plans / Total plans | 10-30% (too low = rubber stamping) |

Post-Merge Quality

| Metric | Formula | Target | Signal |
|---|---|---|---|
| Post-merge revert rate (AI) | Reverted AI PRs / Total AI PRs | < 3% | Implementation quality |
| Post-merge revert rate (human) | Reverted human PRs / Total human PRs | < 2% | Baseline |
| MTTR for AI bugs | Median time to resolve AI-introduced bugs | < 2 hours | Debuggability |
| Escaped defect ratio | Production bugs from AI / Total production bugs | Proportional to AI contribution % | Systemic issues |

Tracking Dashboard

```sql
-- Example query for tracking AI vs human defect density
SELECT
    pr.source_type,                        -- 'ai_agent' or 'human'
    COUNT(DISTINCT pr.id) AS total_prs,
    SUM(pr.lines_changed) / 1000.0 AS kloc,
    COUNT(DISTINCT bug.id) AS bugs_found,
    COUNT(DISTINCT bug.id) / (SUM(pr.lines_changed) / 1000.0) AS defect_density,
    AVG(pr.review_cycle_hours) AS avg_review_hours,
    SUM(CASE WHEN pr.reverted THEN 1 ELSE 0 END)::float
        / COUNT(DISTINCT pr.id) AS revert_rate
FROM pull_requests pr
LEFT JOIN bugs bug ON bug.source_pr_id = pr.id
WHERE pr.merged_at >= NOW() - INTERVAL '30 days'
GROUP BY pr.source_type;
```

Metric Interpretation Guide

| Observation | Root Cause | Action |
|---|---|---|
| AI defect density > 3× human | Agent lacks domain context | Improve system prompts and context feeding |
| High plan rejection rate (> 50%) | Tasks poorly specified | Improve task specifications |
| Low plan rejection rate (< 5%) | Rubber-stamp reviewing | Add plan review checklist, rotate reviewers |
| Co-gen escape rate increasing | Test quality declining | Mandate mutation testing for co-gen PRs |
| AI revert rate > 5% | Insufficient review depth | Increase oversight level for affected areas |
| AI review cycle time > human | Reviewers struggle with AI diffs | Train reviewers on AI-specific failure modes |

Key Takeaways

  1. Trust is earned, not assumed. AI agents occupy a novel position between compilers and junior engineers. Their fluency masks their unreliability. Calibrate review intensity to their actual failure modes, not their apparent competence.

  2. Plan review is the highest-leverage gate. Five minutes reviewing a plan saves hours reviewing a wrong implementation. Always require plan approval before agent implementation.

  3. Co-generated tests are suspect by default. When the agent writes both code and tests, the tests share the code's blind spots. Use mutation testing and human-written property tests to verify test quality.

  4. Security vulnerabilities are a first-class concern. Agents reproduce CWE patterns (command injection, SQL injection, hardcoded credentials) as fluently as they produce correct code. SAST scanning is mandatory, not optional.

  5. Oversight should be proportional to risk. Not every AI PR needs deep review. Use the risk × coverage matrix to allocate human attention where it has the highest impact.

  6. Measure and compare. Track defect density, revert rate, and review cycle time by source (AI vs human). Use the data to calibrate trust levels and improve the process over time.

  7. Automate the automatable. Linting, type checking, SAST, secret scanning, and mutation testing can all run in CI. Reserve human reviewers for judgment calls that tools cannot make.

  8. The goal is not to eliminate AI code generation. The goal is to build a quality engineering practice that makes AI-generated code as reliable as human-generated code — with appropriate verification at every stage.


References

  1. MITRE — CWE-78: Improper Neutralization of Special Elements used in an OS Command
  2. MITRE — CWE-89: Improper Neutralization of Special Elements used in an SQL Command
  3. MITRE — CWE-798: Use of Hard-coded Credentials
  4. MITRE — CWE-20: Improper Input Validation
  5. MITRE — CWE-276: Incorrect Default Permissions
  6. MITRE — Common Weakness Enumeration (CWE)
  7. Semgrep — Static Analysis at Ludicrous Speed
  8. GitHub — CodeQL Code Scanning
  9. TruffleHog — Find and Verify Credentials
  10. Anthropic — Building Effective Agents: Evaluating AI-Generated Code
  11. Wikipedia — Mutation Testing
  12. mutmut — Python Mutation Testing
  13. Stryker Mutator — JavaScript/TypeScript Mutation Testing
  14. OWASP — Top 10 Web Application Security Risks
  15. Anthropic — Claude Code: Best Practices for Agentic Coding
  16. Hypothesis — Property-Based Testing for Python

Built with depth by the Babushkai community. Released under the MIT License.