eval-harness


SKILL.md Metadata

name: eval-harness
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Claude Code task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test whether Claude can do something it couldn't do before:

[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result

Regression Evals

Ensure changes don't break existing functionality:

[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)

Grader Types

1. Code-Based Grader

Deterministic checks using code:

# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
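
To keep checks like these reusable, they can be split into one-per-file grader scripts and run in a loop. A minimal sketch, assuming graders live under .claude/evals/graders/ and exit 0 on pass (the script name and layout are assumptions, not part of the skill):

#!/usr/bin/env bash
# run-graders.sh (hypothetical): run every code-based grader and tally results.
pass=0; fail=0
for grader in .claude/evals/graders/*.sh; do
  if bash "$grader"; then
    echo "PASS: $grader"; pass=$((pass + 1))
  else
    echo "FAIL: $grader"; fail=$((fail + 1))
  fi
done
echo "Result: $pass passed, $fail failed"
[ "$fail" -eq 0 ]  # exit code mirrors the overall PASS/FAIL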

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
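
In a Claude Code setup, a model grader can be driven headlessly from the shell. A sketch assuming the claude CLI's non-interactive -p mode; the prompt file and diff source are illustrative, not prescribed by the skill:

# Hypothetical invocation: feed the rubric plus the change to Claude
# in print mode (-p) and pull out the verdict line.
git diff HEAD~1 > /tmp/change.diff
claude -p "$(cat grader-prompt.md /tmp/change.diff)" > /tmp/grade.txt
grep -i "^Score:" /tmp/grade.txt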

3. Human Grader

Flag for manual review:

[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
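
A code grader can also decide when to escalate. A sketch that emits the review block whenever a change touches auth-sensitive paths (the paths and the risk rule are illustrative assumptions):

# Hypothetical escalation check: never auto-grade security-sensitive changes.
if git diff --name-only HEAD~1 | grep -qE '^src/(auth|session)/'; then
  echo "[HUMAN REVIEW REQUIRED]"
  echo "Change: $(git log -1 --format=%s)"
  echo "Reason: touches authentication/session code"
  echo "Risk Level: HIGH"
fi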

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
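
Both metrics fall out of a per-attempt log. A minimal sketch, assuming each log line has the form "eval-name attempt-number PASS|FAIL" (the log format is an assumption):

# pass@k: an eval passes if ANY of its first k attempts passed.
# pass^k: an eval passes only if ALL of its first k attempts passed.
k=3
awk -v k="$k" '$2 <= k {
    any[$1] = any[$1] || ($3 == "PASS")
    all[$1] = (($1 in all) ? all[$1] : 1) && ($3 == "PASS")
  }
  END {
    for (e in any) { n++; a += any[e]; b += all[e] }
    printf "pass@%d: %d/%d   pass^%d: %d/%d\n", k, a, n, k, b, n
  }' eval-runs.log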

Eval Workflow

1. Define (Before Coding)

## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals

2. Implement

Write code to pass the defined evals.

3. Evaluate

# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
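
To make step 3 repeatable, both groups can be wrapped in one script per feature that also appends to the run history. A sketch with assumed names; the log location mirrors the storage layout below:

#!/usr/bin/env bash
# check-evals.sh <feature> (hypothetical): run the capability graders, then the
# regression suite, appending one timestamped line per result to the eval log.
feature="$1"
log=".claude/evals/${feature}.log"

for grader in .claude/evals/graders/"${feature}"-*.sh; do
  bash "$grader" && result=PASS || result=FAIL
  echo "$(date -u +%FT%TZ) capability $(basename "$grader") $result" >> "$log"
done

npm test -- --testPathPattern="existing" && result=PASS || result=FAIL
echo "$(date -u +%FT%TZ) regression existing-suite $result" >> "$log"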

4. Report

EVAL REPORT: feature-xyz
========================
Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates an eval definition file at .claude/evals/feature-name.md.

During Implementation

/eval check feature-name

Runs the current evals and reports status.

Post-Implementation

/eval report feature-name

Generates the full eval report.
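
These /eval commands are not built into Claude Code; one plausible wiring is a custom slash command, which Claude Code loads from .claude/commands/. A sketch of what such a command file might contain (the contents are an assumption about how this skill could be wired up):

.claude/commands/eval.md (hypothetical):

  Run the eval workflow for this project. The subcommand and feature
  name are passed as: $ARGUMENTS
  - define <feature>: create .claude/evals/<feature>.md from the eval
    definition template above
  - check <feature>: run the graders for <feature>, recording PASS/FAIL
    in .claude/evals/<feature>.log
  - report <feature>: summarize the log as an EVAL REPORT with pass@k metrics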

Eval Storage

Store evals in the project:

.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
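
baseline.json pins the reference point that regression evals compare against. A sketch of one possible shape; the fields are assumptions, not a fixed schema:

{
  "feature-xyz": {
    "commit": "a1b2c3d",
    "recorded": "2025-01-15T12:00:00Z",
    "tests": {
      "login-flow": "PASS",
      "session-mgmt": "PASS",
      "logout-flow": "PASS"
    }
  }
}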

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

## EVAL: add-authentication

### Phase 1: Define (10 min)

Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report

EVAL REPORT: add-authentication
===============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT