npx skills add https://github.com/vercel-labs/agent-eval --skill frontend-design

Install this skill with the CLI and start using the SKILL.md workflow in your workspace.
Test AI coding agents on your framework. Measure what actually works.
You're building a frontend framework and want AI agents to work well with it. But how do you know which techniques actually help?
This framework gives you answers. Run controlled experiments, measure pass rates, compare techniques.
# Create a new eval project
npx @vercel/agent-eval init my-agent-evals
cd my-agent-evals
# Install dependencies
npm install
# Add your API keys
cp .env.example .env
# Edit .env with your AI_GATEWAY_API_KEY and VERCEL_TOKEN
# Preview what will run (no API calls, no cost)
npx @vercel/agent-eval --dry
# Run all experiments
npx @vercel/agent-eval
npx @vercel/agent-eval
With no arguments, the CLI discovers every experiments/*.ts file and runs them all. Each experiment runs in parallel. Results with matching fingerprints are reused automatically (see Result Reuse).
npx @vercel/agent-eval cc
The argument is the experiment filename without .ts. This resolves to experiments/cc.ts.
| Flag | Description |
|---|---|
| `--dry` | Preview what would run without executing. No API calls, no cost. |
| `--smoke` | Quick setup verification. Picks the first eval alphabetically, runs once per model. |
| `--force` | Ignore cached fingerprints and re-run everything. Only applies when running all experiments. |
| `--ack-failures` | Keep non-model failures as final results instead of retrying or deleting them. |
| `--max-retries <n>` | Maximum retry rounds for non-model failures (default: 5). Set to 0 to disable retries. |
Flags work with both modes:
npx @vercel/agent-eval --dry # preview all experiments
npx @vercel/agent-eval cc --dry # preview a single experiment
npx @vercel/agent-eval --smoke # smoke test all experiments
npx @vercel/agent-eval cc --smoke # smoke test one experiment
npx @vercel/agent-eval init <name> # scaffold a new eval project
npx @vercel/agent-eval playground # launch web-based results viewer
npx @vercel/agent-eval playground --watch # live mode (watches for new results)
Each eval tests one specific task an agent should be able to do with your framework.
evals/
  create-button-component/
    PROMPT.md      # Task for the agent
    EVAL.ts        # Tests to verify success (or EVAL.tsx for JSX)
    package.json   # Your framework as a dependency
    src/           # Starter code
PROMPT.md -- what you want the agent to do:
Create a Button component using MyFramework.
Requirements:
- Export a Button component from src/components/Button.tsx
- Accept `label` and `onClick` props
- Use the framework's styling system for hover states
EVAL.ts -- how you verify it worked:
import { test, expect } from 'vitest';
import { readFileSync, existsSync } from 'fs';
import { execSync } from 'child_process';
test('Button component exists', () => {
expect(existsSync('src/components/Button.tsx')).toBe(true);
});
test('has required props', () => {
const content = readFileSync('src/components/Button.tsx', 'utf-8');
expect(content).toContain('label');
expect(content).toContain('onClick');
});
test('project builds', () => {
execSync('npm run build', { stdio: 'pipe' });
});
Use EVAL.tsx when your tests require JSX syntax (React Testing Library, component rendering). You only need one eval file per fixture -- choose .tsx if any test needs JSX.
EVAL.ts tests can assert not just on the files the agent produced, but on how it worked — which shell commands it ran, which files it read, how many tool calls it made, etc. The framework automatically parses the agent's transcript and writes the results to __agent_eval__/results.json in the sandbox before your tests run.
import { test, expect } from 'vitest';
import { readFileSync } from 'fs';
test('agent used the correct scaffolding command', () => {
const results = JSON.parse(readFileSync('__agent_eval__/results.json', 'utf-8'));
const commands = results.o11y.shellCommands.map((c: { command: string }) => c.command);
expect(commands).toContain('npx create-next-app project');
});
test('agent did not make excessive tool calls', () => {
const results = JSON.parse(readFileSync('__agent_eval__/results.json', 'utf-8'));
expect(results.o11y.totalToolCalls).toBeLessThan(50);
});
The results.o11y object is a TranscriptSummary with these fields:
| Field | Type | Description |
|---|---|---|
| `shellCommands` | `{ command, exitCode?, success? }[]` | Shell commands the agent ran |
| `filesRead` | `string[]` | Files the agent read |
| `filesModified` | `string[]` | Files the agent wrote or edited |
| `toolCalls` | `Record<ToolName, number>` | Count of each tool type used |
| `totalToolCalls` | `number` | Total tool calls made |
| `webFetches` | `{ url, method?, status?, success? }[]` | Web fetches made |
| `totalTurns` | `number` | Conversation turns |
| `errors` | `string[]` | Errors encountered |
| `thinkingBlocks` | `number` | Thinking/reasoning blocks |
Note: If the agent's transcript is unavailable (e.g. the agent crashed before producing output), `results.o11y` will be `null`.
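Because of this, EVAL.ts assertions that read `results.o11y` should guard for `null` before touching its fields. A minimal sketch (illustrative, not framework code):

```typescript
// Safe accessor for transcript observability data: returns an empty list
// when results.o11y is null (transcript unavailable), so assertions can
// fail with a clear message instead of a property-access crash.
interface O11yResults {
  o11y: { shellCommands: { command: string }[] } | null;
}

function shellCommandsOf(results: O11yResults): string[] {
  if (results.o11y === null) return []; // no transcript: nothing to inspect
  return results.o11y.shellCommands.map((c) => c.command);
}
```

A test can then assert `shellCommandsOf(results).length > 0` first, which reads better in failure output than a `TypeError` on `null`.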
// experiments/my-experiment.ts
import type { ExperimentConfig } from '@vercel/agent-eval';
const config: ExperimentConfig = {
// Required: which agent to use
agent: 'vercel-ai-gateway/claude-code',
// Model to use (defaults vary by agent)
// Provide an array to run the same experiment across multiple models.
model: 'opus',
// How many times to run each eval (default: 1)
runs: 10,
// Stop after first success? (default: true)
earlyExit: false,
// npm scripts that must pass after agent finishes (default: [])
scripts: ['build', 'lint'],
// Timeout per run in seconds (default: 600)
timeout: 600,
// Filter which evals to run (default: '*' for all)
evals: '*',
// evals: ['specific-eval'],
// evals: (name) => name.startsWith('api-'),
// Setup function for sandbox pre-configuration
setup: async (sandbox) => {
await sandbox.writeFiles({ '.env': 'API_KEY=test' });
await sandbox.runCommand('npm', ['run', 'setup']);
},
// Rewrite the prompt before running
editPrompt: (prompt) => `Use the skill.\n\n${prompt}`,
// Sandbox backend (default: 'auto' -- Vercel if token present, else Docker)
sandbox: 'auto',
// Copy project files to results directory (default: 'none')
// 'none' - don't copy files
// 'changed' - copy only files modified by the agent
// 'all' - copy the entire project including original fixture files
copyFiles: 'changed',
};
export default config;
// Vercel AI Gateway (recommended -- unified billing and observability)
agent: 'vercel-ai-gateway/claude-code' // Claude Code via AI Gateway
agent: 'vercel-ai-gateway/codex' // OpenAI Codex via AI Gateway
agent: 'vercel-ai-gateway/opencode' // OpenCode via AI Gateway
// Direct API (uses provider keys directly)
agent: 'claude-code' // requires ANTHROPIC_API_KEY
agent: 'codex' // requires OPENAI_API_KEY
agent: 'gemini' // requires GEMINI_API_KEY
agent: 'cursor' // requires CURSOR_API_KEY
Provide an array of models to run the same experiment on each one. Results are stored under separate directories (experiment-name/model-name):
const config: ExperimentConfig = {
agent: 'vercel-ai-gateway/claude-code',
model: ['opus', 'sonnet'],
runs: 10,
};
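The per-model layout described above can be illustrated with a tiny helper (hypothetical, for illustration only — `perModelDirs` is not part of the package):

```typescript
// With an array of models, each (experiment, model) pair gets its own
// results directory, e.g. with-mcp/opus and with-mcp/sonnet.
function perModelDirs(experiment: string, model: string | string[]): string[] {
  const models = Array.isArray(model) ? model : [model];
  return models.map((m) => `${experiment}/${m}`);
}
```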
OpenCode uses Vercel AI Gateway exclusively. Models must use the vercel/{provider}/{model} format:
model: 'vercel/anthropic/claude-sonnet-4'
model: 'vercel/openai/gpt-4o'
model: 'vercel/moonshotai/kimi-k2'
model: 'vercel/minimax/minimax-m2.1'
The vercel/ prefix is required. Using anthropic/claude-sonnet-4 (without vercel/) will fail with a "provider not found" error.
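A pre-flight check for this constraint might look like the following (illustrative only — the CLI does not expose this function):

```typescript
// OpenCode model IDs must match vercel/{provider}/{model}; anything else
// fails at runtime with a "provider not found" error.
function isValidOpenCodeModel(model: string): boolean {
  return /^vercel\/[^/]+\/[^/]+$/.test(model);
}
```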
The real power is comparing different approaches. Create multiple experiment configs:
// experiments/control.ts
import type { ExperimentConfig } from '@vercel/agent-eval';
const config: ExperimentConfig = {
agent: 'vercel-ai-gateway/claude-code',
model: 'opus',
runs: 10,
earlyExit: false,
};
export default config;
// experiments/with-mcp.ts
import type { ExperimentConfig } from '@vercel/agent-eval';
const config: ExperimentConfig = {
agent: 'vercel-ai-gateway/claude-code',
model: 'opus',
runs: 10,
earlyExit: false,
setup: async (sandbox) => {
await sandbox.runCommand('npm', ['install', '-g', '@myframework/mcp-server']);
await sandbox.writeFiles({
'.claude/settings.json': JSON.stringify({
mcpServers: { myframework: { command: 'myframework-mcp' } }
})
});
},
};
export default config;
npx @vercel/agent-eval
Compare the results:
control (baseline): 7/10 passed (70%)
with-mcp: 9/10 passed (90%)
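The difference between two pass rates can be computed with a throwaway helper like this (not part of the CLI):

```typescript
// Compare two experiments' pass rates, e.g. control 7/10 vs. with-mcp 9/10.
interface Tally {
  passed: number;
  total: number;
}

function passRateLift(control: Tally, treatment: Tally): number {
  return treatment.passed / treatment.total - control.passed / control.total;
}

// passRateLift({ passed: 7, total: 10 }, { passed: 9, total: 10 }) is ~0.2,
// i.e. a 20-point lift for the MCP treatment.
```

With only 10 runs per arm the confidence interval is wide, so treat small lifts as a signal to run more trials rather than a conclusion.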
| Experiment | Control | Treatment |
|---|---|---|
| MCP impact | No MCP | With MCP server |
| Model comparison | Haiku | Sonnet / Opus |
| Documentation | Minimal docs | Rich examples |
| System prompt | Default | Framework-specific |
| Tool availability | Read/write only | + custom tools |
Results are saved to results/<experiment>/<timestamp>/:
results/
  with-mcp/
    2026-01-27T10-30-00Z/
      create-button/
        summary.json            # Pass rate, fingerprint, classification
        classification.json     # Cached failure classification (if failed)
        run-1/
          result.json           # Individual run result + o11y summary
          transcript.json       # Parsed/structured agent transcript
          transcript-raw.jsonl  # Raw agent output (for debugging)
          outputs/
            eval.txt            # EVAL.ts test output
            scripts/
              build.txt         # npm script output
          project/              # Agent-generated files (if copyFiles is set)
            src/
              Button.tsx        # Files created/modified by the agent
Each eval directory contains a summary.json with:
{
"totalRuns": 2,
"passedRuns": 0,
"passRate": "0%",
"meanDuration": 45.2,
"fingerprint": "a1b2c3...",
"classification": {
"failureType": "infra",
"failureReason": "Rate limited (HTTP 429) — model never ran"
},
"valid": false
}
The fingerprint field enables result reuse across runs. The classification and valid fields appear only for failed evals -- valid: false marks non-model failures so they are not reused by fingerprinting and are automatically retried.
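The reuse decision can be sketched as follows (an illustration of the rule described here, not the framework's actual code):

```typescript
// A cached result is reused only when the fingerprint matches, the result
// is not marked invalid (valid: false = non-model failure), and at least
// one run passed.
interface CachedSummary {
  passedRuns: number;
  fingerprint: string;
  valid?: boolean; // present (false) only for non-model failures
}

function canReuse(cached: CachedSummary, currentFingerprint: string): boolean {
  if (cached.fingerprint !== currentFingerprint) return false;
  if (cached.valid === false) return false; // non-model failure: retried, not reused
  return cached.passedRuns >= 1;
}
```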
Browse results in a web-based dashboard:
npx @vercel/agent-eval playground
This opens a local Next.js app for browsing and comparing results.
Options:
npx @vercel/agent-eval playground --results-dir ./results --evals-dir ./evals --port 3001
By default, the framework only saves test outputs and transcripts. Use the copyFiles config option to also save the files generated by the agent:
const config: ExperimentConfig = {
copyFiles: 'changed', // or 'all' or 'none' (default)
};
Options:
- `none` (default) — Don't copy any project files, only save outputs and transcripts
- `changed` — Copy only files that were modified, created, or deleted by the agent
- `all` — Copy the complete project including both the original fixture files and agent changes

Files are saved to results/<experiment>/<timestamp>/<eval>/run-N/project/. The framework uses git to track changes, so files must be text-based to be captured.
The framework computes a SHA-256 fingerprint for each (eval, config) pair. The fingerprint covers all eval directory files and the config fields that affect results: agent, model, scripts, timeout, earlyExit, and runs.
On subsequent runs, evals with a matching fingerprint and a valid cached result (at least one passing run) are skipped automatically. This means:
- Changing the `evals` filter is safe — the filter is not part of the fingerprint.

Use --force to bypass fingerprinting and re-run everything. Functions like setup and editPrompt cannot be hashed, so use --force when you change those.
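The fingerprint scheme described above can be sketched like this (an illustration, not the framework's actual implementation — field order and file encoding are assumptions):

```typescript
// SHA-256 over the eval's files plus the config fields that affect results:
// agent, model, scripts, timeout, earlyExit, and runs.
import { createHash } from 'node:crypto';

interface FingerprintFields {
  agent: string;
  model: string | string[];
  scripts: string[];
  timeout: number;
  earlyExit: boolean;
  runs: number;
}

function fingerprint(files: Record<string, string>, config: FingerprintFields): string {
  const hash = createHash('sha256');
  // Sort paths so the hash is independent of enumeration order.
  for (const path of Object.keys(files).sort()) {
    hash.update(path).update('\0').update(files[path]).update('\0');
  }
  hash.update(JSON.stringify(config));
  return hash.digest('hex');
}
```

Any change to an eval file or to one of the hashed config fields yields a new fingerprint, which is why cached results are invalidated exactly when the experiment's observable inputs change.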
When evals fail, the framework optionally classifies each failure as either a model failure or a non-model failure (infra or timeout).
Classification uses Claude Sonnet 4.5 via the Vercel AI Gateway with sandboxed read-only tools to inspect result files. This requires AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN to be set.
With the classifier enabled (AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN set): classifications are cached in classification.json, non-model failures are automatically removed during housekeeping (unless --ack-failures is used), and auto-retry handles transient issues. Without either key, classification is skipped; add AI_GATEWAY_API_KEY to .env to enable the classifier.

When the classifier is enabled, the framework automatically retries evals that fail with non-model failures (infra or timeout) — up to 5 rounds by default. Evals that passed or were classified as model failures are cached and skipped on retries. After retries are exhausted, remaining non-model failures are auto-acknowledged as final results. Control this with --max-retries <n> (0 to disable), or use --ack-failures to skip retries and accept failures immediately.
After each experiment completes, the framework automatically runs housekeeping (for example, removing runs that lack a summary.json or transcripts).

Every run requires an API key for the agent and a token for the sandbox. The classifier is optional.
| Variable | Required when | Description |
|---|---|---|
| `AI_GATEWAY_API_KEY` | `vercel-ai-gateway/` agents or classifier | Vercel AI Gateway key — required for `vercel-ai-gateway/` agents and failure classification |
| `ANTHROPIC_API_KEY` | `agent: 'claude-code'` | Direct Anthropic API key |
| `OPENAI_API_KEY` | `agent: 'codex'` | Direct OpenAI API key |
| `GEMINI_API_KEY` | `agent: 'gemini'` | Direct Google Gemini API key |
| `CURSOR_API_KEY` | `agent: 'cursor'` | Direct Cursor API key |
| `VERCEL_TOKEN` | Always (pick one) | Vercel personal access token — for local dev |
| `VERCEL_OIDC_TOKEN` | Always (pick one), or for classifier | Vercel OIDC token — for CI/CD pipelines; also enables the classifier without `AI_GATEWAY_API_KEY` |
The classifier is optional: if neither AI_GATEWAY_API_KEY nor VERCEL_OIDC_TOKEN is set, failure classification is skipped and all results are preserved as-is. Set either key to enable the classifier, which automatically identifies and removes non-model failures (infrastructure errors, rate limits, timeouts).
OpenCode only supports Vercel AI Gateway (vercel-ai-gateway/opencode). There is no direct API option for OpenCode.
The init command generates a .env.example file. Copy it and fill in your keys:
cp .env.example .env
The framework loads .env.local first, then .env as a fallback, via dotenv.
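Conceptually, the precedence works like this sketch (mirroring dotenv's default behavior of keeping the first value seen; `mergeEnv` is illustrative, not framework code):

```typescript
// Merge env-file contents in priority order: pass [.env.local, .env] contents
// and the first file to define a key wins, so .env acts as a fallback.
function mergeEnv(fileContents: string[]): Record<string, string> {
  const env: Record<string, string> = {};
  for (const content of fileContents) {
    for (const line of content.split('\n')) {
      const m = line.match(/^([A-Za-z_][A-Za-z0-9_]*)=(.*)$/);
      if (m && !(m[1] in env)) env[m[1]] = m[2]; // keep first value seen
    }
  }
  return env;
}
```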
One key for all models:
AI_GATEWAY_API_KEY=your-ai-gateway-api-key
VERCEL_TOKEN=your-vercel-token
If you don't have a Vercel account, use provider API keys directly:
ANTHROPIC_API_KEY=sk-ant-... # For Claude Code
OPENAI_API_KEY=sk-proj-... # For Codex
And choose ONE sandbox option (no Vercel key needed):
# Option 1: Use Docker (free, no account needed)
# Just set sandbox: 'docker' in your experiment config, that's it!
# Option 2: Use Vercel (requires free account)
VERCEL_TOKEN=your-vercel-token
Claude Code via direct API with Docker sandbox:
// experiments/my-eval.ts
import type { ExperimentConfig } from '@vercel/agent-eval';
const config: ExperimentConfig = {
agent: 'claude-code', // Direct API (not vercel-ai-gateway/...)
model: 'opus',
runs: 1,
sandbox: 'docker', // No VERCEL_TOKEN needed
};
export default config;
Then just set:
ANTHROPIC_API_KEY=sk-ant-...
That's it! The classifier will be disabled (since you don't have AI_GATEWAY_API_KEY), but all features work fine — you'll just see a warning that non-model failure classification is skipped.
Start with --dry: Always preview before running to verify your config and avoid unexpected costs.
Use --smoke first: Verify API keys, model IDs, and sandbox connectivity before launching a full run.
Use multiple runs: Single runs don't tell you reliability. Use runs: 10 and earlyExit: false for meaningful data.
Isolate variables: Change one thing at a time between experiments. Don't compare "Opus with MCP" to "Haiku without MCP".
Test incrementally: Start with simple tasks, add complexity as you learn what works.
See CONTRIBUTING.md for development workflow and release process.
MIT