Public repository for Datadog Agent Skills
Install this skill with the skills CLI and start using the SKILL.md workflow in your workspace:

```shell
npx skills add https://github.com/datadog-labs/agent-skills --skill dd-pup
```
Datadog skills for Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, OpenCode, and other AI agents.
| Skill | Description |
|---|---|
| dd-pup | Primary CLI - commands, auth, PATH setup |
| dd-monitors | Create, manage, mute monitors |
| dd-logs | Search logs |
| dd-apm | Traces, services, performance |
| dd-docs | Search Datadog documentation |
| dd-llmo | LLM Observability: experiments, eval RCA, evaluator generation, session classification |
```shell
# Homebrew (macOS/Linux) — recommended
brew tap datadog-labs/pack
brew install datadog-labs/pack/pup

# Or build from source
git clone https://github.com/datadog-labs/pup.git && cd pup
cargo build --release
cp target/release/pup ~/.local/bin
```
Pre-built binaries are also available from the latest release.
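The build-from-source step copies `pup` into `~/.local/bin`, which assumes that directory is on your `PATH`. If it is not, a snippet like this in your shell profile (`~/.bashrc` or `~/.zshrc`) adds it idempotently:

```shell
# Prepend ~/.local/bin to PATH only if it's not already there.
case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;                      # already on PATH, do nothing
  *) PATH="$HOME/.local/bin:$PATH"; export PATH ;;
esac
```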
```shell
# Authenticate
pup auth login
```
For just dd-pup:

```shell
npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --full-depth -y
```
For all skills:

```shell
npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --skill dd-monitors \
  --skill dd-logs \
  --skill dd-apm \
  --skill dd-docs \
  --full-depth -y
```
The dd-llmo directory contains four skills for working with LLM Observability data:
| Skill | Purpose |
|---|---|
| experiment-analyzer | Analyze and compare offline LLM experiments |
| eval-trace-rca | Root-cause production failures using eval judge signal or runtime errors |
| eval-bootstrap | Generate evaluator code from traces, optionally seeded by RCA output |
| eval-session-classify | Classify whether user intent was satisfied in a session (trace + RUM signals) |
Eval pipeline flow:

```
eval-session-classify     eval-trace-rca  →  eval-bootstrap
 (classify sessions)       (diagnose why)     (build evals)
```
Run eval-trace-rca to understand why an app is failing by analyzing eval judge verdicts or
runtime errors across production traces. Then run eval-bootstrap to generate evaluator code
that captures those failure patterns. Pass the RCA output directly to eval-bootstrap to seed
it with the discovered failure taxonomy.
Use eval-session-classify independently to evaluate whether individual assistant sessions
satisfied user intent, combining LLM Obs trace data with RUM behavioral signals.
```shell
# Claude Code — copy any or all skills
cp -r dd-llmo/experiment-analyzer ~/.claude/skills
cp -r dd-llmo/eval-trace-rca ~/.claude/skills
cp -r dd-llmo/eval-bootstrap ~/.claude/skills
cp -r dd-llmo/eval-session-classify ~/.claude/skills
```
All four skills require the LLMO toolset:
```shell
claude mcp add --scope user --transport http "datadog-llmo-mcp" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
```
experiment-analyzer uses the core toolset for notebook export (optional). eval-session-classify
requires it for RUM behavioral analysis and efficient batched fetches of trace session spans:
```shell
claude mcp add --scope user --transport http "datadog-mcp-core" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core'
```
```
# Analyze experiments
experiment-analyzer <experiment_id>                        # single experiment
experiment-analyzer <baseline_id> <candidate_id>           # compare two experiments
experiment-analyzer <id(s)> <question>                     # ask a specific question
experiment-analyzer <id(s)> [question] --output notebook   # export to Datadog notebook
```
```
# Root-cause why an app is failing
What's wrong with <ml_app> based on its evals over the last 24h
Analyze eval failures for <eval_name> over the last week
Look at the errors on <ml_app> over the last 24h
```
```
# Generate evaluator code from production traces
/eval-bootstrap <ml_app>                                       # cold start
/eval-bootstrap <ml_app> [paste eval-trace-rca output here]    # seeded from RCA
/eval-bootstrap <ml_app> --data-only                           # emit JSON spec instead of Python SDK code

# Classify a session
/eval-session-classify <session_id>
```
| Task | Command |
|---|---|
| Search error logs | pup logs search --query "status:error" --from 1h |
| List monitors | pup monitors list |
| Schedule monitor downtime | pup downtime create --file downtime.json |
| Find slow traces | pup traces search --query "service:api @duration:>500ms" --from 1h |
| Query metrics | pup metrics query --query "avg:system.cpu.user{*}" |
| List APM services (env required) | pup apm services list --env <env> --from 1h --to now |
| Check auth | pup auth status |
| Refresh token | pup auth refresh |
More commands for pup are found in the official pup docs.
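As a sketch of how these commands compose, here is a hypothetical `triage` helper (the function name and the default service name are ours; the `pup` invocations are the ones from the table above):

```shell
# triage SERVICE — check auth, then pull recent error logs and slow
# traces for that service. Exits early if authentication is stale.
triage() {
  svc="${1:-api}"                         # placeholder default service name
  pup auth status || return 1             # bail out if not authenticated
  pup logs search --query "status:error service:$svc" --from 1h
  pup traces search --query "service:$svc @duration:>500ms" --from 1h
}
```

Run it as `triage my-service`; each step's output prints in order, and a failed auth check stops the rest.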
```shell
# Check auth first (includes token time remaining)
pup auth status

# If commands fail with 401/403, try refresh first
pup auth refresh

# If refresh fails or no session exists, do full OAuth login
pup auth login

# Non-default site/org
pup auth login --site datadoghq.eu --org <org>
```
If the browser opens the wrong profile/window, use the one-time URL printed by pup auth login and open it manually in the correct session.
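The fallback order above can be wrapped in a small helper, trying the cheapest check first (the `ensure_auth` name is ours; the subcommands are pup's):

```shell
# ensure_auth — status, then silent refresh, then full OAuth login.
ensure_auth() {
  pup auth status  && return 0   # token still valid
  pup auth refresh && return 0   # silent refresh worked
  pup auth login                 # last resort: full OAuth flow
}
```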
Additional skills available soon.
```shell
# List all available
npx skills add datadog-labs/agent-skills --list --full-depth
```
MIT