Datadog Skills for AI Agents

Datadog skills for Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, OpenCode, and other AI agents.

Skills

Skill	Description
dd-pup	Primary CLI - commands, auth, PATH setup
dd-monitors	Create, manage, mute monitors
dd-logs	Search logs
dd-apm	Traces, services, performance, Single-Step Instrumentation
dd-docs	Search Datadog documentation
dd-llmo	LLM Observability: experiments, eval RCA, evaluator generation, session classification
dd-browser-sdk	Browser SDK: RUM, Logs, Session Replay, profiling, product analytics, error tracking, version migration
dd-audit	Audit Trail investigations: who changed what, key compromise, cost spike root cause, compliance evidence (SOC 2/PCI), AI activity auditing

Install

Setup Pup

# Homebrew (macOS/Linux) — recommended
brew tap datadog-labs/pack
brew install datadog-labs/pack/pup

# Or build from source
git clone https://github.com/datadog-labs/pup.git && cd pup
cargo build --release
cp target/release/pup ~/.local/bin

Pre-built binaries are also available from the latest release.

# Authenticate
pup auth login

Add Skill(s)

For JUST dd-pup:

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --full-depth -y

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --skill dd-monitors \
  --skill dd-logs \
  --skill dd-apm \
  --skill dd-docs \
  --full-depth -y

LLM Observability (LLMO)

The dd-llmo directory contains four skills for working with LLM Observability data:

Skill	Purpose
`experiment-analyzer`	Analyze and compare offline LLM experiments
`eval-trace-rca`	Root-cause production failures using eval judge signal or runtime errors
`eval-bootstrap`	Generate evaluator code from traces, optionally seeded by RCA output
`eval-session-classify`	Classify whether user intent was satisfied in a session (trace + RUM signals)

Eval pipeline flow:

eval-session-classify          eval-trace-rca → eval-bootstrap
 (classify sessions)           (diagnose why)   (build evals)

Run eval-trace-rca to understand why an app is failing by analyzing eval judge verdicts or
runtime errors across production traces. Then run eval-bootstrap to generate evaluator code
that captures those failure patterns. Pass the RCA output directly to eval-bootstrap to seed
it with the discovered failure taxonomy.

Use eval-session-classify independently to evaluate whether individual assistant sessions
satisfied user intent, combining LLM Obs trace data with RUM behavioral signals.

Install

# Claude Code — copy any or all skills
cp -r dd-llmo/experiment-analyzer ~/.claude/skills
cp -r dd-llmo/eval-trace-rca ~/.claude/skills
cp -r dd-llmo/eval-bootstrap ~/.claude/skills
cp -r dd-llmo/eval-session-classify ~/.claude/skills

MCP Requirements

All four skills require the LLMO toolset:

claude mcp add --scope user --transport http "datadog-llmo-mcp" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'

experiment-analyzer uses the core toolset for notebook export (optional). eval-session-classify
requires it for RUM behavioral analysis and efficient batched fetches of trace session spans:

claude mcp add --scope user --transport http "datadog-mcp-core" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core'

Usage

# Analyze experiments
experiment-analyzer <experiment_id>                         # single experiment
experiment-analyzer <baseline_id> <candidate_id>            # compare two experiments
experiment-analyzer <id(s)> <question>                      # ask a specific question
experiment-analyzer <id(s)> [question] --output notebook    # export to Datadog notebook

# Root-cause why an app is failing
What's wrong with <ml_app> based on its evals over the last 24h
Analyze eval failures for <eval_name> over the last week
Look at the errors on <ml_app> over the last 24h

# Generate evaluator code from production traces
/eval-bootstrap <ml_app>                                    # cold start
/eval-bootstrap <ml_app> [paste eval-trace-rca output here] # seeded from RCA
/eval-bootstrap <ml_app> --data-only                        # emit JSON spec instead of Python SDK code

# Classify a session
/eval-session-classify <session_id>

Audit Trail (dd-audit)

The dd-audit directory contains five skills for investigating Datadog Audit Trail data:

Skill	Purpose
`security-investigation`	Who changed what, user activity, login geo, deletions, permission changes
`key-compromise`	Investigate a potentially compromised API key — timeline, geo/IP, endpoints called
`cost-spike-investigation`	Correlate usage spike (Usage Metering) with config changes (Audit Trail) to find root cause
`compliance-report`	Generate SOC 2 / PCI DSS evidence from audit data
`ai-activity-audit`	Audit what the Bits AI / MCP assistant did in your org

Prerequisites

These skills use the Datadog Audit REST API directly (no pup audit command exists yet). You need an API key + App key with audit_logs_read scope:

export DD_API_KEY=<your-api-key>
export DD_APP_KEY=<your-app-key>
export DD_SITE=datadoghq.com   # or us3/us5/eu/ap1/ap2

Install

# Claude Code — copy any or all skills
cp -r dd-audit/security-investigation ~/.claude/skills
cp -r dd-audit/key-compromise ~/.claude/skills
cp -r dd-audit/cost-spike-investigation ~/.claude/skills
cp -r dd-audit/compliance-report ~/.claude/skills
cp -r dd-audit/ai-activity-audit ~/.claude/skills

Usage

# Security investigation
Who deleted monitors in the last 24 hours?
What did [email protected] do this week?
Show login activity from unexpected locations

# Key compromise
Was API key <key_id> used from unexpected locations?
Investigate this API key: <key_id>

# Cost spike
Why did our LLM Observability usage spike on May 1?
What caused the cost increase this week?

# Compliance
Generate SOC 2 evidence for CC6.2 and CC6.3 for Q1 2026
Create a PCI DSS Requirement 10 report for the last 90 days

# AI activity
What did the Bits AI assistant do in my org this week?
Show me a governance report for AI tool calls in April

Quick Reference

Task	Command
Search error logs	`pup logs search --query "status:error" --from 1h`
List monitors	`pup monitors list`
Schedule monitor downtime	`pup downtime create --file downtime.json`
Find slow traces	`pup traces search --query "service:api @duration:>500ms" --from 1h`
Query metrics	`pup metrics query --query "avg:system.cpu.user{*}"`
List services for an env (required)	`pup apm services list --env <env> --from 1h --to now`
Check auth	`pup auth status`
Refresh token	`pup auth refresh`

More commands for pup are found in the official pup docs.

Auth

# Check auth first (includes token time remaining)
pup auth status

# If commands fail with 401/403, try refresh first
pup auth refresh

# If refresh fails or no session exists, do full OAuth login
pup auth login

# Non-default site/org
pup auth login --site datadoghq.eu --org <org>

If the browser opens the wrong profile/window, use the one-time URL printed by pup auth login and open it manually in the correct session.

More Skills

Additional skills available soon.

# List all available
npx skills add datadog-labs/agent-skills --list --full-depth

License

MIT