nemo-evaluator-sdk


SKILL.md
Metadata
name: nemo-evaluator-sdk
description: Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, NeMo, NVIDIA, Benchmarking, MMLU, HumanEval, Multi-Backend, Slurm, Docker, Reproducible, Enterprise]
dependencies: [nemo-evaluator-launcher>=0.1.25, docker]

NeMo Evaluator SDK - Enterprise LLM Benchmarking

Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Installation:

```bash
pip install nemo-evaluator-launcher
```

Set your API key, create a minimal config, and run an evaluation:

```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

View available tasks:

```bash
nemo-evaluator-launcher ls tasks
```
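
The full task list is long, so piping it through standard shell tools is a quick way to locate a specific benchmark. A minimal sketch, assuming the command writes one task per line of plain text:

```bash
# Filter the task list for MMLU-related benchmarks
# (adjust the pattern for whatever benchmark family you are looking for)
nemo-evaluator-launcher ls tasks | grep -i mmlu
```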

Common Workflows

Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

Checklist:

Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results

Step 1: Configure API endpoint

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```
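
If you don't already have a local endpoint running, one way to stand one up is with vLLM's OpenAI-compatible server. This is a sketch only; the model ID and port are placeholders, and any OpenAI-compatible server (TRT-LLM, NIM, etc.) works the same way from the config's point of view:

```bash
# Serve a model behind an OpenAI-compatible API on port 8000
# (example only; substitute your own checkpoint or HF model ID)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# The config above then points at:
#   http://localhost:8000/v1/chat/completions
```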

Step 2: Select benchmarks

Add tasks to your config:

```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN       # Some tasks need HF token
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```

Step 3: Run evaluation

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

Step 4: Check results

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
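
For a quick programmatic look at the scores, you can parse the results file with PyYAML. This is a minimal sketch; the exact layout of results.yml varies by task and harness, so adjust the traversal to what you actually see in your artifacts:

```python
# parse_results.py - print numeric metrics found in a results.yml artifact
# (sketch only; the file layout differs between tasks/harnesses)
import sys
import yaml  # pip install pyyaml

with open(sys.argv[1]) as f:  # e.g. results/<invocation_id>/<task>/artifacts/results.yml
    results = yaml.safe_load(f)

def walk(node, prefix=""):
    """Recursively print numeric leaves with their dotted key path."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{prefix}{key}.")
    elif isinstance(node, (int, float)):
        print(f"{prefix[:-1]}: {node}")

walk(results)
```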

Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

Checklist:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status

Step 1: Configure Slurm settings

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

Step 2: Set up model deployment

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

Step 3: Launch evaluation

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

Step 4: Monitor job status

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```
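
To keep an eye on a long-running job without retyping the command, a simple polling loop works. A sketch only; adjust the interval to your queue's pace and substitute the real invocation ID:

```bash
# Re-run the status check every 5 minutes
# (replace <invocation_id> with the ID printed by the run command)
watch -n 300 nemo-evaluator-launcher status <invocation_id>
```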

Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

Checklist:

Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results

Step 1: Create base config

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

Step 2: Run evaluations with model overrides

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

Step 3: Export and compare

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
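
Once both runs are exported as local JSON, a short script can put the numbers side by side. This is a sketch under the assumption that each export is a JSON file mapping metric names to numeric scores; adapt the loading code to the actual export layout:

```python
# compare_results.py - naive side-by-side comparison of two exported JSON results
# (sketch; assumes each file maps metric names to numeric scores)
import json
import sys

def load(path):
    with open(path) as f:
        return json.load(f)

a, b = load(sys.argv[1]), load(sys.argv[2])
for metric in sorted(set(a) & set(b)):
    if isinstance(a[metric], (int, float)) and isinstance(b[metric], (int, float)):
        delta = a[metric] - b[metric]
        print(f"{metric:40s} {a[metric]:>8.3f} {b[metric]:>8.3f} (delta {delta:+.3f})")
```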

Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

Checklist:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation

Step 1: Configure safety tasks

```yaml
evaluation:
  tasks:
    - name: aegis       # Safety harness
    - name: wildguard   # Safety classification
    - name: garak       # Security probing
```

Step 2: Configure VLM tasks

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding
```

When to Use vs Alternatives

Use NeMo Evaluator when:

  • You need 100+ benchmarks from 18+ harnesses on one platform
  • You run evaluations on Slurm HPC clusters or in the cloud
  • You require reproducible, containerized evaluation
  • You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
  • You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:

  • lm-evaluation-harness: Simpler setup for quick local evaluation
  • bigcode-evaluation-harness: Focused only on code benchmarks
  • HELM: Stanford's broader evaluation (fairness, efficiency)
  • Custom scripts: Highly specialized domain evaluation

Supported Harnesses and Tasks

| Harness | Task Count | Categories |
| --- | --- | --- |
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |

Common Issues

Issue: Container pull fails

Ensure NGC credentials are configured:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

Issue: Task requires environment variable

Some tasks need HF_TOKEN or JUDGE_API_KEY:

```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps env var name to env var
```

Issue: Evaluation timeout

Increase parallelism or reduce samples:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

Issue: Slurm job not starting

Check Slurm account and partition:

```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS
```

Issue: Different results than expected

Verify configuration matches reported settings:

```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count
```

CLI Reference

| Command | Description |
| --- | --- |
| run | Execute evaluation with config |
| status <id> | Check job status |
| info <id> | View detailed job info |
| ls tasks | List available benchmarks |
| ls runs | List all invocations |
| export <id> | Export results (mlflow/wandb/local) |
| kill <id> | Terminate running job |

Configuration Override Examples

```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```

Python API Usage

For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
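
The same API can be looped over several benchmarks. This sketch reuses only the classes and call shown above; the task names and parameters are illustrative, and not every task necessarily accepts the same ConfigParams, so verify per task:

```python
# Sweep a few benchmarks against the endpoint configured above
# (sketch; task names and parameters are illustrative)
for task in ["mmlu_pro", "gsm8k_cot_instruct", "ifeval"]:
    cfg = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",
        params=ConfigParams(limit_samples=10, temperature=0.0, parallelism=4),
    )
    result = evaluate(eval_cfg=cfg, target_cfg=target_config)
    print(task, result)
```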

Advanced Topics

  • Multi-backend execution: See references/execution-backends.md
  • Configuration deep-dive: See references/configuration.md
  • Adapter and interceptor system: See references/adapter-system.md
  • Custom benchmark integration: See references/custom-benchmarks.md

Requirements

  • Python: 3.10-3.13
  • Docker: Required for local execution
  • NGC API Key: For pulling containers and using NVIDIA Build
  • HF_TOKEN: Required for some benchmarks (GPQA, MMLU)
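
A small preflight check before a first run, covering only the items listed above (a sketch; it verifies the interpreter version, Docker availability, and the two environment variables):

```bash
# Preflight check for the requirements above (sketch)
python3 -c 'import sys; assert (3, 10) <= sys.version_info[:2] <= (3, 13), sys.version'
docker info > /dev/null && echo "Docker OK"
[ -n "$NGC_API_KEY" ] && echo "NGC_API_KEY set" || echo "NGC_API_KEY missing"
[ -n "$HF_TOKEN" ] && echo "HF_TOKEN set" || echo "HF_TOKEN missing (needed for some benchmarks)"
```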

Resources