
SKILL.md
Metadata
name: nemo-curator
description: GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data]
dependencies: [nemo-curator, cudf, dask, rapids]

NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

When to use NeMo Curator

Use NeMo Curator when:

  • Preparing LLM training data from web scrapes (Common Crawl)
  • Deduplicating large corpora quickly (16× faster than CPU)
  • Curating multi-modal datasets (text, images, video, audio)
  • Filtering low-quality or toxic content
  • Scaling data processing across a GPU cluster

Performance:

  • 16× faster fuzzy deduplication (8TB RedPajama v2)
  • 40% lower total cost of ownership (TCO) vs CPU alternatives
  • Near-linear scaling across GPU nodes

Consider these alternatives instead:

  • datatrove: CPU-based, open-source data processing
  • dolma: Allen AI's data toolkit
  • Ray Data: General ML data processing (no curation focus)

Quick start

Installation

# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"

Basic text curation pipeline

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
import pandas as pd

# Load data
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Good document", "Bad doc", "Excellent text"],
})
dataset = DocumentDataset(df)

# Quality filtering: keep documents longer than 5 words (filter short docs)
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")

Data curation pipeline

Stage 1: Quality filtering

from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ available)

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))

Stage 2: Deduplication

Exact deduplication:

from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)

Fuzzy deduplication (16× faster on GPU):

from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,      # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)

Semantic deduplication:

from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,       # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)

Stage 3: PII redaction

from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",   # or "redact"
)
redacted = Modify(pii_redactor)(dataset)

Stage 4: Classifier filtering

from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

GPU acceleration

GPU vs CPU performance

Operation           | CPU (16 cores) | GPU (A100) | Speedup
Fuzzy dedup (8TB)   | 120 hours      | 7.5 hours  | 16×
Exact dedup (1TB)   | 8 hours        | 0.5 hours  | 16×
Quality filtering   | 2 hours        | 0.2 hours  | 10×

Multi-GPU scaling

from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)

Multi-modal curation

Image curation

from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)

Video curation

from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)

Audio curation

from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)

Common patterns

Web scrape curation (Common Crawl)

from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),
    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    PIIRedactor(),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")

Distributed processing

from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()

Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)

  • CPU (256 cores): 120 hours
  • GPU (8× A100): 7.5 hours
  • Speedup: 16×

Exact deduplication (1TB)

  • CPU (64 cores): 8 hours
  • GPU (4× A100): 0.5 hours
  • Speedup: 16×

Quality filtering (100GB)

  • CPU (32 cores): 2 hours
  • GPU (2× A100): 0.2 hours
  • Speedup: 10×

Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

  • Cost: $3.60/hour × 10 = $36/hour
  • Time for 8TB: 120 hours
  • Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

  • Cost: $32.77/hour × 2 = $65.54/hour
  • Time for 8TB: 7.5 hours
  • Total: $491.55

Savings: 89% reduction ($3,828 saved)
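
The savings figure follows directly from the hourly rates and runtimes quoted above; a minimal sketch of the arithmetic (instance counts and on-demand prices are the ones assumed in this comparison):

# Cost arithmetic for the comparison above, using the rates quoted in this section.
cpu_rate = 3.60 * 10        # 10× c5.18xlarge, $/hour
gpu_rate = 32.77 * 2        # 2× p4d.24xlarge, $/hour

cpu_total = cpu_rate * 120  # 120 hours to process 8TB on CPU
gpu_total = gpu_rate * 7.5  # 7.5 hours to process 8TB on GPU

savings = cpu_total - gpu_total        # ≈ $3,828
reduction = 1 - gpu_total / cpu_total  # ≈ 0.89 → 89% reduction
print(f"CPU ${cpu_total:,.2f} vs GPU ${gpu_total:,.2f} "
      f"-> ${savings:,.2f} saved ({reduction:.0%})")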

Supported data formats

  • Input: Parquet, JSONL, CSV
  • Output: Parquet (recommended), JSONL
  • WebDataset: TAR archives for multi-modal
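
A small I/O sketch for the formats listed above, using the DocumentDataset readers/writers that appear in the pipelines earlier in this skill; read_json is an assumption made by analogy with read_parquet, so verify the exact reader names in your installed nemo-curator version:

from nemo_curator.datasets import DocumentDataset

# Parquet in / Parquet out (the recommended path, used in the pipelines above)
dataset = DocumentDataset.read_parquet("raw_data/*.parquet")
dataset.to_parquet("curated_data/")

# JSONL in / Parquet out (read_json assumed; check your nemo-curator version)
jsonl_dataset = DocumentDataset.read_json("raw_data/*.jsonl")
jsonl_dataset.to_parquet("curated_jsonl_data/")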

Use cases

Production deployments:

  • NVIDIA used NeMo Curator to prepare Nemotron-4 training data
  • Open-source datasets curated: RedPajama v2, The Pile
