
SKILL.md
Metadata
name: huggingface-accelerate
description: Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple]
dependencies: [accelerate, torch, transformers]

HuggingFace Accelerate - Unified Distributed Training

Quick start

Accelerate simplifies distributed training to 4 lines of code.

Installation:

pip install accelerate

Convert an existing PyTorch script (4 added lines, shown as a diff):

  import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py
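Flags passed to accelerate launch override the saved config, so one script serves many setups. For example (both flags are standard accelerate launch options):

accelerate launch --num_processes 4 --mixed_precision bf16 train.py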

Common workflows

Workflow 1: From single GPU to multi-GPU

Original script:

# train.py
import torch

dataset = torch.randn(1024, 10)  # toy data so the script runs standalone

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (4 lines added):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

dataset = torch.randn(1024, 10)  # toy data so the script runs standalone

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()

Configure (interactive):

accelerate config

Questions (an example of the generated config follows the list):

  • Which machine? (single/multi GPU/TPU/CPU)
  • How many machines? (1)
  • Mixed precision? (no/fp16/bf16/fp8)
  • DeepSpeed? (no/yes)
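Answering these writes a config file that accelerate launch reads by default. A representative example of the generated YAML, with illustrative values (the default path is ~/.cache/huggingface/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 8
use_cpu: false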

Launch (works on any setup):

# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node (2 machines x 8 GPUs)
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py

Workflow 2: Mixed precision training

Enable FP16/BF16:

from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional - done automatically after prepare()
        loss = model(batch)
    accelerator.backward(loss)
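Gradient clipping pairs naturally with FP16, and it should go through the accelerator so gradients are unscaled before clipping. A minimal sketch using accelerator.clip_grad_norm_ (the max_norm=1.0 value is an arbitrary choice for illustration):

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    # Unscale (for fp16) and clip before the optimizer step
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()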

Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # ZeRO-2
    offload_optimizer_device="none",
    gradient_accumulation_steps=4,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

# Same training code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{ "fp16": {"enabled": false}, "bf16": {"enabled": true}, "zero_optimization": { "stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8 } }

Launch (note: --config_file expects an Accelerate config file, which can reference deepspeed_config.json via its deepspeed_config_file field):

accelerate launch --config_file accelerate_config.yaml train.py

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",             # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",
    cpu_offload=False,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: FSDP → Full Shard → No CPU Offload

Workflow 5: Gradient accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps (e.g. 32 * 8 * 4 = 1024)
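During accumulation, accelerator.sync_gradients is True only on iterations where gradients are actually synchronized, which is the right moment to count optimizer steps or log metrics. A short sketch (the global_step counter is illustrative):

global_step = 0
for batch in dataloader:
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
    # True once per effective batch, i.e. every gradient_accumulation_steps iterations
    if accelerator.sync_gradients:
        global_step += 1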

When to use vs alternatives

Use Accelerate when:

  • Want simplest distributed training
  • Need single script for any hardware
  • Use HuggingFace ecosystem
  • Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
  • Need quick prototyping

Key advantages:

  • 4 lines: Minimal code changes
  • Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
  • Automatic: Device placement, mixed precision, sharding
  • Interactive config: No manual launcher setup
  • Single launch: Works everywhere

Use alternatives instead:

  • PyTorch Lightning: Need callbacks, high-level abstractions
  • Ray Train: Multi-node orchestration, hyperparameter tuning
  • DeepSpeed: Direct API control, advanced features
  • Raw DDP: Maximum control, minimal abstraction

Common issues

Issue: Wrong device placement

Don't manually move to device:

# WRONG
batch = batch.to('cuda')

# CORRECT
# Accelerate handles placement automatically after prepare()
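Tensors created outside a prepared dataloader still need explicit placement; accelerator.device holds the device assigned to the current process. A small sketch (the mask tensor is just an example):

# Manually created tensors go on accelerator.device, never a hard-coded 'cuda'
mask = torch.ones(32, dtype=torch.bool, device=accelerator.device)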

Issue: Gradient accumulation not working

Use context manager:

# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    optimizer.step()

Issue: Checkpointing in distributed

Use accelerator methods:

# Save training state (model, optimizer, RNG) - call on ALL processes;
# save_state coordinates what each rank writes
accelerator.save_state('checkpoint/')

# Resume - also called on all processes
accelerator.load_state('checkpoint/')
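For exporting final weights (as opposed to a resumable training state), a common pattern is to unwrap the model first; a sketch using unwrap_model and accelerator.save (the 'model.pt' path is arbitrary):

# Make sure all processes have finished training
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), 'model.pt')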

Issue: Different results with FSDP

Ensure same random seed:

from accelerate.utils import set_seed

set_seed(42)

Advanced topics

Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.

Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.

Hardware requirements

  • CPU: Works (slow)
  • Single GPU: Works
  • Multi-GPU: DDP (default), DeepSpeed, or FSDP
  • Multi-node: DDP, DeepSpeed, FSDP, Megatron
  • TPU: Supported
  • Apple MPS: Supported

Launcher requirements (verify your setup with the command after this list):

  • DDP: torch.distributed.run (built-in)
  • DeepSpeed: deepspeed (pip install deepspeed)
  • FSDP: PyTorch 1.12+ (built-in)
  • Megatron: Custom setup
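To check which launcher, config, and devices Accelerate detects on a machine, the CLI includes a diagnostic command:

accelerate env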

Resources