FlashInfer: Kernel Library for LLM Serving
FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.
| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |
Note: Not all features are supported across all compute capabilities.
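As a quick way to check a device against the table above, the sketch below transcribes the table into a lookup keyed by `(major, minor)` compute capability. The names `SUPPORTED_ARCHITECTURES` and `architecture_for` are hypothetical helpers for illustration and are not part of FlashInfer.

```python
# Lookup table transcribed from the support table above.
# Keys are (major, minor) compute capabilities; values are architecture names.
SUPPORTED_ARCHITECTURES = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell",
    (10, 3): "Blackwell",
    (11, 0): "Blackwell",
    (12, 0): "Blackwell",
    (12, 1): "Blackwell",
}


def architecture_for(major, minor):
    """Return the architecture name for a compute capability, or None if it is not listed."""
    return SUPPORTED_ARCHITECTURES.get((major, minor))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight in, for example `architecture_for(*torch.cuda.get_device_capability())`. A listed capability only means the GPU appears in the table; individual features may still be unavailable on it.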
Quickstart:
pip install flashinfer-python
Package Options:
For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:
pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
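The `cu129` suffix in the index URL encodes the CUDA version (12.9 here). A minimal sketch of deriving the suffix and URL from a version string follows. `cuda_tag` and `jit_cache_index_url` are hypothetical helpers, and the sketch only builds the string: it does not check that a wheel index exists for a given CUDA version.

```python
def cuda_tag(cuda_version):
    """Turn a CUDA version such as "12.9" into a wheel tag such as "cu129"."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def jit_cache_index_url(cuda_version, nightly=False):
    """Build the index URL following the pattern used in this README's install commands."""
    base = "https://flashinfer.ai/whl"
    if nightly:
        base += "/nightly"
    return f"{base}/{cuda_tag(cuda_version)}"


print(jit_cache_index_url("12.9"))
```

If PyTorch is installed, `torch.version.cuda` holds the CUDA version string of the PyTorch build (it is `None` for CPU-only builds), which is usually the version to match.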
To enable Blackwell-optimized (SM100+) CuTe DSL kernels, install with the CUDA 13 extra (quoted so that shells such as zsh do not expand the brackets):
pip install "flashinfer-python[cu13]"
# Verify installation and view configuration
flashinfer show-config
import torch
import flashinfer
# Single decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16) # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16) # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)
output = flashinfer.single_decode_with_kv_cache(q, k, v)
See documentation for comprehensive API reference and tutorials.
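To make the tensor shapes in the example concrete, here is a slow pure-Python reference of what decode attention computes for a single query token. It is a sketch for checking intuition, not FlashInfer's implementation. It assumes the layout shown in the comments above (`[kv_len, num_kv_heads, head_dim]`), a softmax scale of `1 / sqrt(head_dim)`, no positional encoding, and that consecutive query heads share a KV head when `num_qo_heads > num_kv_heads`. The function name `reference_single_decode` is hypothetical.

```python
import math


def reference_single_decode(q, k, v):
    """Decode attention for one query token, using nested Python lists.

    q:    [num_qo_heads][head_dim]
    k, v: [kv_len][num_kv_heads][head_dim]
    Returns a list shaped [num_qo_heads][head_dim].
    """
    num_qo_heads, head_dim = len(q), len(q[0])
    kv_len, num_kv_heads = len(k), len(k[0])
    group_size = num_qo_heads // num_kv_heads  # query heads per KV head (assumed grouping)
    scale = 1.0 / math.sqrt(head_dim)          # assumed default softmax scale

    output = []
    for h in range(num_qo_heads):
        kv_h = h // group_size
        # Scaled dot product between the query head and every cached key.
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q[h], k[t][kv_h]))
            for t in range(kv_len)
        ]
        # Numerically stable softmax over the KV positions.
        max_score = max(scores)
        weights = [math.exp(s - max_score) for s in scores]
        total = sum(weights)
        # Weighted average of the cached values.
        output.append([
            sum(weights[t] * v[t][kv_h][d] for t in range(kv_len)) / total
            for d in range(head_dim)
        ])
    return output
```

On small inputs the result can be compared against the library output after converting tensors with `.tolist()`, allowing for float16 rounding.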
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .
For development, install in editable mode:
python -m pip install --no-build-isolation -e . -v
Note: When using `--no-build-isolation`, pip does not automatically install build dependencies. FlashInfer requires `setuptools>=77`. If you encounter an error like `AttributeError: module 'setuptools.build_meta' has no attribute 'prepare_metadata_for_build_editable'`, upgrade pip and setuptools first: `python -m pip install --upgrade pip setuptools`
Build optional packages:
# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
For more details, see the Install from Source documentation.
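When scripting builds, it can help to validate the `FLASHINFER_CUDA_ARCH_LIST` value before starting a long compile. The sketch below parses the space-separated format seen in the example above. The grammar (a `major.minor` pair with an optional single-letter suffix) is inferred from that one example and is an assumption, and `parse_arch_list` is a hypothetical helper that does not interpret the suffixes.

```python
import re


def parse_arch_list(value):
    """Split an arch list such as "7.5 9.0a 12.0f" into (major, minor, suffix) tuples."""
    entries = []
    for token in value.split():
        match = re.fullmatch(r"(\d+)\.(\d+)([a-z]?)", token)
        if match is None:
            raise ValueError(f"unrecognized arch entry: {token!r}")
        entries.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return entries


print(parse_arch_list("7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"))
```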
pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129
FlashInfer provides several CLI commands for configuration, module management, and development:
# Verify installation and view configuration
flashinfer show-config
# List and inspect modules
flashinfer list-modules
flashinfer module-status
# Manage artifacts and cache
flashinfer download-cubin
flashinfer clear-cache
# For developers: generate compile_commands.json for IDE integration
flashinfer export-compile-commands [output_path]
For complete documentation, see the CLI reference.
FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:
# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3
# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout
For detailed information about logging levels, configuration, and advanced features, see Logging in our documentation.
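The same variables can be set from Python, for example in a test harness or notebook. The sketch below assumes the variables should be in place before `flashinfer` is imported, which is the safe ordering if they are read at import time. It accepts only the levels listed above, so other levels the library may support are rejected by this sketch. `configure_flashinfer_logging` is a hypothetical helper, not a FlashInfer API.

```python
import os

# Levels listed in this README; the library may accept others.
DOCUMENTED_LOG_LEVELS = {0, 1, 3, 5}


def configure_flashinfer_logging(level=3, dest="stdout"):
    """Set the FlashInfer logging environment variables for this process."""
    if level not in DOCUMENTED_LOG_LEVELS:
        raise ValueError(f"undocumented log level: {level}")
    os.environ["FLASHINFER_LOGLEVEL"] = str(level)
    os.environ["FLASHINFER_LOGDEST"] = dest  # "stdout", "stderr", or a file path


configure_flashinfer_logging(level=3, dest="stderr")
# import flashinfer  # import only after the variables are set (assumed ordering)
```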
Users can define custom attention variants that take additional parameters. For more details, refer to our JIT examples.
Supported CUDA Versions: 12.6, 12.8, 13.0, 13.1
Note: FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.
FlashInfer powers inference in:
FlashInfer is inspired by FlashAttention, vLLM, stream-K, CUTLASS, and AITemplate.
If you find FlashInfer helpful in your project or research, please consider citing our paper:
@article{ye2025flashinfer,
    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
    author = {
        Ye, Zihao and
        Chen, Lequn and
        Lai, Ruihang and
        Lin, Wuwei and
        Zhang, Yineng and
        Wang, Stephanie and
        Chen, Tianqi and
        Kasikci, Baris and
        Grover, Vinod and
        Krishnamurthy, Arvind and
        Ceze, Luis
    },
    journal = {arXiv preprint arXiv:2501.01005},
    year = {2025},
    url = {https://arxiv.org/abs/2501.01005}
}