npx skills add https://github.com/pepperu96/hyper-mla --skill cute-dsl-ref

Install this skill with the CLI and start using the SKILL.md workflow in your workspace.
HyperMLA implements optimized Multi-Latent Attention (MLA) kernels for DeepSeek-V3/R1 and Kimi-V2 inference. It provides kernels that interpolate between FlashMLA (optimal for decode, $s=1$) and FlashAttention (optimal for prefill, $s \geq 128$), addressing the performance gap in speculative decoding regimes ($1 < s < 128$).
./setup_cutile.sh
source .venv/bin/activate
Make sure to activate the virtual environment with source .venv/bin/activate before running the kernels.
Make sure to lock GPU clocks for reproducible profiling results! See GPU Clock Control section for scripts and details.
CRITICAL: Always run only one profiling job at a time on the same device; GPU memory is limited!
We offer a double-versioned kernel structure (see Kernel Naming and Versioning) that keeps multiple versions of the same kernel design intact for easy comparison. The main entry points come in two forms:
source .venv/bin/activate && ./scripts/lock-gpu-clock.sh
python3 -m mla_var3.kernel <kernel-name> [<version>] [kernel arguments]
source .venv/bin/activate && ./scripts/lock-gpu-clock.sh
python3 -m mla_var3.kernel.cutile.mla.flash_attention [<version>] --b=32 --s=512 --t=4096
The framework is language-aware. cuTile Python DSL (cutile-dsl) is the default implementation path, and kernel packages are organized under src/mla_var3/kernel/<language>/.... The currently supported language keys are cutile-dsl and cute-dsl.
source .venv/bin/activate
./scripts/lock-gpu-clock.sh # Lock GPU clocks for reproducibility
# FlashMLA base version (decode, s=1)
python -m mla_var3.kernel.cutile.mla.flash_mla --b=32 --s=1 --t=4096
# FlashAttention base version (prefill, s>=128)
python -m mla_var3.kernel.cutile.mla.flash_attention --b=32 --s=512 --t=4096
# MLA-var6+ base version (speculative decoding, 1 < s < 128)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus --b=32 --s=16 --t=4096
# MLA-var6+ v4 (speculative decoding, 1 < s < 128)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096
# Shortcut example (recommended)
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096
./scripts/reset-gpu-clock.sh # Reset GPU clocks when done
This automatically runs the kernel, performs autotuning, benchmarks it in annotation profiling mode (see Profiling section), and produces various profiling artifacts, including a performance report in out/profiles/annotation/<kernel_name>/<params>/<timestamp>/report.md.
Use --help for more options (e.g., disable autotuning, choose event / ncu / nsys, enable correctness checking, or override --out_dir).
Check Profiling section for more profiling modalities and detailed instructions.
├── src/mla_var3/
│ ├── kernel/
│ │ ├── cutile/mla/ # cuTile kernels (current default implementation path)
│ │ └── cute_python/mla/ # CuTe Python DSL kernels
│ ├── runtime/
│ │ ├── plan.py # KernelPlan CLI + profiling dispatch (shared abstraction)
│ │ ├── composite.py # KernelPipeline and ConcurrentKernels runtime
│ │ ├── ct_kernel.py # cuTile-specific runtime infrastructure (CtKernel)
│ │ └── paths.py # Profiling output path helpers
│ ├── conf/devices.json # GPU peak performance specs
├── tests/benchmark/ # Benchmark scripts
│ ├── bench_all_mla.py # Bench b=32, s=16, t=4096 across all kernels and versions
│ ├── bench_mla.py # Multi-backend sweep
│ ├── bench_mla_var6_plus.py # MLA-var6+ detailed benchmark
│ ├── bench_mla_var6_plus_v3.py # Version-specific MLA-var6+ benchmark
│ └── bench_mla_var6_plus_v4.py # Version-specific MLA-var6+ benchmark
├── docs/ # Documentation
│ ├── index.md # Central documentation entry point
│ ├── python-cuda-tile.md # cuTile Python DSL quick reference
│ ├── cost-analysis.md # Roofline and cost models
│ ├── kernels/mla-var6-plus.md # MLA-var6+ design docs
│ ├── knowledge/README.md # Shared vs language-specific optimization knowledge
│ ├── results/ # Main documented experiment results and analysis
│ └── devices/nvidia-geforce-rtx-5090.md # Device specs for RTX 5090 (used for static analysis)
├── scripts/ # GPU clock control scripts
│ ├── lock-gpu-clock.sh # Generic entry point that dispatches to a device-specific script
│ ├── rtx5090-lock-gpu-clock.sh # RTX 5090 clock profile
│ └── reset-gpu-clock.sh # Reset to default
├── third_party/cutile-python/ # cuTile Python DSL (NVIDIA)
├── docs/cutile-dsl/index.md # Stable cuTile Python DSL docs index
├── out/profiles/ # Local profiling outputs (annotation/event/nsys/ncu)
├── out/tests/ # Test outputs
└── setup_cutile.sh # Installation script
Today, the repository runtime is most mature for cuTile Python DSL (cutile-dsl). CuTe Python DSL (cute-dsl) is the other supported workflow target.
Capturing detailed statistics for cuTile Python kernels requires NVIDIA Driver >= r580.126.09 (Linux) or r582.16 (Windows).
Kernel development is now language-aware.
- Prefer the cuTile Python DSL (cutile-dsl) by default when the optimization is still expressible through block-level structure, tiling, CTA remapping, or compiler hints.
- Use the CuTe Python DSL (cute-dsl) when more explicit control is needed than cuTile exposes but a Python-authored workflow is still desirable.

Language-specific design skills live under .claude/skills/design-<language>-kernel/. The current generic runtime abstractions are centered on KernelPlan, while CtKernel remains the existing cuTile-specific execution wrapper.
To streamline the development of new kernels, we provide templates for both single-kernel implementations and multi-kernel pipelines.
Copy src/mla_var3/kernel/cutile/mla/flash_attention/ as a template to src/mla_var3/kernel/cutile/mla/<new_kernel_name>/.
Required components:
Kernel function decorated with @ct.kernel:
@ct.kernel
def my_kernel(input_tensors, output_tensors, constants, TILE_PARAMS):
    # Kernel logic using cuTile operations
    ...
Tiling dataclass extending Tiling:
@dataclass
class MyKernelTiling(Tiling):
    Bm: int = 32
    Bn: int = 64
    num_ctas: int = None
    occupancy: int = None

    def validate(self, pd: "CtMyKernel"):
        return self.Bm <= pd.s and self.Bn <= pd.t
KernelPlan subclass with required methods:
@dataclass
class CtMyKernel(KernelPlan):
    b: int = 64
    s: int = 1
    # ... problem dimensions
    tiling: MyKernelTiling = field(default_factory=MyKernelTiling)

    def prepare_inputs(self, device) -> tuple:
        # Create and return input tensors
        ...

    def reference_fn(self, *inputs) -> tuple:
        # Reference implementation for correctness check
        ...

    def _autotune_configs(self) -> list[MyKernelTiling]:
        # Return list of tiling configs to search
        ...

    def _algorithmic_flops_bytes(self, *inputs, tiling) -> tuple[int, int]:
        # Return (flops, bytes) for roofline prediction
        ...

    def plan(self, *inputs) -> CtKernel:
        # Return CtKernel / KernelPipeline / ConcurrentKernels runtime object
        ...

    def plan_empty(self, peak_tflops, peak_gbps) -> CtKernel:
        # For roofline-only prediction (no tensors needed)
        ...
CLI entry point:
if __name__ == "__main__":
    kernel_plan = CtMyKernel()
    kernel_plan.benchmark_kernel_argparse()
The CLI entry point directly supports the annotation and NCU profiling modes, which simply relaunch the entry point under NCU.
Copy the src/mla_var3/kernel/cutile/mla/mla_var6_plus/mla_var6_plus/ directory structure (i.e., the base version of mla_var6_plus):
- mla_var6_plus.py - Pipeline entry point returning KernelPipeline
- attention_block_specialized.py - First kernel stage (CtKernel)
- decompress_combine.py - Second kernel stage (CtKernel)

Each stage follows the single-kernel template. The pipeline combines them:
def plan(self, *inputs) -> KernelPipeline:
    attention = attention_plan.plan(...)
    decompress = decompress_plan.plan(...)
    return KernelPipeline(pipeline_name="my_pipeline", stages=[attention, decompress])
For partially overlapped stages, the runtime also supports ConcurrentKernels inside KernelPipeline. This is used by newer MLA-var6+ variants to launch independent stage groups on separate CUDA streams while keeping the same outer pipeline interface.
The central object you implement is the KernelPlan subclass. It is the public entry point for a kernel version: it owns the problem sizes and dtype, exposes CLI arguments from its dataclass fields, prepares tensors, dispatches autotuning and profiling, optionally checks correctness, and builds the runtime object that is actually executed.
We give an explicit example for cuTile Python DSL kernels, but the runtime can support any language.
For a single kernel, the implementation is split into three layers:
- Kernel function: the @ct.kernel function contains the GPU program itself.
- Tiling dataclass: extends Tiling and stores launch/tile parameters plus any validation logic for legal configs.
- KernelPlan subclass: connects the kernel to the runtime.

The main KernelPlan methods have distinct roles:
- prepare_inputs(device): allocate and initialize the input tensors based on the fields of the dataclass.
- reference_fn(*inputs): return the reference outputs used by --check.
- _autotune_configs(): return the candidate tilings searched by the runtime autotuner.
- _algorithmic_flops_bytes(tiling): report analytical FLOPs and bytes for roofline metrics.
- plan(*inputs): build the executable runtime object for real tensors. For a single kernel this usually returns a CtKernel.
- plan_empty(peak_tflops, peak_gbps): build the same runtime structure without allocating real tensors, so roofline prediction and autotune-by-model can run without launching the full kernel path.

For a single cuTile stage, plan(...) typically creates a CtKernel by providing:
- the @ct.kernel function,
- a grid_fn(tiling) that computes the launch grid,
- an args_fn(tiling) that returns the kernel arguments.

A minimal single-stage plan(...) sketch is shown below.

For multi-stage kernels, keep each stage as its own KernelPlan subclass and use a top-level KernelPlan subclass only as an orchestrator. Its plan(...) should instantiate the stage plans, call each stage's plan(...), and connect the resulting runtime objects.
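Returning to the single-stage case, here is a rough sketch of a plan(...) that builds a CtKernel from the pieces listed above. The CtKernel keyword arguments (kernel, grid_fn, args_fn) and the helper below are illustrative assumptions rather than the actual constructor contract; see runtime/ct_kernel.py for the real signature.

```python
import torch
from mla_var3.runtime.ct_kernel import CtKernel  # CtKernel lives in runtime/ct_kernel.py

def _ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

# Inside CtMyKernel (the KernelPlan subclass from the template above):
def plan(self, q, kv) -> CtKernel:
    out = torch.empty_like(q)  # output tensor owned by this stage

    def grid_fn(tiling):
        # One CTA per (Bm, Bn) tile of the (s, t) problem, replicated over the batch.
        return (_ceil_div(self.s, tiling.Bm), _ceil_div(self.t, tiling.Bn), self.b)

    def args_fn(tiling):
        # Arguments forwarded to the @ct.kernel function at launch time.
        return (q, kv, out, tiling.Bm, tiling.Bn)

    # NOTE: the keyword names below are assumptions for illustration only.
    return CtKernel(kernel=my_kernel, grid_fn=grid_fn, args_fn=args_fn)
```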
Use KernelPipeline when stages must run in order and later stages depend on earlier outputs:
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(stage1.output_tensors[0], ...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
Use ConcurrentKernels when independent stages should launch on separate CUDA streams and overlap:
For instance, to have two stages a and b that can run concurrently, and then a final combine stage that depends on both:
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="my_overlap_group",
        concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_joint_tiling_fn,
    )
    combine = combine_plan.plan(a.output_tensors[0], b.output_tensors[0], ...)
    return KernelPipeline(_name="my_pipeline", stages=[concurrent, combine])
ConcurrentKernels is a stage container, not a replacement for the top-level plan. It groups already-built runtime objects that can run together, optionally with a validate_joint_tiling_fn to reject combinations that oversubscribe SMs or violate other shared-resource constraints. In practice, the outer KernelPlan still defines the user-facing kernel version, while KernelPipeline and ConcurrentKernels express how the internal stages execute.
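For example, a joint-tiling validator might simply cap the total CTA count launched by the group. A minimal sketch, assuming each stage's tiling exposes a num_ctas field as in the Tiling template above (the signature and SM budget are illustrative, not the project's actual contract):

```python
NUM_SMS = 170  # RTX 5090 streaming multiprocessor count

def validate_joint_tiling_fn(tiling_a, tiling_b) -> bool:
    # Reject tiling combinations that oversubscribe the SMs, so the two stages
    # can actually overlap instead of being serialized by the hardware scheduler.
    total_ctas = (tiling_a.num_ctas or 0) + (tiling_b.num_ctas or 0)
    return 0 < total_ctas <= NUM_SMS
```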
In other words, CtKernel, KernelPipeline, and ConcurrentKernels are all runtime objects that can be returned by any KernelPlan subclass, and they can be nested arbitrarily (e.g., a KernelPipeline can contain ConcurrentKernels, which in turn contain CtKernel stages).
The most important thing for easy and traceable kernel development is to follow strict naming and versioning conventions. This allows the profiling utilities to work correctly and makes it possible to trace each source file to its corresponding profiling results and insights.
We use a nesting structure for the kernel packages, all inside the kernel sub-package:
- Language level: one sub-package per implementation language (e.g., kernel.cutile).
- Domain level: one sub-package per kernel domain (e.g., kernel.cutile.mla for the MLA kernels used in MLA layers).
- Kernel design level: one package per kernel design (e.g., kernel.cutile.mla.flash_attention for the FlashAttention kernel, kernel.cutile.mla.mla_var6_plus for the MLA-var6+ kernel design).
- Version level: one sub-package per version of a kernel design (e.g., kernel.cutile.mla.mla_var6_plus.mla_var6_plus_v2 for version 2 of the MLA-var6+ kernel design). The first version of a kernel design is named after the kernel design itself without a version suffix, and is commonly referred to as the "base version" (e.g., kernel.cutile.mla.mla_var6_plus.mla_var6_plus is the base version of the MLA-var6+ kernel design).
- Main module: each version package contains a module named after the version (e.g., kernel.cutile.mla.mla_var6_plus.mla_var6_plus_v2.mla_var6_plus_v2.py), which must contain the KernelPlan subclass that will be used as the main entry point for benchmarking and profiling.

For CLI usage, versioned kernels are normally selected through the package entrypoint plus an optional version argument:
# Base version (also aliased internally as v0)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus --b=32 --s=16 --t=4096
# Explicit version selection
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v2 --b=32 --s=16 --t=4096
The kernel function decorated with @ct.kernel must match the name of the kernel module in which it is contained (e.g., flash_attention.py should contain a kernel function named flash_attention).

The main workflow consists of:
- Cloning the closest existing kernel version as the starting point for a new one (e.g., mla_var6_plus version 2).
- Updating the corresponding kernel design doc (docs/kernels/mla-var6-plus.md, in the Development log section) with the insights derived from profiling and the motivation for the optimization.

source .venv/bin/activate
python ./scripts/clone-kernel.py mla_var6_plus_v2 v5
The cloning workflow copies the package and updates versioned module names, @ct.kernel function names, KernelPlan and Tiling class names, intra-package imports, and quoted forward references such as validate(self, pd: "CtDecompressCombineV5").
source .venv/bin/activate
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v5 --prof_type=disabled --check
The automatic entry points generate CLI arguments from the KernelPlan dataclass fields and the tiling dataclass fields.
All fields of the plan class and tiling class are exposed as CLI arguments (e.g., --b, --s, --Bm, --prof_type, etc.).
This makes it easy to run the kernels with different parameters, and even to specify the tiling parameters directly (--no_autotune must be set to use the tiling parameters directly, without autotuning).
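As an illustration of the mechanism (a self-contained sketch, not the project's benchmark_kernel_argparse implementation), dataclass fields map naturally onto argparse flags:

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class ExamplePlan:  # hypothetical stand-in for a KernelPlan subclass
    b: int = 64
    s: int = 1
    t: int = 4096
    dtype: str = "bfloat16"

def build_parser(plan_cls) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for f in fields(plan_cls):
        # Each dataclass field becomes a --<name> flag with the field's default.
        field_type = f.type if isinstance(f.type, type) else str
        parser.add_argument(f"--{f.name}", type=field_type, default=f.default)
    return parser

args = build_parser(ExamplePlan).parse_args(["--b=32", "--s=16"])
print(ExamplePlan(**vars(args)))  # ExamplePlan(b=32, s=16, t=4096, dtype='bfloat16')
```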
# Show all options
python -m mla_var3.kernel.cutile.mla.flash_mla --help
# Run with custom parameters
python -m mla_var3.kernel.cutile.mla.flash_mla \
--b=64 --s=1 --t=4096 --h=128 --k=512 --dtype=bfloat16
Key options:
| Option | Description |
|---|---|
| --no_autotune | Disable autotuning (use default tiling) |
| --retune | Force re-autotuning (ignore cache) |
| --tune_max_iters=N | Limit autotuning iterations |
| --prof_type | Profiling mode: annotation (default), event, ncu, nsys, or disabled |
| --out_dir | Override the profiling output directory |
| --check | Verify numerical correctness against reference |
| --bench_runs=N | Number of profiling runs |
| Kernel | Description | Use Case |
|---|---|---|
| flash_mla | Latent-space attention using absorption trick. Reads compressed KV cache (latent space $k$-dim). | Decode ($s=1$), optimal bandwidth utilization |
| flash_attention | Standard FlashAttention on decompressed KV. Minimum FLOPs. | Prefill ($s \geq 128$), compute-bound regime |
| flash_mla_split_kv | FlashMLA with KV splitting. Produces partial attention outputs, then reduces. | Decode when batch/heads are small, needs more parallelism |
| mla_var6_plus | Block-specialized kernel splitting KV into latent ($o$ old tokens) + decompressed ($n$ new tokens) paths. Tunes operational intensity. | Speculative decode ($1 < s < 128$), interpolation regime |
Tensor shapes for Multi-Latent Attention kernels (DeepSeek-V3-like defaults):
| Symbol | Description | Default |
|---|---|---|
| $b$ | Batch size | 64 |
| $h$ | Number of heads | 128 |
| $s$ | Query sequence length | varies |
| $t$ | KV context length | 4096 |
| $d$ | Head dimension | 128 |
| $p$ | Positional embedding dimension | 64 |
| $k$ | Latent (compressed) dimension | 512 |
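For intuition, assuming a bf16 cache that stores the $k$ latent plus $p$ positional dimensions per token, one decode step at the defaults above reads roughly $b \cdot t \cdot (k+p) \cdot 2\,\text{B} = 64 \cdot 4096 \cdot 576 \cdot 2\,\text{B} \approx 302\,\text{MB}$ of compressed KV cache.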
Important: Always lock GPU clocks for reproducible profiling results! Reset GPU clocks when done. See GPU Clock Control section for scripts and details.
| Mode | --prof_type | Best for | Default output root |
|---|---|---|---|
| Annotation | annotation | Profiling with Nsight Python (NCU under the hood), roofline summaries, and markdown reports | out/profiles/annotation/... |
| Event | event | Low-overhead timing and roofline iteration (carries Python infrastructure overhead in the timing!) | out/profiles/event/... |
| NCU | ncu | Detailed per-kernel Nsight Compute analysis (full sections and metrics, with source code annotations and NCU recommendations) | out/profiles/ncu/... |
| NSYS | nsys | Stream overlap, launch ordering, and concurrency analysis | out/profiles/nsys/... |
| Disabled | disabled | Plain execution without profiling | no profiling artifacts |
Direct kernel CLIs write under out/profiles/<prof_type>/<kernel>/<params>/<timestamp>/ by default. Established benchmarks (./tests/benchmark/*) typically write to benchmark-owned directories such as out/tests/bench_mla_var6_plus/... so repeated experiments can be organized explicitly.
Use case: Profiling with automatic roofline prediction and LLM-friendly reports. It explicitly contrasts the roofline metrics and the NCU metrics in a table. It only captures a hand-picked subset of NCU metrics/sections. It does not contain the Nsight Compute optimization suggestions/insights or any source code annotations; for those, use Mode 3 (NCU).
Default profiling with automatic roofline prediction using the nsight-python package.
Usage from CLI entry point:
source .venv/bin/activate
./scripts/lock-gpu-clock.sh # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=annotation
./scripts/reset-gpu-clock.sh # Reset GPU clocks when done
Usage from Python API (clocks must be locked before!):
from mla_var3.runtime.profiling.annotation import annotation_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA
plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = annotation_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")
Output in out/profiles/annotation/<kernel_name>/<params>/<timestamp>/:
- report.md - LLM-friendly and human-readable performance summary with metrics and roofline analysis
- params.json - Kernel plan parameters
- tiling/ - Active tiling configuration(s) used during the run
- compiled/ - Compiled kernel artifacts (including MLIR)
- nsight_raw_metrics.csv - Raw NCU metrics
- nsight_formatted_metrics.csv - Formatted Nsight metrics
- roofline_metrics.csv - Measured + theoretical roofline comparison

Use case: Fast timing and roofline iteration without Nsight replay overhead. This mode works for both single kernels and KernelPipeline, but it still measures end-to-end runtime through the Python/runtime layer rather than raw kernel-only duration.
Note that this mode carries the overhead of Python execution and is not suitable for measuring pure kernel execution time, but it can be useful for quick iteration during single-kernel exploration or when comparing different versions of a kernel with the same profiling overhead. For accurate kernel-level metrics, use Modes 1 (annotation) and 3 (NCU) instead.
Usage from CLI entry point:
source .venv/bin/activate
./scripts/lock-gpu-clock.sh # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v2 \
--b=32 --s=16 --t=4096 --prof_type=event
./scripts/reset-gpu-clock.sh # Reset GPU clocks when done
Usage from Python API (clocks must be locked before!):
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA
plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = plan.benchmark_kernel(device, prof_type=NsightProfType.EVENT, out_dir="path/to/reports")
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")
event mode writes to out/profiles/event/... and produces the common artifacts plus a lightweight report.md with an end-to-end roofline summary.
report.md includes:
Use case: Comprehensive profiling with detailed Nsight Compute metrics and source code annotations (i.e., ncu --set=full). It captures all NCU metrics and provides optimization suggestions and insights. It provides the most detailed level of profiling, but it does not include the roofline prediction and analysis that Mode 1 provides, so it is best used in conjunction with Mode 1 for a complete picture of kernel performance.
Usage from CLI entry point:
source .venv/bin/activate
./scripts/lock-gpu-clock.sh # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=ncu
./scripts/reset-gpu-clock.sh # Reset GPU clocks when done
Usage from Python API (clocks must be locked before!):
from mla_var3.runtime.profiling.ncu import ncu_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA
plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = ncu_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")
Output locations:
- report.md - Compact LLM-friendly report with key metrics and NCU insights (some suggestions can be CUDA-specific and not directly controllable in cuTile Python DSL)
- report-verbose.md - Verbose markdown report with detailed metrics and insights directly as reported by Nsight Compute (including cycles, achieved occupancy, and more)
- params.json - Kernel plan parameters
- tiling/ - Active tiling configuration(s) used during the run
- compiled/ - Compiled kernel artifacts (including MLIR)
- report.ncu-rep - Binary report that can be opened in the Nsight Compute UI
- <kernel-name>/ncu-details.csv - Per-kernel detailed Nsight Compute metrics export
- <kernel-name>/sass-metrics.csv - SASS source code with line-level metrics
- <kernel-name>/source.ptx - PTX source file extracted from the Nsight Compute report
- annotated-src/ - Annotated cuTile source code with line-level markers and comments for hotspots and optimization suggestions

Use case: Use NSYS when you need stream-level traces, overlap analysis, launch ordering, or concurrency inspection for KernelPipeline / ConcurrentKernels workloads. It is especially useful for assessing the real overlap achieved by concurrent-kernel designs (e.g., MLA-var6+ v4) and for identifying any unexpected serialization or bottlenecks in the pipeline execution.
Usage from CLI entry point:
source .venv/bin/activate
./scripts/lock-gpu-clock.sh # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=nsys
./scripts/reset-gpu-clock.sh # Reset GPU clocks when done
Usage from Python API (clocks must be locked before!):
from mla_var3.runtime.profiling.nsys import nsys_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA
plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = nsys_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")
Output locations:
- report.nsys-rep - Binary NSYS report that can be opened in the NSYS UI for trace analysis
- report.md - Markdown summary of key insights (kernel timing, overlap, stream concurrency)
- stats/ - Directory containing CSV statistics for detailed analysis
- chrome_trace.json - Chrome trace format for timeline visualization (can be loaded in the Perfetto UI)
- params.json - Kernel plan parameters
- tiling/ - Active tiling configuration(s) used during the run

The most useful NSYS artifacts are:

- stats/report_cuda_gpu_trace.csv for kernel start times, durations, stream IDs, grid, and block dimensions
- stats/report_nvtx_gpu_proj_trace.csv for projected NVTX range timing
- stats/report_cuda_api_sum.csv for launch and synchronization overhead

For reproducible profiling, lock GPU clocks to fixed frequencies:
| Script | Purpose |
|---|---|
| scripts/lock-gpu-clock.sh | Generic entry point that dispatches to a device-specific profile |
| scripts/rtx5090-lock-gpu-clock.sh | Lock clocks (2407 MHz compute, 14001 MHz memory for RTX 5090) |
| scripts/reset-gpu-clock.sh | Reset to default frequencies |
Device peak specs configured in src/mla_var3/conf/devices.json:
{
  "NVIDIA GeForce RTX 5090": {
    "peak_float16_float32_tflops": 209.5,
    "peak_bfloat16_float32_tflops": 209.5,
    "peak_dram_bw_GBps": 1792.0
  }
}
The values assume locked clocks and are used for roofline predictions, which are in some cases used to derive the best tiling configuration during autotuning (e.g., finding the best $n$ in mla_var6_plus).
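As a rough illustration of how these peak values feed a roofline prediction (a minimal sketch with placeholder FLOP/byte counts; in the runtime these come from _algorithmic_flops_bytes()):

```python
import json

flops = 2.75e12      # placeholder: total floating-point operations of the kernel
bytes_moved = 0.3e9  # placeholder: total DRAM traffic in bytes

with open("src/mla_var3/conf/devices.json") as f:
    specs = json.load(f)["NVIDIA GeForce RTX 5090"]

compute_us = flops / (specs["peak_bfloat16_float32_tflops"] * 1e12) * 1e6
memory_us = bytes_moved / (specs["peak_dram_bw_GBps"] * 1e9) * 1e6

# Roofline lower bound: the kernel can be no faster than the slower of its
# compute and memory requirements at peak rates.
print(f"compute: {compute_us:.1f} us, memory: {memory_us:.1f} us, "
      f"roofline prediction: {max(compute_us, memory_us):.1f} us")
```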
All the specs and details about the RTX 5090 are documented in docs/devices/nvidia-geforce-rtx-5090.md.
Reproducible benchmarks and tests live under tests/.
Benchmark scripts under tests/benchmark/ are curated experiment drivers, not a uniform family of general-purpose CLIs. Some accept command-line arguments, while others are fixed scripts intended to be copied or edited for a specific profiling pass.
Most relevant benchmarks:
tests/benchmark/bench_all_mla.py: benchmarks all MLA kernels and versions with a common set of parameters (b=32, s=16, t=4096, h=128, d=128, p=64, k=512, dtype=bfloat16) for easy comparison. It runs all kernels and versions sequentially with the same parameters and profiling mode and produces a comparison CSV.

| Document | Description |
|---|---|
| Documentation Index | Central entry point for stable generated docs and key project documentation |
| cuTile Python DSL Reference | Quick reference for cuTile operations (ct.load, ct.mma, etc.) |
| MLA Cost Analysis | Roofline analysis, kernel cost models, regime analysis |
| MLA-var6+ Design | Block specialization, kernel pipeline, implementation details |
| CUTLASS C++ Docs | Markdown-ready CUTLASS C++ and CuTe C++ docs assembled from authored guides plus Doxygen API export |
| NVIDIA GeForce RTX 5090 Specs | Detailed device specifications for static analysis and roofline modeling |
Build the CUTLASS C++ markdown tree with:
source .venv/bin/activate
./scripts/convert-cutlass-html-docs-to-markdown.sh
The builder preserves third_party/cutlass/media/docs/cpp as the narrative source, converts the remaining .rst pages with pandoc, and uses doxybook2 to export the API reference from third_party/cutlass/doxygen/xml as JSON before rendering it into Markdown locally.
The API export is pinned to scripts/doxybook2/config.json, and the stable API landing page is docs/cutlass-cpp/api/index.md.