design-cute-dsl-kernel

Installation
CLI
npx skills add https://github.com/pepperu96/hyper-mla --skill design-cute-dsl-kernel

Install this skill via the CLI and start using the SKILL.md workflow in your workspace.

Last updated on 4/24/2026

HyperMLA

HyperMLA implements optimized Multi-Latent Attention (MLA) kernels for DeepSeek-V3/R1 and Kimi-V2 inference. It provides kernels that interpolate between FlashMLA (optimal for decode, $s=1$) and FlashAttention (optimal for prefill, $s \geq 128$), addressing the performance gap in speculative decoding regimes ($1 < s < 128$).

Table of Contents

Quick Start

Installation

./setup_cutile.sh
source .venv/bin/activate

CLI entry points

Make sure to activate the virtual environment with source .venv/bin/activate before running the kernels.
Make sure to lock GPU clocks for reproducible profiling results! See the GPU Clock Control section for scripts and details.
CRITICAL: Always run one profiling session at a time on the same device; GPU memory is limited!

We use a versioned kernel package structure (see Kernel Naming and Versioning) that keeps multiple versions of the same kernel design intact for easy comparison. The main entry points come in two forms:

  • All kernels and versions:
source .venv/bin/activate && ./scripts/lock-gpu-clock.sh
python3 -m mla_var3.kernel <kernel-name> [<version>] [kernel arguments]
  • Kernel specific:
source .venv/bin/activate && ./scripts/lock-gpu-clock.sh
python3 -m mla_var3.kernel.cutile.mla.flash_attention [<version>] --b=32 --s=512 --t=4096

The framework is language-aware. cuTile Python DSL (cutile-dsl) is the default implementation path, and kernel packages are organized under src/mla_var3/kernel/<language>/.... The currently supported language keys are cutile-dsl and cute-dsl.

Example: Profiling a Kernel

source .venv/bin/activate
./scripts/lock-gpu-clock.sh  # Lock GPU clocks for reproducibility

# FlashMLA base version (decode, s=1)
python -m mla_var3.kernel.cutile.mla.flash_mla --b=32 --s=1 --t=4096

# FlashAttention base version (prefill, s>=128)
python -m mla_var3.kernel.cutile.mla.flash_attention --b=32 --s=512 --t=4096

# MLA-var6+ base version (speculative decoding, 1 < s < 128)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus --b=32 --s=16 --t=4096

# MLA-var6+ v4 (speculative decoding, 1 < s < 128)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096

# Shortcut example (recommended)
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096

./scripts/reset-gpu-clock.sh  # Reset GPU clocks when done

This automatically runs the kernel, performs autotuning, benchmarks it in annotation profiling mode (see Profiling section), and produces various profiling artifacts, including a performance report in out/profiles/annotation/<kernel_name>/<params>/<timestamp>/report.md.
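
If you need the newest report programmatically (e.g., to feed it to an LLM), a minimal sketch that assumes the default output layout above (the helper name is ours, not part of the repository):

from pathlib import Path

# Minimal sketch: locate the most recent annotation report for a kernel.
# Assumes the default out/profiles/annotation/<kernel_name>/<params>/<timestamp>/
# layout described above; adjust the root if you override --out_dir.
def latest_report(kernel_name: str, root: str = "out/profiles/annotation") -> Path:
    reports = sorted(Path(root, kernel_name).glob("*/*/report.md"),
                     key=lambda p: p.stat().st_mtime)
    if not reports:
        raise FileNotFoundError(f"no reports under {root}/{kernel_name}")
    return reports[-1]

print(latest_report("mla_var6_plus"))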

Use --help for more options (e.g., disable autotuning, choose event / ncu / nsys, enable correctness checking, or override --out_dir).
Check Profiling section for more profiling modalities and detailed instructions.

Directory Structure

├── src/mla_var3/
│   ├── kernel/
│   │   ├── cutile/mla/              # cuTile kernels (current default implementation path)
│   │   └── cute_python/mla/         # CuTe Python DSL kernels
│   ├── runtime/
│   │   ├── plan.py                  # KernelPlan CLI + profiling dispatch (shared abstraction)
│   │   ├── composite.py             # KernelPipeline and ConcurrentKernels runtime
│   │   ├── ct_kernel.py             # cuTile-specific runtime infrastructure (CtKernel)
│   │   └── paths.py                 # Profiling output path helpers
│   ├── conf/devices.json            # GPU peak performance specs
├── tests/benchmark/                 # Benchmark scripts
│   ├── bench_all_mla.py             # Bench b=32, s=16, t=4096 across all kernels and versions
│   ├── bench_mla.py                 # Multi-backend sweep
│   ├── bench_mla_var6_plus.py       # MLA-var6+ detailed benchmark
│   ├── bench_mla_var6_plus_v3.py    # Version-specific MLA-var6+ benchmark
│   └── bench_mla_var6_plus_v4.py    # Version-specific MLA-var6+ benchmark
├── docs/                            # Documentation
│   ├── index.md                     # Central documentation entry point
│   ├── python-cuda-tile.md          # cuTile Python DSL quick reference
│   ├── cost-analysis.md             # Roofline and cost models
│   ├── kernels/mla-var6-plus.md     # MLA-var6+ design docs
│   ├── knowledge/README.md          # Shared vs language-specific optimization knowledge
│   ├── results/                     # Main documented experiment results and analysis
│   ├── devices/nvidia-geforce-rtx-5090.md     # Device specs for RTX 5090 (used for static analysis)
│   └── cutile-dsl/index.md          # Stable cuTile Python DSL docs index
├── scripts/                         # GPU clock control scripts
│   ├── lock-gpu-clock.sh            # Generic entry point that dispatches to a device-specific script
│   ├── rtx5090-lock-gpu-clock.sh    # RTX 5090 clock profile
│   └── reset-gpu-clock.sh           # Reset to default
├── third_party/cutile-python/       # cuTile Python DSL (NVIDIA)
├── out/profiles/                    # Local profiling outputs (annotation/event/nsys/ncu)
├── out/tests/                       # Test outputs
└── setup_cutile.sh                  # Installation script

Dependencies

  • Python >= 3.11.14
  • CUDA Toolkit > 13.1 (required for cuTile Python DSL)
  • NVIDIA GPU with recent compute capability; support depends on the chosen implementation language and kernel family
  • cuTile Python DSL - NVIDIA's Python DSL for GPU kernels (docs)

Today, the repository runtime is most mature for cuTile Python DSL (cutile-dsl). CuTe Python DSL (cute-dsl) is the other supported workflow target.

Capturing detailed statistics for cuTile Python kernels requires NVIDIA Driver >= r580.126.09 (Linux) or r582.16 (Windows).
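
As a quick sanity check before capturing detailed statistics, something along these lines can compare the installed driver against that minimum (a sketch; it only assumes nvidia-smi is on PATH):

import subprocess

# Sketch: query the installed driver version via nvidia-smi and compare it
# against the minimum required for detailed cuTile kernel statistics (Linux).
def driver_version() -> tuple[int, ...]:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(x) for x in out.strip().splitlines()[0].split("."))

MIN_LINUX = (580, 126, 9)  # r580.126.09 from the requirement above
if driver_version() < MIN_LINUX:
    print("Warning: driver too old for detailed cuTile kernel statistics")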

Kernel Development

Kernel development is now language-aware.

  • Use cuTile Python DSL (cutile-dsl) by default when the optimization is still expressible through block-level structure, tiling, CTA remapping, or compiler hints.
  • Switch to CuTe Python DSL (cute-dsl) when more explicit control is needed than cuTile exposes but a Python-authored workflow is still desirable.

Language-specific design skills live under .claude/skills/design-<language>-kernel/. The current generic runtime abstractions are centered on KernelPlan, while CtKernel remains the existing cuTile-specific execution wrapper.

To streamline the development of new kernels, we provide templates for both single-kernel implementations and multi-kernel pipelines.

Template for Single-Kernel Implementation

Copy src/mla_var3/kernel/cutile/mla/flash_attention/ as a template to src/mla_var3/kernel/cutile/mla/<new_kernel_name>/.

Required components:

  1. Kernel function decorated with @ct.kernel:

    @ct.kernel
    def my_kernel(input_tensors, output_tensors, constants, TILE_PARAMS):
        # Kernel logic using cuTile operations
        ...
    
  2. Tiling dataclass extending Tiling:

    @dataclass
    class MyKernelTiling(Tiling):
        Bm: int = 32
        Bn: int = 64
        num_ctas: int | None = None
        occupancy: int | None = None

        def validate(self, pd: "CtMyKernel"):
            return self.Bm <= pd.s and self.Bn <= pd.t
    
  3. KernelPlan subclass with required methods:

    @dataclass
    class CtMyKernel(KernelPlan):
        b: int = 64
        s: int = 1
        # ... problem dimensions
        tiling: MyKernelTiling = field(default_factory=MyKernelTiling)

        def prepare_inputs(self, device) -> tuple:
            # Create and return input tensors
            ...

        def reference_fn(self, *inputs) -> tuple:
            # Reference implementation for correctness check
            ...

        def _autotune_configs(self) -> list[MyKernelTiling]:
            # Return list of tiling configs to search
            ...

        def _algorithmic_flops_bytes(self, *inputs, tiling) -> tuple[int, int]:
            # Return (flops, bytes) for roofline prediction
            ...

        def plan(self, *inputs) -> CtKernel:
            # Return CtKernel / KernelPipeline / ConcurrentKernels runtime object
            ...

        def plan_empty(self, peak_tflops, peak_gbps) -> CtKernel:
            # For roofline-only prediction (no tensors needed)
            ...
    
  4. CLI entry point:

    if __name__ == "__main__":
        kernel_plan = CtMyKernel()
        kernel_plan.benchmark_kernel_argparse()
    

    The CLI entry point directly supports the annotation and NCU profiling modes, which simply launch the entry point under NCU.

Template for Multi-Kernel Pipeline

Copy the src/mla_var3/kernel/cutile/mla/mla_var6_plus/mla_var6_plus/ directory structure (i.e., the base version of mla_var6_plus):

  • mla_var6_plus.py - Pipeline entry point returning KernelPipeline
  • attention_block_specialized.py - First kernel stage (CtKernel)
  • decompress_combine.py - Second kernel stage (CtKernel)

Each stage follows the single-kernel template. The pipeline combines them:

def plan(self, *inputs) -> KernelPipeline:
    attention = attention_plan.plan(...)
    decompress = decompress_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[attention, decompress])

For partially overlapped stages, the runtime also supports ConcurrentKernels inside KernelPipeline. This is used by newer MLA-var6+ variants to launch independent stage groups on separate CUDA streams while keeping the same outer pipeline interface.

Runtime Structure and What To Implement

The central object you implement is the KernelPlan subclass. It is the public entry point for a kernel version: it owns the problem sizes and dtype, exposes CLI arguments from its dataclass fields, prepares tensors, dispatches autotuning and profiling, optionally checks correctness, and builds the runtime object that is actually executed.
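
Conceptually, a run driven by a KernelPlan proceeds roughly as follows. This is an illustrative sketch of the responsibilities just listed, not the actual runtime code; pick_best and assert_close are hypothetical helpers:

# Illustrative control flow of a KernelPlan-driven run; method names match
# the templates above, but pick_best and assert_close are hypothetical.
def run(plan, device, check=False, autotune=True):
    inputs = plan.prepare_inputs(device)            # allocate problem tensors
    if autotune:
        candidates = plan._autotune_configs()       # legal tiling candidates
        plan.tiling = pick_best(plan, candidates)   # hypothetical search helper
    kernel = plan.plan(*inputs)                     # CtKernel / KernelPipeline / ...
    outputs = kernel()                              # launch on the GPU
    if check:
        assert_close(outputs, plan.reference_fn(*inputs))  # hypothetical check
    return outputs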

We give an explicit example for cuTile Python DSL kernels, but the runtime can support any language.

cuTile Python DSL Kernels

For a single kernel, the implementation is split into three layers:

  1. cuTile kernel function: the @ct.kernel function contains the GPU program itself.
  2. Tiling dataclass: extends Tiling and stores launch/tile parameters plus any validation logic for legal configs.
  3. KernelPlan subclass: connects the kernel to the runtime.

The main KernelPlan methods have distinct roles:

  • prepare_inputs(device): allocate and initialize the input tensors based on the fields of the dataclass.
  • reference_fn(*inputs): return the reference outputs used by --check.
  • _autotune_configs(): return the candidate tilings searched by the runtime autotuner.
  • _algorithmic_flops_bytes(tiling): report analytical FLOPs and bytes for roofline metrics.
  • plan(*inputs): build the executable runtime object for real tensors. For a single kernel this usually returns a CtKernel.
  • plan_empty(peak_tflops, peak_gbps): build the same runtime structure without allocating real tensors, so roofline prediction and autotune-by-model can run without launching the full kernel path (see the roofline sketch after this list).
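
For intuition, the roofline prediction built from _algorithmic_flops_bytes amounts to the following (a sketch of the model; the runtime's actual implementation may differ):

# Roofline sketch: predicted time is the maximum of compute time and memory
# time, using the peak specs from conf/devices.json (locked clocks assumed).
def roofline_time_us(flops: int, nbytes: int,
                     peak_tflops: float, peak_gbps: float) -> float:
    compute_us = flops / (peak_tflops * 1e12) * 1e6
    memory_us = nbytes / (peak_gbps * 1e9) * 1e6
    return max(compute_us, memory_us)

# e.g. roofline_time_us(*plan._algorithmic_flops_bytes(*inputs, tiling=t),
#                       peak_tflops=209.5, peak_gbps=1792.0)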

For a single cuTile stage, plan(...) typically creates a CtKernel by providing (see the sketch after this list):

  • the compiled @ct.kernel function,
  • a grid_fn(tiling) that computes the launch grid,
  • an args_fn(tiling) that returns the kernel arguments,
  • the selected tiling, tensors, and analytical FLOPs/bytes metadata.
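
A sketch of such a plan(...) is shown below. The CtKernel keyword names and the ceil_div helper are assumptions for illustration, not the runtime's real signature:

def plan(self, q, kv) -> CtKernel:
    # Sketch only: the keyword names below are assumptions based on the
    # ingredients listed above; ceil_div is a hypothetical helper.
    def grid_fn(t):   # launch grid derived from the tiling
        return (ceil_div(self.s, t.Bm), ceil_div(self.t, t.Bn), self.b)

    def args_fn(t):   # positional arguments handed to the @ct.kernel function
        return (q, kv, t.Bm, t.Bn)

    return CtKernel(
        kernel_fn=my_kernel,
        grid_fn=grid_fn,
        args_fn=args_fn,
        tiling=self.tiling,
        flops_bytes=self._algorithmic_flops_bytes(q, kv, tiling=self.tiling),
    )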

For multi-stage kernels, keep each stage as its own KernelPlan subclass and use a top-level KernelPlan subclass only as an orchestrator. Its plan(...) should instantiate the stage plans, call each stage's plan(...), and connect the resulting runtime objects.

Use KernelPipeline when stages must run in order and later stages depend on earlier outputs:

def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(stage1.output_tensors[0], ...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])

Use ConcurrentKernels when independent stages should launch on separate CUDA streams and overlap:

For instance, consider two stages a and b that can run concurrently, followed by a final combine stage that depends on both:

def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="my_overlap_group",
        concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_joint_tiling_fn,
    )
    combine = combine_plan.plan(a.output_tensors[0], b.output_tensors[0], ...)
    return KernelPipeline(_name="my_pipeline", stages=[concurrent, combine])

ConcurrentKernels is a stage container, not a replacement for the top-level plan. It groups already-built runtime objects that can run together, optionally with a validate_joint_tiling_fn to reject combinations that oversubscribe SMs or violate other shared-resource constraints. In practice, the outer KernelPlan still defines the user-facing kernel version, while KernelPipeline and ConcurrentKernels express how the internal stages execute.
In other words, CtKernel, KernelPipeline, and ConcurrentKernels are all runtime objects that can be returned by any KernelPlan subclass, and they can be nested arbitrarily (e.g., a KernelPipeline can contain ConcurrentKernels, which in turn contain CtKernel stages).

Kernel Naming and Versioning

The most important prerequisite for easy and traceable kernel development is following strict naming and versioning conventions. This lets the profiling utilities work correctly and makes it possible to track and document each source file together with its profiling results and insights.

Kernel structure and versioning

We use a nesting structure for the kernel packages, all inside the kernel sub-package (a small path-construction sketch follows the list):

  • The first level corresponds to which programming language or DSL the kernel is implemented in (e.g., kernel.cutile).
  • The second level corresponds to which layer the actual kernel is needed for (e.g., kernel.cutile.mla for the MLA kernels used in MLA layers).
  • The third level corresponds to the kernel design (e.g., kernel.cutile.mla.flash_attention for the FlashAttention kernel, kernel.cutile.mla.mla_var6_plus for the MLA-var6+ kernel design).
  • The fourth level corresponds to the kernel version (e.g., kernel.cutile.mla.mla_var6_plus.mla_var6_plus_v2 for version 2 of the MLA-var6+ kernel design). The first version of a kernel design is named after the kernel design itself without a version suffix, and is commonly referred to as the "base version" (e.g., kernel.cutile.mla.mla_var6_plus.mla_var6_plus is the base version of the MLA-var6+ kernel design).
  • Inside the kernel version sub-package, there must be a kernel module named after the full kernel name (e.g., mla_var6_plus_v2.py inside kernel.cutile.mla.mla_var6_plus.mla_var6_plus_v2), which must contain the KernelPlan subclass used as the main entry point for benchmarking and profiling.
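
To make the convention concrete, a small illustrative helper (ours, not part of the repository) that assembles the versioned module path:

# Illustrative only: assemble the module path implied by the convention
# kernel.<language>.<layer>.<design>.<design>_<version>.<design>_<version>
def kernel_module(language: str, layer: str, design: str, version: str = "") -> str:
    name = f"{design}_{version}" if version else design  # base version: no suffix
    return f"mla_var3.kernel.{language}.{layer}.{design}.{name}.{name}"

expected = "mla_var3.kernel.cutile.mla.mla_var6_plus.mla_var6_plus_v2.mla_var6_plus_v2"
assert kernel_module("cutile", "mla", "mla_var6_plus", "v2") == expected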

For CLI usage, versioned kernels are normally selected through the package entrypoint plus an optional version argument:

# Base version (also aliased internally as v0)
python -m mla_var3.kernel.cutile.mla.mla_var6_plus --b=32 --s=16 --t=4096

# Explicit version selection
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v2 --b=32 --s=16 --t=4096

Kernel naming requirements

  • The name of the kernel function decorated with @ct.kernel must match the name of the kernel module in which it is contained (e.g., flash_attention.py should contain a kernel function named flash_attention).

Writing a new kernel

  1. Choose a template (single-kernel or multi-kernel pipeline).
  2. Copy the template directory and rename everything properly.
  3. Implement kernel logic, tiling, and plan methods.
  4. For a new version of an existing kernel design, prefer the cloning workflow below instead of manual copy/rename.

Optimizing a kernel iteratively

The main workflow consists of:

  1. Profile an existing version of the kernel and identify bottlenecks and optimization opportunities (e.g., mla_var6_plus version 2).
  2. Document it in the development log (e.g., docs/kernels/mla-var6-plus.md, in the Development log section) with the insights derived from profiling and the motivation for the optimization.
  3. Create the next version from the previous one with the cloning script:
source .venv/bin/activate
python ./scripts/clone-kernel.py mla_var6_plus_v2 v5

The cloning workflow copies the package and updates versioned module names, @ct.kernel function names, KernelPlan and Tiling class names, intra-package imports, and quoted forward references such as validate(self, pd: "CtDecompressCombineV5") (see the renaming sketch after this list).

  4. Modify the cloned kernel code to implement the optimization.
  5. Verify correctness before profiling again:
source .venv/bin/activate
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v5 --prof_type=disabled --check
  6. Repeat.
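
The renaming performed by the cloning script in step 3 is roughly of the following shape. This is a simplified sketch of the idea, not the script's actual implementation:

import re

# Simplified sketch: rewrite snake_case module/function names and the matching
# CamelCase class names, which also covers quoted forward references such as
# 'validate(self, pd: "CtMlaVar6PlusV2")'. The real script handles more cases
# (e.g., per-stage class names like CtDecompressCombineV2).
def bump_version(source: str, old: str, new: str) -> str:
    def camel(s: str) -> str:
        return "".join(part.capitalize() for part in s.split("_"))
    source = re.sub(rf"\b{re.escape(old)}\b", new, source)  # mla_var6_plus_v2 -> ..._v5
    return source.replace(camel(old), camel(new))           # MlaVar6PlusV2 -> MlaVar6PlusV5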

Configuration via CLI

The automatic entry points generate CLI arguments from the KernelPlan dataclass fields and from the tiling dataclass fields.
All fields of the plan class and the tiling class are exposed as CLI arguments (e.g., --b, --s, --Bm, --prof_type, etc.).
This makes it easy to run the kernels with different parameters and even to specify tiling parameters directly (--no_autotune must be set to use the tiling parameters directly, without autotuning).
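
The generation works along these lines (a minimal sketch of the idea, not the project's actual implementation):

import argparse
import dataclasses

# Minimal sketch of dataclass-driven CLI generation: every field of the plan
# becomes a --<name> flag with the field's type and default. The real runtime
# also recurses into the tiling dataclass and adds runtime flags.
def build_parser(plan) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(plan):
        if f.name == "tiling":
            continue  # handled separately in the real runtime
        default = getattr(plan, f.name)
        parser.add_argument(f"--{f.name}", type=type(default), default=default)
    return parser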

# Show all options
python -m mla_var3.kernel.cutile.mla.flash_mla --help

# Run with custom parameters
python -m mla_var3.kernel.cutile.mla.flash_mla \
    --b=64 --s=1 --t=4096 --h=128 --k=512 --dtype=bfloat16

Key options:

| Option | Description |
| --- | --- |
| --no_autotune | Disable autotuning (use the default tiling) |
| --retune | Force re-autotuning (ignore the cache) |
| --tune_max_iters=N | Limit autotuning iterations |
| --prof_type | Profiling mode: annotation (default), event, ncu, nsys, or disabled |
| --out_dir | Override the profiling output directory |
| --check | Verify numerical correctness against the reference |
| --bench_runs=N | Number of profiling runs |

Kernel Inventory

| Kernel | Description | Use Case |
| --- | --- | --- |
| flash_mla | Latent-space attention using the absorption trick. Reads the compressed KV cache (latent $k$-dim). | Decode ($s=1$), optimal bandwidth utilization |
| flash_attention | Standard FlashAttention on decompressed KV. Minimum FLOPs. | Prefill ($s \geq 128$), compute-bound regime |
| flash_mla_split_kv | FlashMLA with KV splitting. Produces partial attention outputs, then reduces. | Decode when batch/heads are small and more parallelism is needed |
| mla_var6_plus | Block-specialized kernel splitting KV into latent ($o$ old tokens) + decompressed ($n$ new tokens) paths. Tunes operational intensity. | Speculative decode ($1 < s < 128$), interpolation regime |

Tensor shapes for Multi-Latent Attention kernels (DeepSeek-V3-like defaults; a shape sketch follows the table):

| Symbol | Description | Default |
| --- | --- | --- |
| $b$ | Batch size | 64 |
| $h$ | Number of heads | 128 |
| $s$ | Query sequence length | varies |
| $t$ | KV context length | 4096 |
| $d$ | Head dimension | 128 |
| $p$ | Positional embedding dimension | 64 |
| $k$ | Latent (compressed) dimension | 512 |
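
Under these conventions, the MLA inputs have roughly the following shapes. This is a sketch for orientation only; the exact layouts each kernel expects may differ:

import torch

# Shape sketch for the symbols above (DeepSeek-V3-like defaults); the exact
# tensor layouts each kernel expects may differ.
b, h, s, t, d, p, k = 64, 128, 16, 4096, 128, 64, 512

q_nope = torch.randn(b, s, h, d, dtype=torch.bfloat16, device="cuda")  # content queries
q_rope = torch.randn(b, s, h, p, dtype=torch.bfloat16, device="cuda")  # positional queries
kv_latent = torch.randn(b, t, k, dtype=torch.bfloat16, device="cuda")  # compressed KV cache
k_rope = torch.randn(b, t, p, dtype=torch.bfloat16, device="cuda")     # decoupled RoPE keys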

Profiling

Important: Always lock GPU clocks for reproducible profiling results! Reset GPU clocks when done. See GPU Clock Control section for scripts and details.

Profiling Modes At A Glance

| Mode | --prof_type | Best for | Default output root |
| --- | --- | --- | --- |
| Annotation | annotation | Profiling with Nsight Python (NCU under the hood), roofline summaries, and markdown reports | out/profiles/annotation/... |
| Event | event | Low-overhead timing and roofline iteration (timings include Python infrastructure overhead!) | out/profiles/event/... |
| NCU | ncu | Detailed per-kernel Nsight Compute analysis (full sections and metrics, with source code annotations and NCU recommendations) | out/profiles/ncu/... |
| NSYS | nsys | Stream overlap, launch ordering, and concurrency analysis | out/profiles/nsys/... |
| Disabled | disabled | Plain execution without profiling | no profiling artifacts |

Direct kernel CLIs write under out/profiles/<prof_type>/<kernel>/<params>/<timestamp>/ by default. Established benchmarks (./tests/benchmark/*) typically write to benchmark-owned directories such as out/tests/bench_mla_var6_plus/... so repeated experiments can be organized explicitly.

Mode 1: Annotation (Nsight Python)

Use case: Profiling with automatic roofline prediction and LLM-friendly reports. It explicitly contrasts the roofline metrics with the NCU metrics in a table. It captures only a hand-picked subset of NCU metrics/sections, and it does not include Nsight Compute's optimization suggestions or any source code annotations; for those, use Mode 3.

Default profiling with automatic roofline prediction using the nsight-python package.

Usage from CLI entry point:

source .venv/bin/activate
./scripts/lock-gpu-clock.sh  # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=annotation
./scripts/reset-gpu-clock.sh  # Reset GPU clocks when done

Usage from Python API (clocks must be locked before!):

from mla_var3.runtime.profiling.annotation import annotation_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA

plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = annotation_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")

Output in out/profiles/annotation/<kernel_name>/<params>/<timestamp>/:

  • report.md - LLM-friendly and human-readable performance summary with metrics and roofline analysis
  • params.json - Kernel plan parameters
  • tiling/ - Active tiling configuration(s) used during the run
  • compiled/ - Compiled kernel artifacts (including MLIR)
  • nsight_raw_metrics.csv - Raw NCU metrics
  • nsight_formatted_metrics.csv - Formatted Nsight metrics
  • roofline_metrics.csv - Measured + theoretical roofline comparison

Mode 2: Event

Use case: Fast timing and roofline iteration without Nsight replay overhead. This mode works for both single kernels and KernelPipeline, but it measures end-to-end runtime through the Python/runtime layer rather than raw kernel-only duration.
Note that this mode carries the overhead of Python execution, so it is not suitable for measuring pure kernel execution time; it is, however, useful for quick iteration on a single kernel or for comparing versions of a kernel under the same profiling overhead. For accurate kernel-level metrics, use Modes 1 and 3 (NCU) instead.

Usage from CLI entry point:

source .venv/bin/activate
./scripts/lock-gpu-clock.sh  # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v2 \
    --b=32 --s=16 --t=4096 --prof_type=event
./scripts/reset-gpu-clock.sh  # Reset GPU clocks when done

Usage from Python API (clocks must be locked before!):

from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA

plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = plan.benchmark_kernel(device, prof_type=NsightProfType.EVENT, out_dir="path/to/reports")
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")

event mode writes to out/profiles/event/... and produces the common artifacts plus a lightweight report.md with an end-to-end roofline summary.

report.md includes:

  • Roofline metrics: Achieved TFLOPs/s, GB/s, arithmetic intensity

Mode 3: Detailed NCU

Use case: Comprehensive profiling with detailed Nsight Compute metrics and source code annotations (i.e., ncu --set=full). It captures all NCU metrics and provides optimization suggestions and insights. This is the most detailed level of profiling, but it does not include the roofline prediction and analysis of Mode 1, so it is best used together with Mode 1 for a complete picture of kernel performance.

Usage from CLI entry point:

source .venv/bin/activate
./scripts/lock-gpu-clock.sh  # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=ncu
./scripts/reset-gpu-clock.sh  # Reset GPU clocks when done

Usage from Python API (clocks must be locked before!):

from mla_var3.runtime.profiling.ncu import ncu_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA

plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = ncu_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")

Output locations:

  • report.md - compact LLM-friendly report with key metrics and NCU insights (some suggestions can be CUDA-specific and not directly controllable in cuTile Python DSL)
  • report-verbose.md - verbose markdown report with detailed metrics and insights directly as reported by Nsight Compute (including cycles, achieved occupancy, and more)
  • params.json - Kernel plan parameters
  • tiling/ - Active tiling configuration(s) used during the run
  • compiled/ - Compiled kernel artifacts (including MLIR)
  • report.ncu-rep - Binary report that can be opened in Nsight Compute UI
  • source code annotations and per-kernel source exports:
    • <kernel-name>/ncu-details.csv - Per-kernel detailed Nsight Compute metrics export
    • <kernel-name>/sass-metrics.csv - SASS source code with line-level metrics
    • <kernel-name>/source.ptx - PTX source file extracted from the Nsight Compute report
    • annotated-src/ - Annotated cuTile source code with line-level markers and comments for hotspots and optimization suggestions

Mode 4: NSYS

Use case: Use NSYS when you need stream-level traces, overlap analysis, launch ordering, or concurrency inspection for KernelPipeline / ConcurrentKernels workloads. It is especially useful for assessing the real overlap achieved by concurrent-kernel designs (e.g., MLA-var6+ v4) and for identifying unexpected serialization or bottlenecks in pipeline execution.

Usage from CLI entry point:

source .venv/bin/activate
./scripts/lock-gpu-clock.sh  # Lock GPU clocks for reproducibility
python -m mla_var3.kernel.cutile.mla.flash_mla --b=64 --s=1 --t=4096 --prof_type=nsys
./scripts/reset-gpu-clock.sh  # Reset GPU clocks when done

Usage from Python API (clocks must be locked before!):

from mla_var3.runtime.profiling.nsys import nsys_profile_plan
from mla_var3.kernel.cutile.mla.flash_mla.flash_mla.flash_mla import CtFlashMLA

plan = CtFlashMLA(b=64, s=1, t=4096, h=128, k=512, dtype="bfloat16")
prof_result = nsys_profile_plan(plan, out_dir="path/to/reports", version=None, label=None, bench_runs=None)
total_time_us = prof_result.get_total_time_us()
print(f"Total kernel execution time: {total_time_us:.2f} us")

Output locations:

  • report.nsys-rep - Binary NSYS report that can be opened with NSYS UI for trace analysis
  • report.md - Markdown summary of key insights (kernel timing, overlap, stream concurrency)
  • stats/ - Directory containing CSV statistics for detailed analysis
  • chrome_trace.json - Chrome trace format for timeline visualization (can be loaded in Perfetto UI)

The most useful NSYS artifacts are the following (an overlap-analysis sketch follows the list):

  • stats/report_cuda_gpu_trace.csv for kernel start times, durations, stream IDs, grid, and block dimensions
  • stats/report_nvtx_gpu_proj_trace.csv for projected NVTX range timing
  • stats/report_cuda_api_sum.csv for launch and synchronization overhead
  • params.json - Kernel plan parameters
  • tiling/ - Active tiling configuration(s) used during the run
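
For example, the real overlap between two launches can be estimated straight from the GPU trace (a sketch; the column names are assumptions that can vary across nsys versions):

import pandas as pd

# Sketch: estimate the overlap between two kernel launches from nsys's
# cuda_gpu_trace export. Column names ("Start (ns)", "Duration (ns)") may
# vary across nsys versions; adjust as needed.
df = pd.read_csv("stats/report_cuda_gpu_trace.csv")
df["End (ns)"] = df["Start (ns)"] + df["Duration (ns)"]

k1, k2 = df.iloc[0], df.iloc[1]  # pick the two launches of interest
overlap_ns = max(
    0.0,
    min(k1["End (ns)"], k2["End (ns)"]) - max(k1["Start (ns)"], k2["Start (ns)"]),
)
print(f"overlap: {overlap_ns / 1e3:.1f} us")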

GPU Clock Control

For reproducible profiling, lock GPU clocks to fixed frequencies:

| Script | Purpose |
| --- | --- |
| scripts/lock-gpu-clock.sh | Generic entry point that dispatches to a device-specific profile |
| scripts/rtx5090-lock-gpu-clock.sh | Lock clocks (2407 MHz compute, 14001 MHz memory on the RTX 5090) |
| scripts/reset-gpu-clock.sh | Reset to default frequencies |

Device peak specs configured in src/mla_var3/conf/devices.json:

{
    "NVIDIA GeForce RTX 5090": {
        "peak_float16_float32_tflops": 209.5,
        "peak_bfloat16_float32_tflops": 209.5,
        "peak_dram_bw_GBps": 1792.0
    }
}

The values assume locked clocks and are used for roofline predictions, which in some cases drive the choice of tiling configuration during autotuning (e.g., finding the best $n$ in mla_var6_plus).
All the specs and details about the RTX 5090 are documented in docs/devices/nvidia-geforce-rtx-5090.md.
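
For example, the ridge point that separates the memory-bound and compute-bound regimes follows directly from these specs (a small sketch):

import json

# Derive the roofline ridge point (FLOP/byte) from the device spec file above.
# Kernels below this arithmetic intensity are memory-bound; above it, compute-bound.
with open("src/mla_var3/conf/devices.json") as f:
    specs = json.load(f)["NVIDIA GeForce RTX 5090"]

ridge = specs["peak_bfloat16_float32_tflops"] * 1e12 / (specs["peak_dram_bw_GBps"] * 1e9)
print(f"ridge point: {ridge:.1f} FLOP/byte")  # ~116.9 for the values above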

Tests and Benchmarks

Reproducible benchmarks and tests live under tests/.

Benchmark scripts under tests/benchmark/ are curated experiment drivers, not a uniform family of general-purpose CLIs. Some accept command-line arguments, while others are fixed scripts intended to be copied or edited for a specific profiling pass.

Most relevant benchmarks:

  • tests/benchmark/bench_all_mla.py: benchmarks all MLA kernels and versions with a common set of parameters (b=32, s=16, t=4096, h=128, d=128, p=64, k=512, dtype=bfloat16) for easy comparison. It runs all kernels and versions sequentially with the same parameters and profiling mode and produces a comparison CSV.

Detailed Documentation

| Document | Description |
| --- | --- |
| Documentation Index | Central entry point for stable generated docs and key project documentation |
| cuTile Python DSL Reference | Quick reference for cuTile operations (ct.load, ct.mma, etc.) |
| MLA Cost Analysis | Roofline analysis, kernel cost models, regime analysis |
| MLA-var6+ Design | Block specialization, kernel pipeline, implementation details |
| CUTLASS C++ Docs | Markdown-ready CUTLASS C++ and CuTe C++ docs assembled from authored guides plus the Doxygen API export |
| NVIDIA GeForce RTX 5090 Specs | Detailed device specifications for static analysis and roofline modeling |

Build the CUTLASS C++ markdown tree with:

source .venv/bin/activate
./scripts/convert-cutlass-html-docs-to-markdown.sh

The builder preserves third_party/cutlass/media/docs/cpp as the narrative source, converts the remaining .rst pages with pandoc, and uses doxybook2 to export the API reference from third_party/cutlass/doxygen/xml as JSON before rendering it into Markdown locally.
The API export is pinned to scripts/doxybook2/config.json, and the stable API landing page is docs/cutlass-cpp/api/index.md.