HuggingFace Details PyTorch Profiling for Fused MLP Layers

HuggingFace

June 11, 2026

Original source: huggingface.co

The Profiling Deep-Dive on nn.Linear Fusion

HuggingFace just published Part 2 of its PyTorch profiling series, and it's a doozy. The post walks through transforming a standard stack of nn.Linear layers into a fused MLP, then shows exactly how profiling tools reveal the gains. The key claim: a fused implementation can cut GPU kernel launch overhead by 60–70%, translating to a 20–30% wall-time speedup on common transformer-sized layers. The tutorial uses PyTorch's built-in profiler, torch.profiler, and the Kineto traces it exports. Specific numbers: for a 1024×4096×1024 MLP, the fused version reduces total kernel time from 2.1 ms to 1.5 ms on an A100, a drop of roughly 29%. That's not academic; it's directly actionable for anyone optimizing inference or training.
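As a rough illustration of the starting point, here is a minimal sketch of profiling the unfused baseline with torch.profiler. It is not the tutorial's own script: the GELU activation, the batch size of 64, and the CPU fallback are assumptions made so the example runs anywhere.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Unfused baseline: two nn.Linear layers with an activation in between.
# The 1024 -> 4096 -> 1024 shape is from the article; GELU is an assumption.
mlp = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
mlp = mlp.to(device).eval()
x = torch.randn(64, 1024, device=device)  # batch size 64 is an assumption

# Always record CPU-side ops; add GPU kernels when a GPU is present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    for _ in range(3):  # warm-up so one-time setup costs stay out of the trace
        mlp(x)
    with profile(activities=activities) as prof:
        mlp(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued kernels before the trace closes

# Names of the operators that were recorded during the profiled forward pass.
op_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

# Optionally dump a trace for a timeline viewer:
# prof.export_chrome_trace("mlp_trace.json")
```

In the printed table, each linear layer and the activation show up as separate operators, which is the per-op structure that a fused kernel aims to collapse.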

Why Fused MLPs Matter Now

Fusion has been a well-known optimization in GPU computing for years; think CUDA kernels that combine elementwise ops. But applying it to PyTorch's nn.Linear layers has been tricky because, in eager mode, each linear transform and each activation runs as its own kernel launch. The cost? Launch overhead, plus latency from writing intermediate results to global memory and reading them back. HuggingFace's guide picks up where standard PyTorch docs leave off: it shows how to hand-roll a fused Triton kernel and profile it against the naive version. Since PyTorch 2.0, torch.compile has made some fusion automatic, but many production setups still rely on eager mode. This tutorial fills a gap for those who need explicit control, and the profiles show that the payoff is real.
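To make the memory-traffic cost concrete, here is a back-of-envelope sketch of how many bytes the hidden activation moves through global memory under different degrees of fusion. The row count and 2-byte element size are assumptions, and the "fully fused" case is an idealization in which the hidden tile never leaves on-chip storage.

```python
# Global-memory traffic for the hidden activation of a 1024x4096x1024 MLP.
BATCH_TOKENS = 2048   # assumption: rows processed per forward pass
HIDDEN = 4096         # hidden width from the article's example shape
BYTES_PER_ELEM = 2    # assumption: fp16/bf16 activations

# Bytes moved by one full pass (one read or one write) over the hidden tensor.
hidden_bytes = BATCH_TOKENS * HIDDEN * BYTES_PER_ELEM

passes = {
    # Three kernels: linear1 writes, activation reads and writes, linear2 reads.
    "unfused": 4,
    # Activation folded into linear1: linear1 writes, linear2 reads.
    "activation_fused": 2,
    # Idealized: the hidden tile stays on-chip, no global-memory round trip.
    "fully_fused": 0,
}

traffic = {name: n * hidden_bytes for name, n in passes.items()}
for name, nbytes in traffic.items():
    print(f"{name}: {nbytes / 2**20:.0f} MiB of hidden-activation traffic")
```

Under these assumptions each pass over the hidden tensor is 16 MiB, so the unfused stack moves 64 MiB that a fused kernel can partly or wholly avoid.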

What This Means for Model Optimization

Honestly, the most interesting part isn't the 30% speedup; it's the methodology. HuggingFace shows how to pinpoint exactly where cycles are wasted: launch overhead, memory bandwidth saturation, and suboptimal occupancy. For teams running large language models, this kind of granularity is gold. If you're paying $100/hr for A100 clusters, shaving 20% off an MLP step cuts costs in proportion to the share of runtime the MLP accounts for. The tutorial also stresses that fusion doesn't always win; for small batch sizes the naive version can be faster because it carries less register pressure. That nuance is rare in marketing materials. This isn't a silver bullet; it's a surgical tool. And it's open-source, so anyone can reproduce the benchmarks.
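The cost claim is easier to judge with the arithmetic written out. The sketch below assumes cost scales linearly with runtime for a fixed workload; the $100/hr and 20% figures come from the text above, while the 50% MLP share of total runtime is purely an assumption you should replace with your own profile.

```python
HOURLY_COST = 100.0      # $/hr, figure quoted in the article
MLP_STEP_SAVING = 0.20   # 20% faster MLP step, figure quoted in the article
MLP_SHARE = 0.5          # assumption: fraction of total runtime spent in MLP layers

def end_to_end_saving(mlp_share: float, step_saving: float) -> float:
    """Fraction of total runtime removed when only the MLP portion gets faster."""
    return mlp_share * step_saving

saving = end_to_end_saving(MLP_SHARE, MLP_STEP_SAVING)
dollars_per_hour = HOURLY_COST * saving

print(f"End-to-end time saved: {saving:.0%}")
print(f"Equivalent cost saved: ${dollars_per_hour:.2f} per cluster-hour")
```

The end-to-end saving is always smaller than the per-step saving unless the MLP is the whole workload, which is why the profile of the full model matters as much as the kernel benchmark.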

What's Missing and Where to Look Next

The guide focuses on a single MLP shape and assumes a specific GPU architecture (Ampere). Will these profiles hold for Hopper or Blackwell? Maybe, but kernel characteristics change. HuggingFace doesn't cover auto-tuning or integration with torch.compile's inductor backend, which already does fusion for some patterns. Another open question: how does the fused MLP perform under mixed precision training? Memory bandwidth is often the bottleneck there, and fusion might not help if the data can't stay in cache. Finally, the tutorial doesn't address on-chip resource usage: fused kernels can increase register pressure, which can lower occupancy and narrow the range of batch sizes where fusion pays off. For production engineers, these are the details that matter more than the headline speedup. Keep an eye on HuggingFace's subsequent posts; they might dive into sharding and distributed fusion next.

Frequently Asked Questions

What is a fused MLP and why is it faster?

A fused MLP combines multiple nn.Linear layers into a single kernel execution, reducing the overhead of launching separate kernels and of moving intermediate data to and from global memory. Instead of running each matrix multiplication and activation as a separate kernel, a fused version runs them in one combined operation, cutting latency by up to 30% in the profiled examples.

Do I need special hardware to use fused MLPs?

Fused MLPs benefit most from modern GPUs with large register files and high memory bandwidth, like NVIDIA A100 or H100. They can run on older GPUs but the speedup will be smaller. The technique relies on custom GPU kernels (Triton in this tutorial), so you'll need a recent PyTorch version (2.0+) and a GPU that supports Triton or custom CUDA kernels.

Is fusion always beneficial?

No. For very small batch sizes or shallow networks, the extra cost of a fused kernel, such as higher register pressure, can outweigh the launch savings. HuggingFace's profiling shows that below a certain threshold, the naive nn.Linear stack is actually faster. The benefit grows with layer depth and batch size, so you have to profile your specific use case.
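One way to find that threshold for your own setup is a batch-size sweep. The following is a minimal sketch using torch.utils.benchmark, not the tutorial's benchmark code: the batch sizes and GELU activation are assumptions, and `candidate` is a hypothetical placeholder that simply aliases the baseline until you swap in a fused implementation.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

# Unfused baseline with the article's 1024 -> 4096 -> 1024 shape (GELU is an assumption).
baseline = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
baseline = baseline.to(device).eval().requires_grad_(False)

# Hypothetical placeholder: replace with your fused MLP callable.
candidate = baseline

def median_ms(fn, x):
    """Median time per call in milliseconds, measured by torch.utils.benchmark."""
    timer = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    return timer.blocked_autorange(min_run_time=0.1).median * 1e3

results = {}
for batch in (1, 16, 256):  # assumption: pick sizes that match your workload
    x = torch.randn(batch, 1024, device=device)
    results[batch] = {
        "baseline_ms": median_ms(baseline, x),
        "candidate_ms": median_ms(candidate, x),
    }
    print(batch, results[batch])
```

With a real fused kernel in place of `candidate`, the smallest batch size at which `candidate_ms` drops below `baseline_ms` is the crossover point for that GPU and shape.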

How does this compare to torch.compile's automatic fusion?

torch.compile can automatically fuse operations in some cases, especially with the inductor backend. But it doesn't always achieve the same degree of fusion as a hand-crafted kernel, and its results can vary from model to model. HuggingFace's manual approach gives you full control and can be more predictable, at the cost of development effort.

Where can I find the code and benchmarks from this tutorial?

The full tutorial is published on HuggingFace's blog with linked Jupyter notebooks. It includes the Triton kernel code, profiling scripts, and raw trace files. You can reproduce the results on any Ampere GPU. The series also has a Part 1 that covers basic profiling setup, so start there if you need a primer.