Learning list

0000-01-01 - 7 min

List of blogs / lectures / readings / papers that I thought were interesting.

C++

Static Polymorphism
Functors (Function Objects)
Clarifying References
Acquire / Release Semantics and Memory Ordering - The CPP blogs give a good intuition of what acquire-release semantics are, even though they're a little overkill for CUDA.

Inference Engine blogs

These are blogs I keep coming back to or am subscribed to, as well as some articles that I particularly liked.

LMSys
- Int4 QAT RL
FlashInfer - Better Sampling
Perplexity's MoE Kernels
Perplexity's Disagg Prefill-Decode - lots of good fundamentals here on RDMA and prefill-decode

VLLM's blog *
Hazy Research
- ThunderKittens 2.0 - Lots of good tidbits and nuances about PTX
Modular Blackwell GEMM P4 - important for understanding SOTA GEMM
AI21 Labs blog
- Decode Cache Bugs in VLLM for Mamba
Cursor MXFP8 MoE Kernels - great blog that goes into quantization-specific optimizations required for Blackwell. Good for Hopper users porting over to Blackwell.
VLLM Triton Backend Deep Dive - good overview of common optimizations used for Attention kernels, in both prefill and decode. Look at the torch blog linked from the VLLM blog as well.
Together Inference Engine Optimization Series
- I'm not a particular fan of vague blog posts about inference engine optimizations, since I think they mostly serve as product advertisement, but Together still describes interesting agentic coding workloads they come across.
Zero Redundancy DP
- A nice new way of thinking about reducing the redundancy in DP, although its helpfulness is kind of restricted to non-quantized, low-parallelism regimes.
Core Attention Disaggregation
- From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the Attention block from the quadratic attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
Speculative KV Coding
- An interesting blog that asks whether the KV cache itself can be losslessly compressed using a predictor model, similar to specdec. A first version of the predictor model is a quantized target model, which outputs the same KV cache up to relatively low noise compared to other options. This wasn't tested fully on agentic workloads, but it's interesting and the engineering lift is not that large.
- Flow: On first generation, use both target and predictor to generate the KV cache, then encode the target KVs into bits with an arithmetic encoder driven by the predictor's (mean, variance). On reuse, we only regenerate the predictor model's KVs and combine them with the stored bits to recover the KV cache.
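
To get a feel for why the flow above can save space, here is a minimal sketch that only estimates the ideal arithmetic-code length; it is not an actual encoder and not the blog's implementation. It assumes the target KV entries are already quantized to integers and that the predictor supplies a per-entry Gaussian (mean, variance). The function names, the synthetic data, and the noise level are all hypothetical.

```python
import math
import random

def gaussian_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def symbol_bits(symbol, mean, std):
    """Ideal code length in bits of an integer symbol under a discretized Gaussian."""
    p = gaussian_cdf(symbol + 0.5, mean, std) - gaussian_cdf(symbol - 0.5, mean, std)
    return -math.log2(max(p, 1e-12))  # floor avoids log(0) for extreme outliers

def estimate_bits_per_value(target_kv, predictor_kv, std):
    """Average bits per entry an ideal arithmetic coder would spend."""
    total = sum(symbol_bits(t, m, std) for t, m in zip(target_kv, predictor_kv))
    return total / len(target_kv)

random.seed(0)
# Hypothetical data: integer-quantized target KV entries, and a predictor
# (e.g. a quantized copy of the target) that is off by small Gaussian noise.
noise_std = 2.0
target_kv = [random.randint(-100, 100) for _ in range(4096)]
predictor_kv = [t + random.gauss(0.0, noise_std) for t in target_kv]

bits_per_value = estimate_bits_per_value(target_kv, predictor_kv, noise_std)
raw_bits_per_value = 8.0  # storing each entry as a plain int8
print(f"predicted coding: ~{bits_per_value:.2f} bits/value vs {raw_bits_per_value:.0f} raw")
```

The tighter the predictor's variance, the fewer bits each entry costs, and because the decoder can regenerate the same (mean, variance) from the predictor, the stored bits are enough to recover the target KVs exactly.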

PTX / NVIDIA notes

MMA Layouts
tcgen05 for dummies
Colfax Intl Blogs
- FP8 Training
- Persistent Kernels
Cutlass SM120 GEMM Guide - read this if you're going to be looking at anything related to SM120 GEMMs.
CUDA Kernel Debugging tools: I think if you haven't read either of these yet, drop everything else you're doing and spend a day learning both tools. Claude can only get you so far in debugging.
- CUDA-GDB
- Compute Sanitizer
Register Cache Warps - cool technique showing that we can cache shared memory into intra-warp registers, which allows for optimizations in shuffle operations
FA4 - Continues the pattern of authors accompanying their papers with a great visual guide
- The scheduling section is interesting
Qwen FlashQLA
Colfax CLC Medium Dive
- CLC is useful for dynamic load balancing, especially in grouped GEMM / ragged MoE scenarios where different tiles have vastly different workloads in terms of FLOPs.
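
The load-balancing point in the CLC note can be illustrated with a toy scheduling model. This is a CPU-side sketch only: it does not use or imitate the CLC hardware API, and the tile costs and worker count are made-up numbers. It compares a fixed round-robin tile assignment with workers that claim the next tile whenever they become free.

```python
import heapq

def static_makespan(tile_costs, num_workers):
    """Round-robin assignment fixed up front: tile i goes to worker i % num_workers."""
    loads = [0.0] * num_workers
    for i, cost in enumerate(tile_costs):
        loads[i % num_workers] += cost
    return max(loads)

def dynamic_makespan(tile_costs, num_workers):
    """Each worker claims the next unprocessed tile as soon as it becomes free."""
    free_at = [0.0] * num_workers  # min-heap of times at which workers go idle
    heapq.heapify(free_at)
    for cost in tile_costs:
        start = heapq.heappop(free_at)          # earliest-free worker takes the tile
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Hypothetical ragged workload: a few heavy tiles among many light ones.
tile_costs = [8, 1, 1, 1, 8, 1, 1, 1, 8, 1, 1, 1, 8, 1, 1, 1]
num_workers = 4

static_time = static_makespan(tile_costs, num_workers)
dynamic_time = dynamic_makespan(tile_costs, num_workers)
print(f"static: {static_time}, dynamic: {dynamic_time}")
```

With this particular input the round-robin schedule happens to put every heavy tile on the same worker, which is the kind of imbalance dynamic claiming avoids.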

General Paper List:

SpecDec

Attention Sinks
- Attention Sinks in SpecDec
Spec Dec Analysis Paper
- Great paper analyzing the current most commonly used Specdec strategies

Systems / Kernels:

Core Attention Disaggregation
- From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the Attention block from the quadratic attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
BLASST
- Similar to the FA4 idea of softmax thresholding, NVIDIA introduces sparsity by selecting the row-tiles in FA that contribute meaningfully to the O matrix, comparing local maxes to running maxes. Enforced sparsity in both prefill and decode gives performance gains without sacrificing model capabilities.
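
A rough sketch of the tile-skipping idea described above, for a single query row on the CPU. The skip rule used here (drop a tile when exp(local_max - running_max) falls below a threshold) is my reading of the summary, not the paper's exact criterion, and the threshold, tile size, and synthetic logits are hypothetical. Values are scalars (d_v = 1) to keep it short.

```python
import math
import random

def tiled_attention_row(scores, values, tile_size, skip_threshold=None):
    """Online-softmax attention for one query row, optionally skipping tiles.

    scores: precomputed q.k logits, values: one scalar value per key.
    Returns (output, number_of_skipped_tiles).
    """
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    skipped = 0
    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        local_max = max(s_tile)
        # Assumed rule: a tile whose largest weight is tiny relative to the
        # running max contributes almost nothing to the output, so skip it.
        if (skip_threshold is not None and running_max > -math.inf
                and local_max - running_max < math.log(skip_threshold)):
            skipped += 1
            continue
        new_max = max(running_max, local_max)
        scale = math.exp(running_max - new_max)  # rescale earlier partial sums
        weights = [math.exp(s - new_max) for s in s_tile]
        denom = denom * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, v_tile))
        running_max = new_max
    return acc / denom, skipped

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(256)]
values = [random.gauss(0.0, 1.0) for _ in range(256)]
scores[3] = 12.0  # one dominant logit in the first tile

dense_out, _ = tiled_attention_row(scores, values, tile_size=16)
sparse_out, skipped_tiles = tiled_attention_row(scores, values, tile_size=16,
                                                skip_threshold=1e-3)
print(f"skipped {skipped_tiles} of 16 tiles, abs error {abs(dense_out - sparse_out):.2e}")
```

The skipped tiles never get exponentiated or multiplied against V, which is where the savings would come from in a real kernel; the error stays small because their softmax mass is bounded by the threshold.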

Lectures:

NVIDIA Profiling
GPU-Mode NCCL + NVSHMEM, Jeff Hammond
Colfax Cutlass Layout Math
- I think this is interesting for understanding the intuitions behind Cutlass, but IMO the Cris Cecka CuTe Layout paper is better because you get to understand the CuTe creator's decisions around why he built CuTe the way he did.
Onur Mutlu Lectures, ETH Zurich Digital Design & Comp Arch
- For hardware, pairs well with a design/arch textbook
Napkin Math
- Great talk (by the founder of Turbopuffer!) on how to reason about napkin math and system bounds extremely quickly.
- You need less precision than you think - prioritize breadth of solutions instead of depth in solution calculation
Reiner Pope, Dwarkesh 2
- Good video for a quick overview of chip design; definitely go deeper on all the topics covered here, time permitting.
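
In the spirit of the Napkin Math entry above (low precision, many options), here is a small sketch of the kind of bound I mean; it is my own example, not taken from the talk. It assumes batch-1 decode is purely memory-bandwidth bound and that every weight is read once per token. The parameter count and bandwidth are hypothetical round numbers, and KV cache reads are ignored.

```python
def decode_tokens_per_second(params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound for batch-1 decode when every weight is read once per token."""
    return bandwidth_bytes_per_s / (params * bytes_per_param)

PARAMS = 70e9      # hypothetical dense model size
BANDWIDTH = 3e12   # hypothetical ~3 TB/s of memory bandwidth

# Compare several options quickly instead of modelling one precisely.
options = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}
estimates = {
    name: decode_tokens_per_second(PARAMS, bytes_per_param, BANDWIDTH)
    for name, bytes_per_param in options.items()
}
for name, tokens_per_s in estimates.items():
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s upper bound")
```

One significant figure is enough here: the useful output is the ordering and rough ratio between options, not the exact tokens/s.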

Opinion Pieces

Still systems adjacent, but more of a commentary on the current state of accelerators.

Patrick Toulme: Portability is a Myth
- How to deal with the DSL Explosion
Barbarians at the Gate
- Meta commentary on how the role of system researchers is evolving with ADRS (AI Driven Research for Systems), and a general framework

Stuff that still needs to be read

https://x.com/FireworksAI_HQ/status/2045366426819768794
https://accelerated-computing.academy/fall25/lectures/
https://www.youtube.com/watch?v=VhjUM_M71Wo

Kernels to learn / implement:

DSA
MLA
Mega MoE Kernels from DeepGEMM