Learning list
List of blogs / lectures / readings / papers that I thought were interesting.
C++
- Static Polymorphism
- Functors (Function Objects)
- Clarifying References
- Acquire / Release Semantics and Memory Ordering - The C++ blog posts give good intuition for what acquire-release semantics are, even though they go a little deeper than CUDA needs.
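A minimal sketch of the acquire / release pattern those posts describe, using only the standard library. The helper name `publish_and_read` is made up for illustration: one thread writes plain data and then does a release store on a flag, and the other thread spins on an acquire load before reading the data.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from any of the linked posts): publishes `value`
// from a producer thread and reads it back from a consumer thread.
int publish_and_read(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that guards the payload
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        // 3. spin until the release store becomes visible
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // 4. the acquire load makes the write from step 1 visible
    });

    producer.join();
    consumer.join();
    assert(observed == value);
    return observed;
}
```

The release store and the acquire load that reads it form a synchronizes-with edge, so the non-atomic write to `payload` happens-before the read in the consumer. If both orderings were `memory_order_relaxed`, that read would be a data race.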
Inference Engine blogs
These are blogs I keep coming back to or am subscribed to, as well as some articles that I particularly liked.
- LMSys
- FlashInfer - Better Sampling
- Perplexity's MoE Kernels
- Perplexity's Disagg Prefill-Decode - a lot of good fundamentals here on RDMA + prefill-decode
- VLLM's blog *
- Hazy Research
- ThunderKittens 2.0 - A lot of good tidbits and nuances about PTX
- Modular Blackwell GEMM P4 - important for understanding SOTA GEMM
- AI21 Labs blog
- Cursor MXFP8 MoE Kernels - great blog that goes into quantization-specific optimizations required for Blackwell. Good for Hopper users porting over to Blackwell.
- VLLM Triton Backend Deep Dive - good overview of common optimizations used for Attention kernels, in both prefill and decode. Also read the torch blog linked in the vLLM post.
- Together Inference Engine Optimization Series
- I'm not a big fan of vague blog posts about inference engine optimizations, since I think they mostly serve as product advertisement, but Together still describes interesting agentic coding workloads they come across.
- Zero Redundancy DP
- A nice new way of thinking about reducing the redundancy in DP, although its helpfulness is largely restricted to non-quantized, low-parallelism regimes.
- Core Attention Disaggregation
- From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the Attention block from the quadratic attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
- Speculative KV Coding
- An interesting blog that asks whether the KV cache itself can be losslessly compressed using a predictor model, similar to specdec. A first version of the predictor model is a quantized target model, which outputs the same KV cache and has relatively low noise compared to other options. This wasn't tested fully on agentic workloads, but it's interesting and the engineering lift is not that large.
- Flow: on first generation, run both the target and the predictor to generate KV caches, then encode the target's KV cache into bits with an arithmetic encoder parameterized by (mean, variance). On reuse, regenerate only the predictor model's KVs and combine them with the stored bits to recover the KV cache.
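To get a feel for why a good predictor makes the stored bits small in the Speculative KV Coding flow, here is a rough sketch of the ideal code length an arithmetic coder approaches, namely -log2 of the probability the model assigns to each value. Everything here is an assumption for illustration and not the blog's implementation: the Gaussian model, the fixed standard deviation, the uniform grid of width `step`, and the names `gaussian_cdf` and `ideal_bits`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CDF of a Gaussian with the given mean and standard deviation.
double gaussian_cdf(double x, double mean, double stddev) {
    return 0.5 * (1.0 + std::erf((x - mean) / (stddev * std::sqrt(2.0))));
}

// Ideal code length in bits for storing `target` on a grid of width `step`
// when each value is modeled as Gaussian around `predicted_mean[i]`.
// An arithmetic coder gets close to this total; it is not implemented here.
double ideal_bits(const std::vector<double>& target,
                  const std::vector<double>& predicted_mean,
                  double stddev, double step) {
    assert(target.size() == predicted_mean.size());
    assert(stddev > 0.0 && step > 0.0);
    double bits = 0.0;
    for (std::size_t i = 0; i < target.size(); ++i) {
        // Snap to the grid; this grid point is what the decoder would recover.
        double q = std::round(target[i] / step) * step;
        // Probability mass the model puts on the grid cell around q.
        double p = gaussian_cdf(q + step / 2, predicted_mean[i], stddev) -
                   gaussian_cdf(q - step / 2, predicted_mean[i], stddev);
        // Clamp so a far-tail value costs many bits instead of infinity.
        bits += -std::log2(std::max(p, 1e-300));
    }
    return bits;
}
```

Calling `ideal_bits` twice on the same values, once with means that track the target closely and a small `stddev`, and once with a zero mean and a wide `stddev`, shows the first costing far fewer bits. That is the whole bet: the closer the predictor's KVs are to the target's, the less there is to store.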
PTX / NVIDIA notes
- MMA Layouts
- tcgen05 for dummies
- Colfax Intl Blogs
- Cutlass SM120 GEMM Guide - read this if you're going to be looking at anything related to SM120 GEMMs.
- CUDA Kernel Debugging tools: I think if you haven't read either of these yet, drop everything else you're doing and spend a day learning both tools. Claude can only get you so far in debugging.
- Register Cache Warps - cool technique showing that data normally staged in shared memory can be cached in the registers of a warp's threads and exchanged through shuffle operations
- FA4 - Continues the pattern of authors accompanying their papers with a great visual guide
- The scheduling section is interesting
- Qwen FlashQLA
- Colfax CLC Medium Dive
- CLC is useful for dynamic load balancing, especially in grouped GEMM / ragged MoE scenarios where different tiles have vastly different workloads in terms of FLOPs.
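The load-balancing point about CLC is easy to see with a toy CPU-side simulation. This is not the CLC API and not how the hardware does it, just the scheduling effect: with ragged tile costs, pinning tiles to workers up front leaves one worker with all the heavy tiles, while letting whichever worker is free claim the next tile evens things out. The function names and the cost model are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Static schedule: tile i is pinned to worker i % workers before anything runs.
// Returns the finish time of the slowest worker.
long static_makespan(const std::vector<long>& tile_cost, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (std::size_t i = 0; i < tile_cost.size(); ++i) {
        busy[i % workers] += tile_cost[i];
    }
    return *std::max_element(busy.begin(), busy.end());
}

// Dynamic schedule: tiles are handed out in order, each one to the worker
// that becomes free first (the one with the least accumulated work).
long dynamic_makespan(const std::vector<long>& tile_cost, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (long cost : tile_cost) {
        *std::min_element(busy.begin(), busy.end()) += cost;
    }
    return *std::max_element(busy.begin(), busy.end());
}
```

With tile costs `{8, 1, 1, 1, 8, 1, 1, 1}` on 4 workers, the static schedule puts both heavy tiles on worker 0 and finishes at 16, while the dynamic one finishes at 9.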
General Paper List:
SpecDec
- Attention Sinks
- Spec Dec Analysis Paper
- Great paper analyzing the current most commonly used Specdec strategies
Systems / Kernels:
- Core Attention Disaggregation
- From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the Attention block from the quadratic attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
- BLASST
- Similar to the FA-4 idea of softmax thresholding, NVIDIA introduces sparsity by keeping only the row-tiles in FA that contribute meaningfully to the O matrix, decided by comparing each tile's local max to the running max. Enforced sparsity in both prefill and decode gives performance gains without sacrificing model capabilities.
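A rough sketch of the local-max versus running-max idea, for a single query row on the CPU. The skip rule used here (drop a tile when its local max sits more than `skip_margin` below the running max) is my assumption about the general shape of the criterion; the paper's exact threshold and kernel structure may differ, and `tiled_softmax_average` is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Softmax-weighted average of `values` under `scores`, computed tile by tile
// with a running max (online softmax). A tile is skipped when its local max is
// more than `skip_margin` below the running max, since every weight in it is
// then at most exp(-skip_margin) relative to the largest weight seen so far.
// Pass +infinity as `skip_margin` to disable skipping.
double tiled_softmax_average(const std::vector<double>& scores,
                             const std::vector<double>& values,
                             std::size_t tile, double skip_margin,
                             std::size_t* skipped_tiles) {
    assert(!scores.empty() && scores.size() == values.size() && tile > 0);
    double running_max = -std::numeric_limits<double>::infinity();
    double denom = 0.0, numer = 0.0;
    std::size_t skipped = 0;
    for (std::size_t start = 0; start < scores.size(); start += tile) {
        std::size_t end = std::min(start + tile, scores.size());
        double local_max = scores[start];
        for (std::size_t i = start; i < end; ++i) local_max = std::max(local_max, scores[i]);
        if (running_max - local_max > skip_margin) {  // negligible tile: skip it
            ++skipped;
            continue;
        }
        double new_max = std::max(running_max, local_max);
        double rescale = std::exp(running_max - new_max);  // 0 on the first tile
        denom *= rescale;
        numer *= rescale;
        for (std::size_t i = start; i < end; ++i) {
            double w = std::exp(scores[i] - new_max);
            denom += w;
            numer += w * values[i];
        }
        running_max = new_max;
    }
    if (skipped_tiles != nullptr) *skipped_tiles = skipped;
    return numer / denom;
}
```

Running it on scores where the second tile sits about 30 below the first, once with an infinite margin and once with a margin of 10, skips the second tile and changes the result only negligibly, which is the intuition behind getting speedups without hurting model quality.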
Lectures:
- NVIDIA Profiling
- GPU-Mode NCCL + NVSHMEM, Jeff Hammond
- Colfax Cutlass Layout Math
- I think this is interesting for understanding the intuitions behind Cutlass, but IMO the Cris Cecka CuTe Layout paper is better because you get to understand the CuTe creator's decisions around why he built CuTe the way he did.
- Onur Mutlu Lectures, ETH Zurich Digital Design & Comp Arch
- For hardware, pairs well with a design/arch textbook
- Napkin Math
- Great talk (by the founder of Turbopuffer!) on how to reason about napkin math and system bounds extremely quickly.
- You need less precision than you think - prioritize breadth of solutions instead of depth in solution calculation
- Reiner Pope, Dwarkesh 2
- Good video for a quick overview of chip design. Definitely go deeper on all the topics covered here, time permitting.
Opinion Pieces
Still systems adjacent, but more of a commentary on the current state of accelerators.
- Patrick Toulme: Portability is a Myth
- How to deal with the DSL Explosion
- Barbarians at the Gate
- Meta commentary on how the role of systems researchers is evolving with ADRS (AI-Driven Research for Systems), plus a general framework
Stuff that still needs to be read
- https://x.com/FireworksAI_HQ/status/2045366426819768794
- https://accelerated-computing.academy/fall25/lectures/
- https://www.youtube.com/watch?v=VhjUM_M71Wo
Kernels to learn / implement:
- DSA
- MLA
- Mega MoE Kernels from DeepGEMM