Learning list

List of blogs / lectures / readings / papers that I thought were interesting.

C++

Inference Engine blogs

These are blogs I keep coming back to or am subscribed to, as well as some articles that I particularly liked.

  • vLLM's blog *
  • Hazy Research
  • Modular Blackwell GEMM P4 - important for understanding SOTA GEMM
  • AI21 Labs blog
  • Cursor MXFP8 MoE Kernels - great blog that goes into quantization-specific optimizations required for Blackwell. Good for Hopper users porting over to Blackwell.
  • vLLM Triton Backend Deep Dive - good overview of common optimizations used for attention kernels, covering both prefill and decode. Look at the PyTorch blog linked from the vLLM post as well.
  • Together Inference Engine Optimization Series
    • I'm not a big fan of vague blog posts about inference engine optimizations, since I think they mostly serve as product advertisement, but Together still describes interesting agentic coding workloads they come across.
  • Zero Redundancy DP
    • A nice new way of thinking about reducing the redundancy in DP, although its usefulness is mostly restricted to non-quantized, low-parallelism regimes.
  • Core Attention Disaggregation
    • From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the attention block from the quadratic core attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
  • Speculative KV Coding
    • An interesting blog that asks whether the KV cache itself can be losslessly compressed using a predictor model, similar to specdec. A first version of the predictor is a quantized copy of the target model, which produces nearly the same KV cache with relatively low noise compared to other options. This wasn't fully tested on agentic workloads, but it's interesting and the engineering lift is not that large.
    • Flow: on first generation, run both target and predictor to generate KV caches, then encode the target's KV cache into bits with an arithmetic encoder whose distribution is parameterized by the predictor's (mean, variance). On reuse, only the predictor's KVs are regenerated, and together with the stored bits they recover the target KV cache.
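
A rough sketch of why the Speculative KV Coding flow above can save space, not the blog's actual coder. It assumes the target KV values are already quantized to 8-bit symbols and that the predictor supplies a Gaussian (mean, std) per value; the names `gaussian_pmf` and `ideal_code_bits` are made up for illustration. An arithmetic coder approaches a length of -log2 p(symbol) bits per symbol, so the sketch just sums that quantity instead of implementing the coder.

```python
import math

def gaussian_pmf(mean, std, levels=256):
    """Discretized Gaussian over symbols 0..levels-1.

    A small floor keeps every symbol codable, which lossless coding requires.
    """
    weights = [math.exp(-0.5 * ((s - mean) / std) ** 2) + 1e-9 for s in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]

def ideal_code_bits(symbols, means, stds, levels=256):
    """Sum of -log2 p(symbol): the length an ideal arithmetic coder approaches."""
    bits = 0.0
    for s, m, sd in zip(symbols, means, stds):
        bits += -math.log2(gaussian_pmf(m, sd, levels)[s])
    return bits

# Hypothetical 8-bit target KV symbols and a predictor that is close but not exact.
kv_symbols = [120, 131, 90, 200]
pred_mean = [121.0, 130.0, 92.0, 198.0]
pred_std = [2.0, 2.0, 2.0, 2.0]

raw_bits = 8 * len(kv_symbols)
coded_bits = ideal_code_bits(kv_symbols, pred_mean, pred_std)
print(f"raw: {raw_bits} bits, coded with predictor: {coded_bits:.1f} bits")
```

Decoding stays lossless only because the decoder can rebuild exactly the same distributions from the predictor's KVs alone, which is why the predictor has to be rerun on reuse. The closer the predictor is to the target, the fewer bits need to be stored.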

PTX / NVIDIA notes

General Paper List:

SpecDec

Systems / Kernels:

  • Core Attention Disaggregation
    • From the creators of PD Disagg, another disagg method that separates the straightforward linear components of the attention block from the quadratic core attention. Disagg enables better work balancing across attention units, better scheduling, and less communication and pipelining overhead in larger parallelism schemes.
  • BLASST
    • Similar to the FA-4 idea of softmax thresholding, NVIDIA introduces sparsity by selecting only the row-tiles in FA that contribute meaningfully to the O matrix, decided by comparing each tile's local max to the running max. Enforcing this sparsity in both prefill and decode gives performance gains without sacrificing model capabilities.
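
A minimal sketch of the tile-skipping idea described for BLASST, written as single-query online-softmax attention in plain Python rather than a real kernel. The function name `attention_row`, the `skip_threshold` parameter, and the exact skip rule are assumptions for illustration; the paper's actual criterion and calibration may differ. In this sketch the scores are still computed for every tile, so the work being skipped is the exponentials and the P·V accumulation.

```python
import math

def attention_row(q, keys, values, tile=4, skip_threshold=None):
    """Attention output for one query using online softmax over key tiles.

    If skip_threshold is set, a tile whose local max score sits more than
    skip_threshold below the running max is skipped, since its softmax
    weights are at most exp(-skip_threshold) relative to the current max.
    Returns (output_vector, number_of_skipped_tiles).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    out = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), tile):
        tile_keys = keys[start:start + tile]
        tile_vals = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tile_keys]
        local_max = max(scores)
        if skip_threshold is not None and local_max < running_max - skip_threshold:
            skipped += 1
            continue
        new_max = max(running_max, local_max)
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        probs = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(probs)
        out = [o * correction + sum(p * v[i] for p, v in zip(probs, tile_vals))
               for i, o in enumerate(out)]
        running_max = new_max
    return [o / denom for o in out], skipped

# One tile of keys aligned with the query, one tile pointing away from it.
q = [1.0, 0.0]
keys = [[10.0, 0.0]] * 4 + [[-10.0, 0.0]] * 4
values = [[1.0]] * 4 + [[0.0]] * 4

dense, _ = attention_row(q, keys, values)
sparse, n_skipped = attention_row(q, keys, values, skip_threshold=10.0)
print(dense, sparse, n_skipped)
```

The threshold trades accuracy for skipped work: a larger value skips fewer tiles and stays closer to the dense result.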

Lectures:

Opinion Pieces

Still systems-adjacent, but more commentary on the current state of accelerators.

Stuff that still needs to be read

  • https://x.com/FireworksAI_HQ/status/2045366426819768794
  • https://accelerated-computing.academy/fall25/lectures/
  • https://www.youtube.com/watch?v=VhjUM_M71Wo

Kernels to learn / implement:

  • DSA
  • MLA
  • Mega MoE kernels from DeepGEMM