
GPU memory virtualization for LLMs

LLM inference is shifting from compute-bound to memory-bound. Treating GPU memory like an OS does RAM is the path forward.

For a growing share of LLM inference workloads, the bottleneck is no longer compute (FLOPs) but memory capacity and bandwidth.

KV caches and activations outgrow GPU RAM, so the token history behaves like a new kind of working set: data that has to stay close to compute but no longer fits there.

I'm exploring a GPU memory virtualization layer that pages data across NVLink and NVMe with cache coherence, much like OS-level virtual memory, but for LLMs.

To simplify, here is the whole brain dump:

  • Early models: mostly FLOPs-bound (compute-heavy).

  • Modern large LLMs (esp. long context): increasingly memory-bound, since handling token histories (KV cache) overwhelms GPU memory & bandwidth.

Result: GPUs sit idle waiting for data, instead of doing math.
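
To put rough numbers on this, here is a back-of-the-envelope sketch of how KV cache size scales with context length. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions for a 7B-class model, not measurements from any specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor of 2: one tensor for keys, one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed (illustrative) dimensions: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_32k = kv_cache_bytes(32, 32, 128, seq_len=32_768)

print(per_token / 2**20, "MiB per token")     # 0.5 MiB per token
print(at_32k / 2**30, "GiB at 32k tokens")    # 16.0 GiB at 32k tokens
```

Under these assumptions, a single 32k-token sequence needs about 16 GiB for the KV cache alone, before counting weights or activations, and every decode step has to read that cache back.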

So, when we talk about GPU memory virtualization for LLMs, it means:

Treating GPU memory the way an OS treats RAM: paging data (KV cache, activations) in and out efficiently so the GPU never stalls.
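
The paging idea can be sketched as a toy two-tier store. This is a minimal illustration under stated assumptions, not TokenVM's implementation: the `KVPager` class, its method names, and the LRU eviction policy are all hypothetical, and plain Python dictionaries stand in for GPU and host memory.

```python
from collections import OrderedDict


class KVPager:
    """Toy LRU pager: a small 'GPU' tier backed by a larger 'host' tier."""

    def __init__(self, gpu_capacity_blocks):
        self.capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()  # block_id -> data, least recently used first
        self.host = {}            # blocks spilled out of the GPU tier
        self.faults = 0           # number of page-ins from the host tier

    def put(self, block_id, data):
        """Store a newly produced KV block in the GPU tier."""
        self.host.pop(block_id, None)
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id):
        """Return a block, paging it in from the host tier on a miss."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)  # mark as most recently used
            return self.gpu[block_id]
        self.faults += 1
        data = self.host.pop(block_id)      # KeyError if never stored
        self.gpu[block_id] = data
        self._evict_if_needed()
        return data

    def _evict_if_needed(self):
        # Spill least recently used blocks until the GPU tier fits.
        while len(self.gpu) > self.capacity:
            victim, data = self.gpu.popitem(last=False)
            self.host[victim] = data


pager = KVPager(gpu_capacity_blocks=2)
for i in range(4):
    pager.put(i, f"kv-block-{i}")

pager.get(0)  # miss: block 0 is paged in, block 2 is spilled
pager.get(3)  # hit: block 3 is already resident
print(sorted(pager.gpu), sorted(pager.host), pager.faults)
```

A real layer would also need asynchronous copies and prefetching so that page-ins overlap with compute, and an eviction policy matched to attention's access pattern; plain LRU is used here only because it is the simplest policy to show.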


LLMs choke on long prompts because the KV cache grows with every token while GPU memory stays fixed and comparatively small.

I built TokenVM, a proof-of-concept runtime that treats the KV cache like virtual memory.

Repo: github.com/Siddhant-K-code/tokenvm
