GPU memory virtualization for LLMs

LLM inference is shifting from being compute (FLOPs) bound to memory bound.

KV caches + activations outgrow GPU memory, so the token history behaves like a new kind of working set: only part of it can be resident at once.

I'm exploring a GPU memory virtualization layer that pages across NVLink/NVMe with cache coherence, like OS-level virtual memory for LLMs.

To simplify, here's the whole brain dump:

Early models: mostly FLOPs-bound (compute-heavy).
Modern large LLMs (esp. long context): increasingly memory-bound, since handling token histories (KV cache) overwhelms GPU memory & bandwidth.

Result: GPUs sit idle waiting for data, instead of doing math.
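
To put a rough number on how fast the KV cache grows, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption for a 7B-class model with standard multi-head attention, not a measurement from any specific deployment, and kv_cache_bytes is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per layer, head, and token."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed (illustrative) model shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=32_768)

print(per_token / 2**10, "KiB per token")   # 512.0
print(total / 2**30, "GiB at 32k tokens")   # 16.0
```

Under these assumptions a single 32k-token sequence needs about 16 GiB of cache before counting weights or activations, and the cost scales linearly with both context length and batch size.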

So, when we talk about GPU memory virtualization for LLMs, it means:

Treating GPU memory the way an OS treats RAM: paging data (KV cache, activations) in and out efficiently so the GPU is not left waiting on data.
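
The paging idea can be sketched as a toy pager in plain Python. This is a minimal illustration under stated assumptions, not how TokenVM or any real runtime is implemented: the class name PagedKVCache, the fixed page size, and the LRU eviction policy are all hypothetical choices, and ordinary Python containers stand in for GPU and host memory.

```python
from collections import OrderedDict


class PagedKVCache:
    """Toy pager: a fixed number of 'GPU' page slots with LRU eviction to a 'host' store."""

    def __init__(self, gpu_slots, page_tokens=16):
        self.gpu_slots = gpu_slots
        self.page_tokens = page_tokens
        self.gpu = OrderedDict()  # page_id -> entries, ordered least to most recently used
        self.host = {}            # page_id -> entries for pages evicted from the 'GPU'
        self.faults = 0           # number of times a page was not resident on access

    def page_of(self, token_idx):
        return token_idx // self.page_tokens

    def touch(self, page_id):
        """Make a page resident and return it, evicting the LRU page if slots are full."""
        if page_id in self.gpu:
            self.gpu.move_to_end(page_id)  # mark as most recently used
            return self.gpu[page_id]
        self.faults += 1
        data = self.host.pop(page_id, None)
        if data is None:
            data = []  # first touch: allocate an empty page
        if len(self.gpu) >= self.gpu_slots:
            victim, victim_data = self.gpu.popitem(last=False)  # least recently used
            self.host[victim] = victim_data
        self.gpu[page_id] = data
        return data

    def append(self, token_idx, kv):
        self.touch(self.page_of(token_idx)).append(kv)

    def read(self, token_idx):
        page = self.touch(self.page_of(token_idx))
        return page[token_idx % self.page_tokens]


# Usage: 12 tokens in pages of 4, but only 2 pages fit on the 'GPU'.
cache = PagedKVCache(gpu_slots=2, page_tokens=4)
for t in range(12):
    cache.append(t, f"kv{t}")

resident = list(cache.gpu)  # pages 1 and 2 are resident, page 0 was evicted
val = cache.read(0)         # page fault: page 0 comes back, page 1 is evicted
```

A real system would copy tensors between device and host (or NVMe) asynchronously and prefetch pages ahead of the attention kernel. The sketch only shows the bookkeeping: a page table, a residency limit, and an eviction policy.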

LLMs choke on long prompts because GPU memory is small.

I built TokenVM (POC) — a runtime that treats the KV cache like virtual memory.