LLM inference is shifting from being compute (FLOPs) bound to memory bound.
KV caches + activations overwhelm GPU RAM, making token streams behave like a 'new working set'.
I'm exploring a GPU memory virtualization layer, paging across NVLink/NVMe with cache-coherence, like OS-level VM for LLMs.
To simplify, here's the whole brain dump:
Early models: mostly FLOPs-bound (compute-heavy).
Modern large LLMs (esp. long context): increasingly memory-bound, since handling token histories (KV cache) overwhelms GPU memory & bandwidth.
Result: GPUs sit idle waiting for data, instead of doing math.
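To put a rough number on the KV cache pressure, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption in the range of a 7B-class model, not a measurement of any specific deployment, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    All defaults are assumed, illustrative values (fp16, 7B-class shape).
    """
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch


per_token = kv_cache_bytes(1)                 # bytes of KV state added per token
total_gib = kv_cache_bytes(32_768) / 2**30    # a 32K-token context, in GiB
print(f"{per_token / 2**20:.2f} MiB per token, {total_gib:.0f} GiB at 32K tokens")
```

Under these assumptions each token adds about 0.5 MiB of KV state, so a single 32K-token sequence needs about 16 GiB for the cache alone, before weights and activations.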
So, when we talk about GPU memory virtualization for LLMs, it means:
Treating GPU memory like an OS does RAM. Paging data (KV cache, activations) in/out efficiently so the GPU never stalls.
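A toy sketch of the paging idea, to make it concrete. This is not TokenVM's implementation: `KVPager`, the page granularity, and the LRU eviction policy are all assumptions chosen for illustration, and plain Python dicts stand in for the GPU and host memory tiers.

```python
from collections import OrderedDict


class KVPager:
    """Toy two-tier pager: a small 'GPU' tier backed by a larger 'host' tier."""

    def __init__(self, gpu_capacity_pages):
        self.capacity = gpu_capacity_pages
        self.gpu = OrderedDict()  # page_id -> data, least recently used first
        self.host = {}            # pages spilled out of the GPU tier
        self.faults = 0           # number of reads that missed the GPU tier

    def _make_room(self, incoming):
        # Evict least recently used pages until the incoming page fits.
        while incoming not in self.gpu and len(self.gpu) >= self.capacity:
            victim, data = self.gpu.popitem(last=False)
            self.host[victim] = data

    def write(self, page_id, data):
        self._make_room(page_id)
        self.host.pop(page_id, None)   # drop any stale spilled copy
        self.gpu[page_id] = data
        self.gpu.move_to_end(page_id)  # mark as most recently used

    def read(self, page_id):
        if page_id in self.gpu:
            self.gpu.move_to_end(page_id)
            return self.gpu[page_id]
        # Page fault: bring the page back from the host tier.
        data = self.host.pop(page_id)  # raises KeyError if never written
        self.faults += 1
        self._make_room(page_id)
        self.gpu[page_id] = data
        return data


pager = KVPager(gpu_capacity_pages=2)
for i in range(3):
    pager.write(i, f"kv{i}")   # writing page 2 spills page 0 to host
pager.read(0)                  # faults page 0 back in, spilling page 1
```

A real runtime would have to overlap these transfers with compute (prefetching pages before attention needs them), otherwise every fault becomes exactly the stall the paging layer is meant to avoid.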
LLMs choke on long prompts because the KV cache grows with every token while GPU memory stays small.
I built TokenVM (POC) — a runtime that treats the KV cache like virtual memory.