1. Memory fragmentation
Internal fragmentation
Existing systems pre-allocate a contiguous chunk of KV cache memory for each request, sized for the maximum possible output length (e.g., 2048 tokens). If a request generates only a short output, most of that reservation is never touched, leading to significant waste.
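To make the waste concrete, here is a minimal sketch of the arithmetic; the function and its numbers are illustrative assumptions, not measurements from any particular system:

```python
# Hypothetical sketch: each request reserves KV-cache slots for max_len
# tokens up front, but only prompt + generated tokens ever occupy slots.

def internal_waste(prompt_len: int, output_len: int, max_len: int = 2048) -> float:
    """Fraction of the per-request reservation that goes unused."""
    used = prompt_len + output_len
    return (max_len - used) / max_len

# A request with a 64-token prompt that stops after 100 generated tokens
# wastes ~92% of its reservation:
print(f"{internal_waste(64, 100):.1%}")  # -> 92.0%
```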
External fragmentation
Because different requests reserve contiguous chunks of varying sizes, GPU memory becomes scattered with small unusable gaps: even when enough memory is free in total, no single contiguous region may be large enough to admit a new request. Our sources show that in existing systems, only 20.4%–38.2% of KV cache memory is actually used to store token states; the rest is waste.
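The failure mode is easy to reproduce with a toy first-fit allocator (a hypothetical sketch, not any real system's allocator): after some reservations are freed, plenty of memory is available in total, yet no single gap fits the next request.

```python
# Hypothetical contiguous allocator: per-request reservations of varying
# sizes leave gaps that a new request cannot use, even though the total
# free memory would be sufficient.

class ContiguousPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.allocs: list[tuple[int, int]] = []  # (offset, size) reservations

    def alloc(self, size: int) -> int | None:
        """First-fit: return a start offset, or None if no gap is big enough."""
        cursor = 0
        for off, sz in sorted(self.allocs):
            if off - cursor >= size:   # gap before this reservation fits
                break
            cursor = off + sz          # skip past this reservation
        if cursor + size > self.capacity:
            return None
        self.allocs.append((cursor, size))
        return cursor

    def free(self, offset: int) -> None:
        self.allocs = [(o, s) for o, s in self.allocs if o != offset]

pool = ContiguousPool(capacity=100)
a = pool.alloc(40)
b = pool.alloc(20)
c = pool.alloc(40)      # pool is now fully reserved
pool.free(a)
pool.free(c)            # 80 units free in total, but split into two 40-unit gaps
print(pool.alloc(50))   # -> None: no contiguous 50-unit region exists
```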
2. No memory sharing
Advanced decoding techniques such as parallel sampling and beam search generate multiple outputs from a single prompt, so the prompt's KV cache could in principle be shared across the output sequences. Existing systems cannot share it, however, because each sequence's KV cache lives in its own separate, contiguous block, so the prompt's keys and values end up duplicated once per sequence.
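To see why that matters, here is a back-of-the-envelope sketch of the duplication cost; the model dimensions are illustrative assumptions roughly in the range of a 7B-parameter model, not figures from the sources:

```python
# Hypothetical sizing: with contiguous per-sequence allocation, parallel
# sampling with n_samples outputs stores n_samples copies of the prompt's
# keys and values instead of one shared copy.

def prompt_kv_bytes(prompt_len: int, n_layers: int = 32, n_heads: int = 32,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16
    return 2 * prompt_len * n_layers * n_heads * head_dim * bytes_per_elem

n_samples = 4
per_copy = prompt_kv_bytes(prompt_len=1024)
redundant = (n_samples - 1) * per_copy  # copies beyond the one shared original
print(f"{redundant / 2**20:.0f} MiB duplicated")  # -> 1536 MiB for these numbers
```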