Unlocking LLM superpowers: How PagedAttention solves the memory maze

1. Memory fragmentation

Internal fragmentation

Existing serving systems pre-allocate a large contiguous chunk of KV cache memory for each request, sized for the maximum possible sequence length (e.g., 2048 tokens). However, if a request generates only a short output, most of that reserved memory goes unused, leading to significant waste.
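
To get a feel for the scale of this waste, here is a rough back-of-the-envelope sketch in Python. The model dimensions below (layers, heads, head size, fp16) are illustrative assumptions roughly in line with a mid-size model, not figures taken from the paper.

```python
# Back-of-the-envelope sketch of internal fragmentation from max-length pre-allocation.
# All model dimensions are assumed for illustration.

NUM_LAYERS = 40
NUM_HEADS = 40
HEAD_DIM = 128
BYTES_PER_ELEM = 2          # fp16
MAX_SEQ_LEN = 2048          # slot reserved per request

# KV cache bytes for a single token: key + value, across all layers and heads.
bytes_per_token = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_PER_ELEM

def reserved_vs_used(prompt_len: int, output_len: int) -> tuple[int, int]:
    """Return (reserved_bytes, used_bytes) for one request under max-length pre-allocation."""
    reserved = MAX_SEQ_LEN * bytes_per_token
    used = (prompt_len + output_len) * bytes_per_token
    return reserved, used

reserved, used = reserved_vs_used(prompt_len=64, output_len=128)
print(f"reserved: {reserved / 2**20:.1f} MiB, used: {used / 2**20:.1f} MiB, "
      f"wasted: {100 * (1 - used / reserved):.1f}%")
# With these assumed dimensions, a 192-token request wastes roughly 90% of its reservation.
```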

External fragmentation

Because different requests reserve contiguous chunks of varying sizes, GPU memory becomes scattered with small, unusable gaps, making it hard to place new requests even when enough total free memory exists. The PagedAttention paper reports that in existing systems, only 20.4% – 38.2% of KV cache memory actually holds token states; the rest is wasted.
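
The toy simulation below illustrates the effect with a hypothetical contiguous allocator and made-up sizes: plenty of memory is free in total, yet no single gap is large enough for a new request.

```python
# Toy illustration of external fragmentation with contiguous per-request reservations.
# The allocator and all sizes are hypothetical, purely for illustration.

def free_gaps(total: int, allocations: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (offset, size) of the free gaps given (offset, size) allocations."""
    gaps, cursor = [], 0
    for off, size in sorted(allocations):
        if off > cursor:
            gaps.append((cursor, off - cursor))
        cursor = max(cursor, off + size)
    if cursor < total:
        gaps.append((cursor, total - cursor))
    return gaps

TOTAL = 100  # memory units
# Live reservations left behind after earlier requests of varying sizes finished.
live = [(0, 30), (40, 25), (75, 20)]

gaps = free_gaps(TOTAL, live)
total_free = sum(size for _, size in gaps)
request = 12  # a new request needs 12 contiguous units

print(f"free gaps: {gaps}, total free: {total_free}")
print("fits contiguously:", any(size >= request for _, size in gaps))
# 25 units are free in total, but the largest gap is only 10 -> the request cannot be placed.
```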

2. No memory sharing

Advanced decoding techniques like parallel sampling or beam search generate multiple outputs from a single prompt, so their sequences could in principle share the prompt's portion of the KV cache. However, existing systems cannot easily share this memory because each sequence's KV cache lives in its own separate, contiguous region.
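
Below is a minimal sketch of how block-level (paged) KV cache management can enable such sharing, in the spirit of PagedAttention. The class, method names, and block size are illustrative assumptions, not vLLM's actual API.

```python
# Minimal sketch of block-table sharing with reference counting and copy-on-write.
# Names and the block size are assumed for illustration.

BLOCK_SIZE = 16  # tokens per KV block (assumed)

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        """Share an existing sequence's blocks with a new sequence (e.g. one beam or sample)."""
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)

    def copy_on_write(self, block_table: list[int], idx: int) -> None:
        """Before a shared block is modified, give this sequence its own private copy."""
        old = block_table[idx]
        if self.ref_count[old] > 1:
            self.ref_count[old] -= 1
            block_table[idx] = self.allocate()  # block contents would be copied here

mgr = BlockManager(num_blocks=8)
prompt_blocks = [mgr.allocate() for _ in range(2)]   # prompt KV cache: 2 blocks
sample_a = mgr.fork(prompt_blocks)                   # two parallel samples share them
sample_b = mgr.fork(prompt_blocks)
mgr.copy_on_write(sample_b, idx=1)                   # sample B diverges in its last block
print(sample_a, sample_b, mgr.ref_count)
```

Because sequences reference physical blocks indirectly through a block table, the shared prompt blocks are stored once, and a sample only pays for the blocks where it actually diverges.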
