1. Memory fragmentation
Internal fragmentation
Existing systems pre-allocate a contiguous chunk of KV cache memory for each request, sized for the maximum possible output length (e.g., 2048 tokens). If a request generates only a short output, most of that reservation is never touched, leading to significant waste.
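To make the waste concrete, here is a minimal sketch of the arithmetic; the function and its numbers are illustrative assumptions, not measurements from any particular system:

```python
# Hypothetical sketch: each request reserves KV-cache slots for max_len
# tokens up front, but only prompt + generated tokens ever occupy slots.

def internal_waste(prompt_len: int, output_len: int, max_len: int = 2048) -> float:
    """Fraction of the per-request reservation that goes unused."""
    used = prompt_len + output_len
    return (max_len - used) / max_len

# A request with a 64-token prompt that stops after 100 generated tokens
# wastes ~92% of its reservation:
print(f"{internal_waste(64, 100):.1%}")  # -> 92.0%
```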
External fragmentation
Because different requests reserve contiguous chunks of varying sizes, GPU memory becomes scattered with small unusable gaps: even when enough memory is free in total, no single contiguous region may be large enough to admit a new request. Our sources show that in existing systems, only 20.4%–38.2% of KV cache memory is actually used to store token states; the rest is waste.
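The failure mode is easy to reproduce with a toy first-fit allocator (a hypothetical sketch, not any real system's allocator): after some reservations are freed, plenty of memory is available in total, yet no single gap fits the next request.

```python
# Hypothetical contiguous allocator: per-request reservations of varying
# sizes leave gaps that a new request cannot use, even though the total
# free memory would be sufficient.

class ContiguousPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.allocs: list[tuple[int, int]] = []  # (offset, size) reservations

    def alloc(self, size: int) -> int | None:
        """First-fit: return a start offset, or None if no gap is big enough."""
        cursor = 0
        for off, sz in sorted(self.allocs):
            if off - cursor >= size:   # gap before this reservation fits
                break
            cursor = off + sz          # skip past this reservation
        if cursor + size > self.capacity:
            return None
        self.allocs.append((cursor, size))
        return cursor

    def free(self, offset: int) -> None:
        self.allocs = [(o, s) for o, s in self.allocs if o != offset]

pool = ContiguousPool(capacity=100)
a = pool.alloc(40)
b = pool.alloc(20)
c = pool.alloc(40)      # pool is now fully reserved
pool.free(a)
pool.free(c)            # 80 units free in total, but split into two 40-unit gaps
print(pool.alloc(50))   # -> None: no contiguous 50-unit region exists
```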
2. No memory sharing
Advanced decoding techniques such as parallel sampling and beam search generate multiple outputs from a single prompt, so the prompt's KV cache could in principle be shared across the output sequences. Existing systems cannot share it, however, because each sequence's KV cache lives in its own separate, contiguous block, so the prompt's keys and values end up duplicated once per sequence.
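To see why that matters, here is a back-of-the-envelope sketch of the duplication cost; the model dimensions are illustrative assumptions roughly in the range of a 7B-parameter model, not figures from the sources:

```python
# Hypothetical sizing: with contiguous per-sequence allocation, parallel
# sampling with n_samples outputs stores n_samples copies of the prompt's
# keys and values instead of one shared copy.

def prompt_kv_bytes(prompt_len: int, n_layers: int = 32, n_heads: int = 32,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16
    return 2 * prompt_len * n_layers * n_heads * head_dim * bytes_per_elem

n_samples = 4
per_copy = prompt_kv_bytes(prompt_len=1024)
redundant = (n_samples - 1) * per_copy  # copies beyond the one shared original
print(f"{redundant / 2**20:.0f} MiB duplicated")  # -> 1536 MiB for these numbers
```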