KV-Cache Compression via RoPE-Aligned Pruning

Abstract

Long-context inference in large language models is bottlenecked by KV-Cache memory consumption. RAP (RoPE-Aligned Pruning) prunes RoPE-aligned column pairs, preserving the rotation structure, enabling B absorption, and eliminating reconstruction overhead. RAP achieves a 20-30% joint reduction of KV-Cache, attention parameters, and FLOPs on LLaMA-3-8B and Mistral-7B, reducing attention latency to 83% of baseline for prefill and 77% for decode.
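The key structural constraint can be illustrated with a small sketch: RoPE rotates coordinates in two-dimensional planes, so pruning must remove both members of each aligned pair for the rotation to remain valid on the kept dimensions. The snippet below is a minimal illustration, assuming the LLaMA-style half-split pairing (dimension i pairs with i + d/2); the pair indices are hypothetical, not taken from the paper.

```python
import numpy as np

def rope_rotate(x, theta):
    # LLaMA-style half-split RoPE: dims (i, i + d//2) form one rotation plane
    half = x.shape[0] // 2
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

d, pos, base = 8, 5, 10000.0
half = d // 2
theta = pos * base ** (-np.arange(half) / half)  # per-plane rotation angles

x = np.random.default_rng(0).normal(size=d)
keep = np.array([0, 2, 3])                       # hypothetical kept planes
keep_dims = np.concatenate([keep, keep + half])  # both members of each pair

# Pruning whole RoPE-aligned pairs commutes with the rotation:
rotate_then_prune = rope_rotate(x, theta)[keep_dims]
prune_then_rotate = rope_rotate(x[keep_dims], theta[keep])
assert np.allclose(rotate_then_prune, prune_then_rotate)
```

Because each rotation acts blockwise on one pair, dropping entire pairs keeps the remaining rotations well-defined; pruning a single member of a pair would break this commutativity and force reconstruction at inference time.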

Type
Publication
arXiv preprint
辛继灏
Ph.D. Student in Machine Learning Systems & Agentic AI