KV-Cache Compression via RoPE-Aligned Pruning

Abstract

Long-context inference in large language models is bottlenecked by KV-Cache memory consumption. RAP (RoPE-Aligned Pruning) prunes RoPE-aligned column pairs, preserving the rotation structure, enabling B absorption, and eliminating reconstruction overhead. RAP achieves a 20-30% joint reduction of KV-Cache, attention parameters, and FLOPs on LLaMA-3-8B and Mistral-7B, reducing attention latency to 83% of baseline for prefill and 77% for decode.
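The key structural constraint can be illustrated with a small sketch: RoPE rotates coordinates in two-dimensional planes, so pruning must remove both members of each aligned pair for the rotation to remain valid on the kept dimensions. The snippet below is a minimal illustration, assuming the LLaMA-style half-split pairing (dimension i pairs with i + d/2); the pair indices are hypothetical, not taken from the paper.

```python
import numpy as np

def rope_rotate(x, theta):
    # LLaMA-style half-split RoPE: dims (i, i + d//2) form one rotation plane
    half = x.shape[0] // 2
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

d, pos, base = 8, 5, 10000.0
half = d // 2
theta = pos * base ** (-np.arange(half) / half)  # per-plane rotation angles

x = np.random.default_rng(0).normal(size=d)
keep = np.array([0, 2, 3])                       # hypothetical kept planes
keep_dims = np.concatenate([keep, keep + half])  # both members of each pair

# Pruning whole RoPE-aligned pairs commutes with the rotation:
rotate_then_prune = rope_rotate(x, theta)[keep_dims]
prune_then_rotate = rope_rotate(x[keep_dims], theta[keep])
assert np.allclose(rotate_then_prune, prune_then_rotate)
```

Because each rotation acts blockwise on one pair, dropping entire pairs keeps the remaining rotations well-defined; pruning a single member of a pair would break this commutativity and force reconstruction at inference time.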

Type
Publication
arXiv preprint
辛继灏
Ph.D. Student in Machine Learning Systems & Agentic AI