LLM Compression

Related Articles from SNS

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

Announce Type: new Abstract: We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected...

arXiv CS 2d ago

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

Announce Type: new Abstract: Recent SVD-based compression methods for large language models, such as SVD-LLM and Basis Sharing, can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show that this unified approach improves weight reconstruction error by up to 46%, it fails in practical tasks.

arXiv CS 9d ago

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

Announce Type: new Abstract: Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large-scale deployment.

arXiv CS 7d ago

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

arXiv:2605.30836v2 Announce Type: replace Abstract: Recent SVD-based compression methods for large language models, such as SVD-LLM and Basis Sharing, can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show that this unified approach improves weight reconstruction error by up to 46%, it fails in practical tasks.

arXiv CS 1d ago

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

arXiv:2606.02559v1 Announce Type: new Abstract: Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention...

arXiv CS 8d ago

ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression

arXiv:2606.00494v2 Announce Type: replace Abstract: Post-Training Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model's weights) in a way LoRA can't easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task...

arXiv CS 7d ago

Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Announce Type: replace Abstract: The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient.

arXiv CS 1d ago

LLM Compression with Jointly Optimizing Architectural and Quantization choices

arXiv:2606.04063v1 Announce Type: new Abstract: Deploying large language models (LLMs) is challenging due to their significant memory and computational requirements. While some methods address this by developing small or tiny language models from scratch, these approaches demand extensive GPU training. Compressing pre-trained LLMs for edge devices offers a compelling alternative.

arXiv CS 6d ago

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

arXiv:2606.07819v1 Announce Type: new Abstract: Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal...

arXiv CS 1d ago

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

arXiv:2606.03739v1 Announce Type: new Abstract: LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional...

arXiv CS 7d ago