Home Business & Finance OASIS: Outlier-Aware LUT-Based GEMM with Dual-Side...
Business & Finance

OASIS: Outlier-Aware LUT-Based GEMM with Dual-Side Quantization for LLM Inference Acceleration

Key Points

arXiv:2507.23035v4 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications, but demand substantial memory and compute resources during inference. Existing quantization methods expose a trade-off between efficiency and accuracy: weight-only quantization (WOQ) incurs costly dequantization overheads, while integer weight-and-activation quantization (INT-WAQ) reduces precision and degrades model quality....

arXiv:2507.23035v4 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications, but demand substantial memory and compute resources during inference. Existing quantization methods expose a trade-off between efficiency and accuracy: weight-only quantization (WOQ) incurs costly dequantization overheads, while integer weight-and-activation quantization (INT-WAQ) reduces precision and degrades model quality. Non-uniform weight-and-activation quantization (NU-WAQ) can better capture the non-uniform distributions of LLM weights and activations, yet remains incompatible with conventional low-precision compute units. This paper presents OASIS, a lookup table (LUT)-based architecture that enables efficient general matrix multiplication (GEMM) between non-uniformly quantized weights and activations without requiring dequantization. OASIS employs pre-computed Cartesian Product LUTs, achieving a 64x reduction in LUT size and enabling a 1024x higher computational parallelism over existing LUT-based GEMM methods. To preserve accuracy under aggressive activation quantization, OASIS introduces an outlier-aware quantization scheme with concurrent LUT-based GEMM and error compensation for outliers. Furthermore, we design Orizuru, an efficient top-k detection engine for real-time activation outlier identification. According to extensive evaluations, OASIS incurs an average accuracy drop of only 1.98% compared to the FP16 baseline, which is 5.18% lower than Atom. On the hardware side, OASIS achieves an average 3.00x speedup and a 1.44x energy efficiency improvement compared to the FIGLUT accelerator.
GEMM (PERSON) WOQ (ORG) LLM (ORG) OASIS (LOCATION) Cartesian (ORG) LUT (ORG) Orizuru (ORG) Atom (ORG) FIGLUT (ORG)
Originally published by arXiv CS Read original →