QAT
Related Articles from SNS
Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models. Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so...
Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin
arXiv:2606.09012v1. Abstract: Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains...
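As a rough illustration of why PTQ degrades at aggressive bitwidths, the sketch below applies a symmetric, per-tensor, round-to-nearest quantizer to random weights and compares the reconstruction error at 8, 4, and 2 bits. It is not the method from the abstract above; the function name `fake_quantize`, the per-tensor scale, and the Gaussian test weights are all assumptions chosen for illustration. In QAT, a quantize-dequantize step of this kind is typically applied in the forward pass during training, with gradients commonly passed through the rounding via a straight-through estimator.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric uniform quantize-dequantize of w with one per-tensor scale.

    Hypothetical helper for illustration; assumes w is not all zeros.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = np.max(np.abs(w)) / qmax      # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer grid values
    return q * scale                      # back to floating point

# Stand-in "trained" weights: 1000 samples from a standard normal (assumption).
rng = np.random.default_rng(0)
w = rng.normal(size=1000)

# Mean squared error introduced by quantizing after training (PTQ-style).
err8 = np.mean((w - fake_quantize(w, 8)) ** 2)
err4 = np.mean((w - fake_quantize(w, 4)) ** 2)
err2 = np.mean((w - fake_quantize(w, 2)) ** 2)

print(f"8-bit MSE: {err8:.6f}")
print(f"4-bit MSE: {err4:.6f}")
print(f"2-bit MSE: {err2:.6f}")
```

The error grows as the bitwidth shrinks because the grid spacing `scale` widens; this toy measures weight reconstruction error only, not task accuracy.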
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
arXiv:2605.25054v2. Abstract: Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantization-Aware Training (QAT) has emerged as a leading compression approach; however, existing mixed-precision methods typically operate at coarse layer- or channel-level granularity. These methods often rely on heuristic or search-based bit-allocation strategies, which may overlook fine-grained variability...
Learning Quantized Continuous Controllers for Integer Hardware
arXiv:2511.07046v4. Abstract: Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference, and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA.
Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition
Abstract: Large neural networks achieve state-of-the-art performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, cross-layer parameter sharing remains relatively unexplored for transformer models.
Surrogate Neural Architecture Codesign Package (SNAC-Pack)
arXiv:2605.16138v2. Abstract: Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture...
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second
From the first roaring racer of the combustion age to the sonic boom that shattered the sound barrier, humanity's hunger for speed is written into our very DNA. The speed of AI reasoning is no different — it defines the boundaries of intelligence itself. When a model is fast enough, it ceases to be a tool you wait on and becomes an extension of your own thinking: responding in real time, iterating in an instant, collaborating without friction.