Straight-Through Estimators

Related Articles from SNS

Beyond Discreteness: Sample Complexity Analysis of Straight-Through Estimator for 1-bit Quantization

arXiv:2505.18113v2. Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely adopted heuristic, allowing backpropagation through discrete operations by introducing biased yet valid surrogate gradients. However, its theoretical properties remain largely unexplored, with few existing analyses...

arXiv CS 8d ago
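
As a rough illustration of the straight-through idea mentioned in the abstract above, the following minimal sketch trains a single 1-bit weight in plain Python. It is not taken from the paper: the function names (`sign`, `ste_grad`, `train`), the clipping threshold, the squared-error loss, and the learning rate are all hypothetical choices made for this example. The forward pass uses the quantized weight, while the backward pass pretends the quantizer is the identity inside a clipping window.

```python
def sign(w):
    # 1-bit quantizer: maps a real-valued (latent) weight to -1.0 or +1.0.
    return 1.0 if w >= 0 else -1.0


def ste_grad(w, upstream, clip=1.0):
    # Straight-through surrogate (assumed clipped-identity variant):
    # pass the upstream gradient through unchanged when |w| <= clip,
    # and block it otherwise. The true derivative of sign() is zero
    # almost everywhere, so this surrogate is biased by construction.
    return upstream if abs(w) <= clip else 0.0


def train(target=-1.0, w0=0.5, lr=0.1, steps=20):
    # Minimise (sign(w) - target)^2 over the latent weight w.
    w = w0
    for _ in range(steps):
        q = sign(w)                      # forward pass uses the quantized weight
        grad_q = 2.0 * (q - target)      # exact gradient of the loss w.r.t. q
        w -= lr * ste_grad(w, grad_q)    # backward pass uses the surrogate
    return w


if __name__ == "__main__":
    w_final = train()
    print(w_final, sign(w_final))
```

In this toy setting the latent weight keeps moving while its quantized value disagrees with the target, and stops once the sign flips, because the loss gradient with respect to the quantized value then becomes zero.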

You Can Learn Tokenization End-to-End with Reinforcement Learning

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of...

arXiv CS 8d ago

Gradient estimators for parameter inference in discrete stochastic kinetic models

arXiv:2604.02121v2. Stochastic kinetic models are ubiquitous in physics, yet inferring their parameters from experimental data remains challenging. For deterministic models, parameter inference often relies on gradients, which can be obtained efficiently through automatic differentiation (AD). However, AD cannot be applied directly to the Gillespie stochastic simulation algorithm (SSA), since sampling from a discrete set of reactions introduces...

arXiv Physics 6d ago

Gradient estimators for parameter inference in discrete stochastic kinetic models

Stochastic kinetic models are ubiquitous in physics, yet inferring their parameters from experimental data remains challenging. For deterministic models, parameter inference often relies on gradients, which can be obtained efficiently through automatic differentiation (AD). However, AD cannot be applied directly to the Gillespie stochastic simulation algorithm (SSA), since sampling from a discrete set of reactions introduces non-differentiable operations.

arXiv CS 6d ago
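
To make the non-differentiability noted in the abstract above concrete, here is a minimal sketch of the Gillespie direct method for a birth-death process, written in plain Python. The model (constant birth rate, per-molecule death rate), the function name `gillespie_birth_death`, and its parameters are assumptions chosen for illustration and do not come from the paper. The branch that selects which reaction fires is the discrete sampling step: its outcome changes in jumps as the rate parameters vary, which is what prevents automatic differentiation from being applied directly.

```python
import random


def gillespie_birth_death(k_birth, k_death, n0, t_end, seed=0):
    # Simulate the copy number n of one species up to time t_end.
    # Reactions (assumed toy model): birth at rate k_birth,
    # death at rate k_death * n.
    rng = random.Random(seed)
    t, n = 0.0, n0
    while True:
        a_birth = k_birth
        a_death = k_death * n
        a_total = a_birth + a_death
        if a_total == 0.0:
            break                        # no reaction can fire any more
        t += rng.expovariate(a_total)    # exponential waiting time to next event
        if t > t_end:
            break
        # Discrete choice of reaction: the non-differentiable step.
        if rng.random() * a_total < a_birth:
            n += 1
        else:
            n -= 1
    return n


if __name__ == "__main__":
    print(gillespie_birth_death(k_birth=10.0, k_death=1.0, n0=0, t_end=5.0))
```

Fixing the seed makes a single trajectory reproducible, but the returned count is still a piecewise-constant function of `k_birth` and `k_death`, so its pathwise derivative is zero almost everywhere.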

Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training

arXiv:2605.25054v2. Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantization-Aware Training (QAT) has emerged as a leading compression approach; however, existing mixed-precision methods typically operate at coarse layer- or channel-level granularity. These methods often rely on heuristic or search-based bit-allocation strategies, which may overlook fine-grained variability...

arXiv CS 2d ago

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

arXiv:2606.01838v1. Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis.

arXiv CS 8d ago

DOT-MoE: Differentiable Optimal Transport for MoEfication

arXiv:2606.01666v1. The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods...

arXiv CS 8d ago

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness.

arXiv CS 8d ago