
Sparse Mixture of Experts


Related Articles from SNS

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

arXiv:2510.16138v2 Announce Type: replace Abstract: Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that...

arXiv CS 9d ago

Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

arXiv:2511.08972v2 Announce Type: replace Abstract: Sparse Mixture-of-Experts (SMoE) models are scalable and computationally efficient, enabling large increases in model capacity with limited inference overhead. Existing SMoE methods often depend on auxiliary objectives, such as load-balancing loss and z-loss, or additional trainable components such as noisy gating. While these techniques encourage expert diversity, they can introduce objective misalignment, increase model complexity, or...

arXiv CS 5d ago

Rethinking Sparse Mixture of Experts from a Unified Perspective

arXiv:2503.22996v3 Announce Type: replace Abstract: Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, and Expert Choice, which assigns a fixed number of tokens to each expert. However, the use of fixed budgets for tokens or experts causes both approaches to select irrelevant token-expert pairs or overlook critical...

arXiv CS 9d ago
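The Token Choice / Expert Choice distinction in the abstract above can be illustrated with a minimal sketch. This is not the paper's method; the score matrix, the function names `token_choice` and `expert_choice`, and the budgets `k` and `capacity` are all hypothetical and chosen only to show how each fixed budget behaves.

```python
# Illustrative sketch only: names and numbers are assumptions, not from the paper.

def token_choice(scores, k):
    """Token Choice: every token (row) keeps its k highest-scoring experts."""
    routes = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        routes.append(sorted(ranked[:k]))
    return routes  # routes[t] = expert indices chosen by token t


def expert_choice(scores, capacity):
    """Expert Choice: every expert (column) keeps its `capacity` best tokens."""
    num_tokens = len(scores)
    num_experts = len(scores[0])
    picks = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks  # picks[e] = token indices chosen by expert e


# Hypothetical router scores: 4 tokens (rows) x 3 experts (columns).
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.0, 0.3],
    [0.6, 0.3, 0.1],
]

tc = token_choice(scores, k=1)
ec = expert_choice(scores, capacity=2)
print(tc)  # [[0], [0], [0], [0]]
print(ec)  # [[0, 1], [1, 3], [2, 3]]
```

With these scores, Token Choice sends all four tokens to expert 0, so expert load is unbalanced. Expert Choice balances load at two tokens per expert, but tokens 1 and 3 are processed by two experts while tokens 0 and 2 get only one. In both schemes the budget is fixed in advance on one side of the token-expert pairing.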

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Announce Type: new Abstract: Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to...

arXiv CS 6d ago

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation.

arXiv CS 6d ago
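The memory argument in the abstract above comes down to simple arithmetic, sketched here under assumed numbers. The expert sizes, the bit-widths, and the helper `footprint_bytes` are hypothetical; how AlphaQ actually allocates bits is not shown in the excerpt.

```python
# Illustrative arithmetic only: sizes and bit-widths below are assumptions.

def footprint_bytes(params_per_expert, bits_per_expert):
    """Total weight memory when expert i is stored at bits_per_expert[i] bits."""
    total_bits = sum(p * b for p, b in zip(params_per_expert, bits_per_expert))
    return total_bits / 8


params = [1_000_000] * 4  # four hypothetical experts, 1M weights each

uniform16 = footprint_bytes(params, [16, 16, 16, 16])
mixed = footprint_bytes(params, [8, 4, 4, 2])  # per-expert bit-widths

print(uniform16)  # 8000000.0
print(mixed)      # 2250000.0
```

All four experts still reside in memory in both cases; the mixed assignment only shrinks how many bytes each one occupies.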

MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation

arXiv:2605.31010v1 Announce Type: new Abstract: Retrieval-augmented generation is intensively studied to ground large language models on external evidence. However, retrieving from a unified knowledge base could inevitably introduce irrelevant information that may mislead generation for complex reasoning. Inspired by the conditional computation of mixture of experts (MoE), where a router sparsely selects specialized experts alongside shared ones for each input, we propose Mixture...

arXiv CS 9d ago

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

arXiv:2606.01062v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary...

arXiv CS 8d ago

Sparsely gated tiny linear experts

Announce Type: new Abstract: Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense.

arXiv CS 2d ago

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

arXiv:2606.04438v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. However, mainstream looped architectures rely on dense backbones that couple parameter count with per-token FLOPs, which makes it impossible to isolate the effect of iterative computation under matched budgets. To this end, we present LoopMoE, a looped MoE language model that integrates sparse routing with...

arXiv CS 6d ago

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

arXiv:2606.07500v1 Announce Type: new Abstract: Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the...

arXiv CS 2d ago