Model Performance Delta
Related Articles from SNS
From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
arXiv:2606.00357v2 Announce Type: replace Abstract: Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be...
Rethinking Search as Code Generation
Rethinking Search as Code Generation Evolving search from monolithic services to programmable primitives for the era of agent harnesses. Search is a core primitive for AI systems. Frontier models grow more capable by the month, but they still need access to fresh, accurate, and well-curated knowledge from the wider world.
Knowledge Index of Noah's Ark
arXiv:2606.05104v2 Announce Type: replace Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors...
NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization
arXiv:2511.20409v2 Announce Type: replace Abstract: Text normalization methods such as stemming and lemmatization are fundamental components of NLP pipelines. As new normalization tools are developed for diverse languages, evaluation methodologies remain fragmented, relying on Compression Ratio, downstream accuracy, or sequence-to-sequence prediction scores in isolation, failing to distinguish between beneficial vocabulary reduction and harmful semantic distortion.
On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection
Announce Type: new Abstract: Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the field are computationally demanding, motivating interest in lightweight alternatives suited to edge and neuromorphic deployment. Spiking Neural Networks (SNNs) are therefore a natural candidate, but their design space, spanning the choice of neuron model and spike encoding scheme, remains poorly characterized for intrusion...
The High W Challenge: Robust Neutrino Energy Estimators for LArTPCs
Announce Type: replace Abstract: Accurate determination of the neutrino energy is central to precision oscillation measurements. In this work, we introduce the W$^2$-based estimator, a new neutrino energy estimator based on the measurement of the final-state hadronic invariant mass. This estimator is particularly designed to be employed in liquid-argon time-projection chambers exposed to broadband beams that span the challenging transition region between shallow inelastic scattering and deep...
GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models
arXiv:2506.11042v2 Announce Type: replace Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a resource-efficient strategy for adapting Pretrained Foundation Models (PFMs) by learning a small number of task-specific updates $\Delta W$. Existing methods often learn $\Delta W$ largely independently of pretrained weights $W_0$, or exploit $W_0$ mainly through initialization or simple reparameterization. To further leverage the structural information encoded in $W_0$, we propose...
Deep learning four decades of human migration
Abstract: Human migration is a fundamental driver of global demographic change, shaping population structure, labour markets and social policy across countries [1,2,3]. Although long-term migration patterns are often linked to economic development [4], they can shift rapidly in response to shocks such as conflict, environmental crises and political change [5]. Despite its importance, migration remains difficult to measure consistently: existing data are sparse, concentrated in high-income settings and...
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
Announce Type: new Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies...