Home › Knowledge Base › Model Performance Delta

Model Performance Delta

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

arXiv:2606.00357v2 Announce Type: replace Abstract: Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be...

arXiv CS 2d ago

Rethinking Search as Code Generation

Rethinking Search as Code Generation Evolving search from monolithic services to programmable primitives for the era of agent harnesses. Search is a core primitive for AI systems. Frontier models grow more capable by the month, but they still need access to fresh, accurate, and well-curated knowledge from the wider world.

Hacker News 8d ago

Knowledge Index of Noah's Ark

arXiv:2606.05104v2 Announce Type: replace Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors...

arXiv CS 5d ago

Knowledge Index of Noah's Ark

Announce Type: new Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize...

arXiv CS 6d ago

NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization

arXiv:2511.20409v2 Announce Type: replace Abstract: Text normalization methods such as stemming and lemmatization are fundamental components of NLP pipelines. As new normalization tools are developed for diverse languages, evaluation methodologies remain fragmented, relying on Compression Ratio, downstream accuracy, or sequence-to-sequence prediction scores in isolation, failing to distinguish between beneficial vocabulary reduction and harmful semantic distortion.

arXiv CS 8d ago

On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection

Announce Type: new Abstract: Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the field are computationally demanding, motivating interest in lightweight alternatives suited to edge and neuromorphic deployment. Spiking Neural Networks (SNNs) are therefore a natural candidate, but their design space, spanning the choice of neuron model and spike encoding scheme, remains poorly characterized for intrusion...

arXiv CS 8d ago

The High W Challenge: Robust Neutrino Energy Estimators for LArTPCs

Announce Type: replace Abstract: Accurate determination of the neutrino energy is central to precision oscillation measurements. In this work, we introduce the W$^2$-based estimator, a new neutrino energy estimator based on the measurement of the final-state hadronic invariant mass. This estimator is particularly designed to be employed in liquid-argon time-projection chambers exposed to broadband beams that span the challenging transition region between shallow inelastic scattering and deep...

arXiv Physics 7d ago

GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models

arXiv:2506.11042v2 Announce Type: replace Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a resource-efficient strategy for adapting Pretrained Foundation Models (PFMs) by learning a small number of task-specific updates $\Delta W$. Existing methods often learn $\Delta W$ largely independently of pretrained weights $W_0$, or exploit $W_0$ mainly through initialization or simple reparameterization. To further leverage the structural information encoded in $W_0$, we propose...

arXiv CS 5d ago

Deep learning four decades of human migration

Abstract Human migration is a fundamental driver of global demographic change, shaping population structure, labour markets and social policy across countries1,2,3. Although long-term migration patterns are often linked to economic development4, they can shift rapidly in response to shocks such as conflict, environmental crises and political change5. Despite its importance, migration remains difficult to measure consistently: existing data are sparse, concentrated in high-income settings and...

Nature 21h ago

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

Announce Type: new Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies...

arXiv CS 7d ago