Modular Benchmark
Related Articles from SNS
ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents
arXiv:2512.00986v3 Announce Type: replace Abstract: A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents.
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
arXiv:2512.12634v4 Announce Type: replace Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations.
Zorawar tank: The made-in-India war machine built to dominate China on the LAC
The rollout of the Zorawar light tank from the AM Naik Heavy Engineering Complex marked a watershed moment for India’s defence industry. Developed in just 19 months, it is the country’s first indigenous light tank designed for high‑altitude warfare in the Himalayas. Zorawar was conceived during the tensions with China along the Line of Actual Control and as a counter to the Type 15 tanks the Indian Army faced during the stand-off.
GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond
arXiv:2606.03232v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of...
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
arXiv:2605.29076v2 Announce Type: replace Abstract: LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating...
Learning Association via Track-Detection Matching for Multi-Object Tracking
arXiv:2512.22105v2 Announce Type: replace Abstract: Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a...
MAVEN: Improving Generalization in Agentic Tool Calling
Announce Type: new Abstract: Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured...
OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data
Announce Type: replace Abstract: Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime $n \gg p$, where the number of graphs $n$ greatly exceeds the number of nodes per graph $p$. This overlooks biological domains such as omics, which operate in the opposite $n \ll p$ regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: how do...
ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
arXiv:2606.06915v2 Announce Type: replace Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a...