Modular Benchmark

Related Articles from SNS

ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents

arXiv:2512.00986v3 Announce Type: replace Abstract: A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents.

arXiv CS 8d ago

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

arXiv:2512.12634v4 Announce Type: replace Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations.

arXiv CS 8d ago

Zorawar tank: The made-in-India war machine built to dominate China on the LAC

The rollout of the Zorawar light tank from the AM Naik Heavy Engineering Complex marked a watershed moment for India’s defence industry. Developed in just 19 months, it is the country’s first indigenous light tank designed for high‑altitude warfare in the Himalayas. Zorawar was conceived during the tensions with China along the Line of Actual Control and as a counter to the Type 15 tanks the Indian Army faced during the stand-off.

Times of India 5d ago

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

arXiv:2606.03232v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of...

arXiv CS 7d ago

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

arXiv:2605.29076v2 Announce Type: replace Abstract: LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating...

arXiv CS 6d ago

Learning Association via Track-Detection Matching for Multi-Object Tracking

arXiv:2512.22105v2 Announce Type: replace Abstract: Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a...

arXiv CS 6d ago

MAVEN: Improving Generalization in Agentic Tool Calling

Announce Type: new Abstract: Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured...

arXiv CS 9d ago

OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data

Announce Type: replace Abstract: Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime $n \gg p$, where the number of graphs $n$ greatly exceeds the number of nodes per graph $p$. This overlooks biological domains such as omics, which operate in the opposite $n \ll p$ regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: \textit{how do...

arXiv CS 8d ago

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Announce Type: new Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for...

arXiv CS 2d ago

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

arXiv:2606.06915v2 Announce Type: replace Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a...

arXiv CS 1d ago