Home › Knowledge Base › Benchmarking Local

Benchmarking Local

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

arXiv:2606.01338v1 Announce Type: new Abstract: Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs...

arXiv CS 8d ago

Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows

Announce Type: new Abstract: Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation...

arXiv CS 9d ago

Overestimating zero-shot fitness prediction: Broad benchmarks mask local failures and practical limitations

Deep learning models have emerged as promising tools for navigating mutational landscapes in protein engineering. These models can be used to predict mutation fitness without the need for task-specific training, a process known as zero-shot prediction. However, their practical utility remains only partially characterized.

bioRxiv 3d ago

Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

Announce Type: new Abstract: Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality...

arXiv CS 6d ago

ERGeoBench:A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

arXiv:2605.31251v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through...

arXiv CS 9d ago

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

arXiv:2512.07436v3 Announce Type: replace Abstract: Recent advances in large reasoning models LRMs have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios.

arXiv CS 8d ago

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows.

arXiv CS 7d ago

Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion

arXiv:2606.09603v1 Announce Type: new Abstract: Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered...

arXiv CS 1d ago

When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

Announce Type: new Abstract: Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score.

arXiv CS 1d ago

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

arXiv:2606.03889v1 Announce Type: new Abstract: Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution...

arXiv CS 7d ago