Home › Knowledge Base › Automated Framework to Evaluate

Automated Framework to Evaluate

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

arXiv:2604.01039v2 Announce Type: replace Abstract: System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications....

arXiv CS 1d ago

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

arXiv:2606.08531v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final-output judgments, making it difficult to capture the diverse...

arXiv CS 1d ago

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Announce Type: replace Abstract: Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research.

arXiv CS 8d ago

AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

Announce Type: replace Abstract: Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches-such as genetic programming, reinforcement learning, and large language models-have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures.

arXiv CS 7d ago

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Announce Type: new Abstract: Feature engineering remains essential for tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for automating this process, giving rise to LLM-powered AuTomated Tabular feature Engineering (LATTE). However, the absence of standardized platforms prevents fair, cost-aware comparisons. Furthermore, complex methodological designs obscure the specific contributions of individual components; for example, although LFG integrates...

arXiv CS 1d ago

PDE-Agents: An LLM-Orchestrated Multi-Agent Framework for Automated Finite Element Simulations with Knowledge Graph-Augmented Reasoning

Announce Type: new Abstract: We present PDE-Agents, a multi-agent ecosystem that automates the full lifecycle of partial differential equation (PDE) / finite element method (FEM) simulations through natural-language interaction. Three specialist large language model (LLM) agents (Simulation, Analytics, Database) are orchestrated via a LangGraph supervisor, with a local open-source LLM stack (Qwen3-Coder-Next, Llama 4 Scout) on dual NVIDIA RTX PRO 6000 GPUs. The architecture is...

arXiv Physics 1d ago

QoEReasoner: An Agentic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs

arXiv:2606.01925v1 Announce Type: new Abstract: Diagnosing Quality-of-Experience (QoE) degradations in operational Radio Access Networks (RANs) is a critical but notoriously complex task, traditionally requiring labor-intensive expert analysis over high-dimensional, cross-layer telemetry. While Large Language Models (LLMs) offer unprecedented reasoning capabilities, they are fundamentally unsuited for raw RANs troubleshooting: they fail at numeric time-series analysis, hallucinate...

arXiv CS 8d ago

QoEReasoner: An Agentic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs

Announce Type: replace Abstract: Diagnosing Quality-of-Experience (QoE) degradations in operational Radio Access Networks (RANs) is a critical but notoriously complex task, traditionally requiring labor-intensive expert analysis over high-dimensional, cross-layer telemetry. While Large Language Models (LLMs) offer unprecedented reasoning capabilities, they are fundamentally unsuited for raw RANs troubleshooting: they fail at numeric time-series analysis, hallucinate protocol-violating causal...

arXiv CS 7d ago

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

arXiv:2606.06473v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent...

arXiv CS 5d ago

ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

arXiv:2606.05567v1 Announce Type: new Abstract: LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable LLM reasoning, and agent decisions remain opaque to human analysts. These three shortcomings, in realism, consistency, and auditability, are usually patched in isolation.

arXiv CS 5d ago