Home Knowledge Base Design and Evaluation of Agent

Design and Evaluation of Agent

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

Announce Type: new Abstract: Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment...

arXiv CS 5d ago

Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

Announce Type: new Abstract: Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems tradeoff fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism.

arXiv CS 9d ago

Autonomous agentic design for photonics

new Abstract: We introduce an automated, agent-driven approach to the design of photonic devices. We instruct large language models (LLMs) to solve photonic design problems, given access to software tools for performance evaluation (through numerical simulations) and quantitative acceptance criteria (e.g., fabrication rules, geometric constraints, physical-consistency checks). Within this context, agents run autonomous design loops (propose, simulate, evaluate, iterate) and generate devices...

arXiv Physics 8d ago

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

arXiv:2606.08200v1 Announce Type: new Abstract: Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for...

arXiv CS 1d ago

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

Announce Type: new Abstract: The rapid proliferation of multi-cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross-vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single-platform environments and provide no means to assess whether an AI...

arXiv CS 7d ago

Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

Announce Type: new Abstract: AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy.

arXiv CS 5d ago

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Announce Type: new Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development.

arXiv CS 6d ago

Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

arXiv:2606.05872v2 Announce Type: replace Abstract: AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior...

arXiv CS 1d ago

RDA: Reward Design Agent for Reinforcement Learning

Announce Type: new Abstract: Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, they rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned...

arXiv CS 8d ago

Evaluating agentic AI for biological discovery in autonomous and copilot settings

Advances in large language models (LLMs)-based artificial intelligence (AI) agents have improved their ability to execute structured analytical workflows, including standard bioinformatic pipelines for biological discovery. However, computational biology rarely consists of deterministic pipeline execution alone. Biological datasets are heterogeneous and noisy, and meaningful discovery often requires open-ended hypothesis generation and iterative reasoning over multimodal evidence.

bioRxiv 1d ago