Agentic Web Bench

Related Articles from SNS

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web

Announce Type: replace Abstract: The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of Agentic Web is still limited by insufficient autonomous software agent population, which has become a crucial challenge for scaling Agentic Web. In order to alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via...

arXiv CS 2d ago

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Announce Type: new Abstract: Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it.

arXiv CS 5d ago

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

arXiv:2605.30000v2 Announce Type: replace Abstract: Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new...

arXiv CS 8d ago

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

arXiv:2606.01993v1 Announce Type: new Abstract: Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides...

arXiv CS 8d ago

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

arXiv:2606.03203v1 Announce Type: new Abstract: Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion....

arXiv CS 7d ago

Google CEO called out 'biggest AI budget problem' of companies world over from IO stage with a solution

Google CEO Sundar Pichai shifted the AI conversation to economics at this year's Google I/O conference. Pichai warned that companies around the world are blowing through their annual AI budgets by May due to runaway token usage. Pichai said the rapid rise of AI agents has created unprecedented costs for enterprises.

Times of India 11d ago

Skill Retrieval Augmentation for Agentic AI

arXiv:2604.24594v3 Announce Type: replace Abstract: As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent...

arXiv CS 1d ago

Show HN: Nucleus – A security-hardened, Nix-native container runtime

Extremely lightweight, security-hardened, declarative container runtime for agents and production services. Nucleus is a minimalist container runtime for Linux. It provides isolated execution environments using Linux kernel primitives without the overhead of traditional container runtimes. For production services, it is designed around a fully declarative model: Nix builds the root filesystem, the NixOS module declares the service, and Nucleus mounts a pinned, reproducible closure at runtime.

Hacker News 18h ago

Superintelligence: The Idea That Eats Smart People (2016)

This is the text version of a talk I gave on October 29, 2016, at Web Camp Zagreb. In 1945, as American physicists were preparing to test the atomic bomb, it occurred to someone to ask if such a test could set the atmosphere on fire. This was a legitimate concern.

Hacker News 8d ago