Measuring Agents in Production

Related Articles from SNS

Measuring Agents in Production

Announce Type: replace Abstract: LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains.

arXiv CS 2d ago

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

arXiv:2606.06460v1 Announce Type: new Abstract: As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a...

arXiv CS 5d ago

Show HN: Nucleus – A security-hardened, Nix-native container runtime

Extremely lightweight, security-hardened, declarative container runtime for agents and production services. Nucleus is a minimalist container runtime for Linux. It provides isolated execution environments using Linux kernel primitives without the overhead of traditional container runtimes. For production services, it is designed around a fully declarative model: Nix builds the root filesystem, the NixOS module declares the service, and Nucleus mounts a pinned, reproducible closure at runtime.

Hacker News 20h ago

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Announce Type: new Abstract: Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than...

arXiv CS 8d ago

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

arXiv:2606.02240v2 Announce Type: replace Abstract: Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on...

arXiv CS 7d ago

Agentic AI hype races ahead as enterprises remain stuck in pilot mode

Three-quarters of enterprise leaders say they're adopting agentic AI, but only a small minority have managed to move beyond pilots and into meaningful production deployments, according to Forrester. That won't stop vendors from slapping "agentic" onto every product brochure they can find, but the analyst's assessment is that most organizations remain stuck somewhere between experimentation and actual business value. Agentic AI has reached an important milestone in 2026, says Forrester:...

The Register 5d ago

Cognizant CEO calls AI tokens metric wrong; says it shouldn't be equated to productivity

Cognizant CEO Ravi Kumar S. has pushed back against the growing reliance on ‘AI tokens’ as a measure of productivity, calling the metric misleading and ‘a vanity exercise’. According to a report by Fortune, speaking at Fortune’s COO Summit in Scottsdale, Arizona, Kumar emphasised that companies should focus on outcomes rather than token consumption, a practice he feels has distorted how the industry evaluates artificial intelligence. For months, tech leaders including Sam Altman of OpenAI...

Times of India 8d ago

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

arXiv:2606.08867v1 Announce Type: new Abstract: The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement.

arXiv CS 1d ago

RECAP: Regression Evaluation for Continual Adaptation of Prompts

arXiv:2606.06698v1 Announce Type: new Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit these criteria, having close to no room for errors in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or...

arXiv CS 2d ago

RECAP: Regression Evaluation for Continual Adaptation of Prompts

arXiv:2606.06698v2 Announce Type: replace Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit these criteria, having close to no room for errors in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or...

arXiv CS 1d ago