Holistic Evaluation to Structured
Related Articles from SNS
From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape
As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment.
Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance
arXiv:2605.31003v1. Search relevance modeling is a core task in e-commerce search systems, assessing how well a user query matches candidate products. Rather than relying on a single holistic matching signal, relevance judgment often requires structured reasoning over query understanding, product understanding, and facet-level matching. With large language models (LLMs), this process is increasingly formulated as chain-of-thought (CoT) reasoning and optimized with...
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
arXiv:2606.09401v1. Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using...
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
arXiv:2605.30000v2. Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new...
TeleSWEBench: A Commit-Driven Benchmark for Evaluating LLM-Powered Software Engineering in Telecommunications
With the telecommunications field embracing zero touch management alongside novel O-RAN and AI-RAN frameworks, contemporary telecom networks now function as immensely intricate and heavily softwareized codebases. While automated software engineering (ASE) tools and Software Engineering (SWE) Agents hold the potential to alleviate the critical code generation bottleneck in this domain, their ability to navigate and modify specialized, mathematically rigorous...
In the age of 'finfluencers' and AI, do financial advisers still matter?
Financial advice becomes more valuable as your life decisions become more complex. But until then, not everyone necessarily needs a financial adviser.
AI is blowing up music. How should the Grammys handle it?
Today I’m talking with Harvey Mason Jr., who is CEO of the Recording Academy — that’s the outfit that puts on the Grammy Awards. I last talked to Harvey in 2024, when it was obvious that generative AI would upend the music industry, but still not exactly clear how that would happen. Well, it’s been 18 months since that conversation, and you’re going to hear Harvey say that AI is now “omnipresent” in music production. And Harvey knows what he’s talking about — he is himself a legendary...
Breaking the Likelihood Trap: Consistent Generative Recommendation with Graph-structured Model
arXiv:2510.10127v3. Reranking, as the final stage of recommender systems, plays a crucial role in determining the final exposure, directly influencing user experience. Recently, generative reranking has gained increasing attention for formulating reranking as a holistic sequence generation task, implicitly modeling complex dependencies among items. However, most existing methods suffer from the likelihood trap, where high-likelihood sequences are often...
Harnessing Generalist Agents for Contextualized Time Series
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with...
The Frame Problem
To most AI researchers, the frame problem is the challenge of representing the effects of action in logic without having to represent explicitly a large number of intuitively obvious non-effects. But to many philosophers, the AI researchers' frame problem is suggestive of wider epistemological issues. Is it possible, in principle, to limit the scope of the reasoning required to derive the consequences of an action?