Home Knowledge Base Cookie-Bench

Cookie-Bench

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

arXiv:2605.30000v2 Announce Type: replace Abstract: Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new...

arXiv CS 8d ago