Opus

Related Articles from SNS

Show HN: Formally verified polygon intersection – Opus 4.8 oneshots, prev failed

To my knowledge, this is the first formally verified implementation of an intersection algorithm for polygons. The experience of working with AI agents on this project changed a lot with recent model releases, as I describe in the readme. Opus 4.8 is able to provide an algorithm implementation with a formal proof in one shot, whereas previous models required me to provide proof strategies in multiple steps.

Hacker News 5d ago

Zot now supports Claude Opus 4.8

The Zot platform has announced that it now supports the Claude Opus

Hacker News 12d ago

Claude Opus 4.8

Hacker News 12d ago

Claude’s new model is more ‘honest’ when it messes up

Anthropic is releasing Claude Opus 4.8 on Thursday, and the company is touting the model's "honesty." According to Anthropic, it trains "all [its] models to be honest - for instance, to avoid making claims that they can't support." But it notes that "a general problem with AI models is that they sometimes jump to conclusions, confidently presenting their work as making progress despite thin evidence." The AI lab claims that early testers have found that Opus 4.8 "is more likely to flag...

The Verge 12d ago

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

arXiv:2605.04135v2. Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse...

arXiv CS 5d ago

Scaffold Effects on GAIA: A Controlled Comparison

Abstract: Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on...

arXiv CS 1d ago

AI agents actively ignore EU law to achieve goals, study finds

The best-performing AI agent, Anthropic’s Claude Opus, only complied with EU law in 54% of cases, according to a Dutch non-profit research firm. Some of the world's most popular AI models are building agents that actively resist EU regulation to get what they want, according to new research. Aithos, a Dutch non-profit researching AI alignment, developed a system called LARA to test 12 popular AI agent models to see whether they would follow key parts of the EU AI Act, which regulates how AI...

Euronews 8d ago

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

arXiv:2606.08151v1. Abstract: Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and...

arXiv CS 1d ago

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

Abstract: We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6).

arXiv CS 8d ago

Anthropic says these topics are too dangerous to let its Fable 5 model talk about

Anthropic on Tuesday publicly released Claude Fable 5, its first "Mythos-class" model that it says surpasses its previous frontier Opus models in overall capabilities. But the model's launch today comes with safeguards designed to prevent it from answering queries on topics like cybersecurity, biology, and chemistry, where the company has publicly worried about its potential to "uplift" malicious actors. Anthropic says Fable 5 operates on the "same underlying model" as Mythos 5, which is...

Ars Technica 16h ago