Claude Sonnet
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
Announce Type: new Abstract: We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6).
From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
arXiv:2606.01136v1 Announce Type: new Abstract: Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato,...
Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
Announce Type: new Abstract: We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders...
Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating
arXiv:2606.05228v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate increasing proficiency on competitive programming benchmarks, yet technical reports predominantly publish aggregate pass rates, obscuring domain-specific vulnerabilities. We present a systematic empirical study of LLM failure patterns using a balanced taxonomy of 315 Codeforces problems across seven algorithm categories and three difficulty tiers. We evaluate GPT-4o and Claude Sonnet 4.6 under strict...
Scaffold Effects on GAIA: A Controlled Comparison
Announce Type: new Abstract: Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on...
How we index images for RAG
How we index images for RAG Reading the screenshots, diagrams and tables in technical documentation for LLMs by Matteo Bortoletto Kapa builds AI assistants that answer questions from technical documentation. The knowledge bases we process hold millions of images: screenshots, architecture diagrams, circuit schematics, annotated UI walkthroughs. We spent several months working out how to make them useful in our RAG pipeline.
Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
Announce Type: new Abstract: As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We...
I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
I built a vulnerable app and spent $1,500 seeing if LLMs could hack it As a part of my work I do security research for various apps and websites. I wanted to see if LLMs could reproduce a common class of exploits I’ve found in multiple apps. I made a fake React Native app in Expo and a backend in Python.
Netflix wiz creates app to slash AI bills, then open sources it
As the COOs from both Uber and Microsoft recently learned, encouraging company engineers to use AI aggressively can lead to hefty usage bills, perhaps even offsetting all the gains from laying off employees. The AI bills at Netflix may not be so eye-popping thanks to company senior engineer Tejas Chopra, who has created software to prune agent instructions, as measured in tokens, before they hit the LLM. Chopra has estimated that as much as 90% of tokens are redundant to the giant thinking...
Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$...