Semantic Confidence Aggregation
Related Articles from SNS
Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers
Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence.
Collective Hallucination in Multi-Agent LLMs: Modeling and Defense
arXiv:2606.07941v1. Abstract: Hallucinations in large language models (LLMs) create heightened risks in multi-agent settings, where recursive agent interactions can propagate, reinforce, and amplify unsupported claims. This paper models hallucination as a system-level, time-evolving process across a network of interacting LLM agents, where nodes represent agents and edges encode information exchange. The proposed formulation captures how hallucinated claims diffuse through...
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
arXiv:2606.08098v1. Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099).
GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence
Abstract: Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model...
Improving User Experience with Personalized Review Ranking and Summarization
arXiv:2601.05261v2. Abstract: Online consumer reviews are important decision-support resources in e-commerce, yet the increasing volume of reviews often creates information overload and makes it difficult for users to identify content that matches their individual preferences. Existing review-ranking approaches commonly rely on aggregate signals such as star ratings, helpfulness votes, or recency, which may not reflect user-specific interests. This paper proposes a...
FrontierCode
Introducing FrontierCode: raising the bar from correctness to quality. Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code?