LLM Outputs

Related Articles from SNS

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

Announce Type: replace Abstract: LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to provide complementary guarantees for LLM-generated content. This architecture employs logical reasoning for...

arXiv CS 9d ago

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

arXiv:2606.01629v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of...

arXiv CS 8d ago

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Announce Type: replace Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length;...

arXiv CS 7d ago

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

arXiv:2606.05682v1 Announce Type: new Abstract: Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency and cost constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low bit quantization by training a quantized student to match the output distribution of a frozen higher precision teacher via a KL-divergence loss. In this work, we first provide a...

arXiv CS 5d ago

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

arXiv:2606.05682v2 Announce Type: replace Abstract: Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency and cost constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low bit quantization by training a quantized student to match the output distribution of a frozen higher precision teacher via a KL-divergence loss. In this work, we first provide a...

arXiv CS 2d ago

A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

Announce Type: replace Abstract: Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution...

arXiv CS 8d ago

R2-Router: A New Paradigm for LLM Routing with Reasoning

arXiv:2602.02823v2 Announce Type: replace Abstract: As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length.

arXiv CS 8d ago

Empirical Study for Structured Output Control in LLMs for Software Engineering

Announce Type: new Abstract: LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, making structural fidelity an operational prerequisite for deploying LLMs in practice.

arXiv CS 1d ago

Failure of contextual invariance in large language models

Announce Type: replace Abstract: Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs.

arXiv CS 8d ago

Empirical Study on the Characteristics and Evolution of AI-usage in GitHub Repositories: Evidence from Code Comments

arXiv:2606.06843v1 Announce Type: new Abstract: Developers increasingly use AI tools such as ChatGPT, Copilot, and Claude in everyday software workflows, but prior studies often evaluate LLM outputs in isolation rather than examining how developers adapt them in real projects. We analyze 35,361 GitHub code comments that explicitly reference AI use and their associated code blocks. We first open-code 500 unique comments and code blocks to derive a taxonomy of AI-assisted development...

arXiv CS 2d ago