LLM Dataset Inference
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
SPECTRA: Revealing the Full Spectrum of User Preferences via Distributional LLM Inference
arXiv:2509.24189v4 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly used to model user preferences, with the typical output as a directly-generated ranked item list per user. However, this generative paradigm inherits the bias and opacity of autoregressive decoding. It over-emphasizes frequent (head) preferences and suppresses minority, long-tail ones.
Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations
arXiv:2606.01204v1 Announce Type: new Abstract: We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends...
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
arXiv:2606.03305v1 Announce Type: new Abstract: Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear.
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
Announce Type: new Abstract: Serial LLM inference backends -- such as Ollama -- process requests one at a time under FCFS admission, causing Head-of-Line Blocking (HOLB) under mixed workloads at high utilisation: short factual queries can be delayed by minutes behind long generation jobs. While cloud-scale deployments mitigate HOLB via continuous batching (vLLM, Orca), these solutions require tens of GB of VRAM for concurrent KV-caches -- infeasible for memory-constrained edge and local...
CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery
arXiv:2606.03602v1 Announce Type: new Abstract: Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover,...
Inference Cost Attacks for Retrieval-Augmented Large Language Models
arXiv:2606.02643v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi-stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation.
PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation
Announce Type: cross Abstract: Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM...
Emergence of Context Characteristics Sensitivity in Large Language Models
Announce Type: new Abstract: During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages:...
Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
arXiv:2512.16310v3 Announce Type: replace Abstract: LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability.
The Regularizing Power of Language-Training Deepfake Detectors
arXiv:2605.31192v1 Announce Type: new Abstract: Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism.