Modeling Synthetic Data Contamination
Related Articles from SNS
Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics
arXiv:2606.05168v1. Announce type: new. Abstract: Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework, a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with...
Latent Performance Profiling of Large Language Models
Announce type: replace. Abstract: Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes...
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps
Announce type: replace-cross. Abstract: We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density....
An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection
Announce type: new. Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985...
Regenerative farms lost three times less yield in France's droughts. Here's why
Regenerative farming could save enough wheat during drought to produce 130 million baguettes, according to a new French study. Faced with skyrocketing costs, supply shortages and extreme weather, Europe’s farmers are in crisis. With a hot summer looming, fuelled by human-caused climate change, drought is likely to take grip on the continent, further threatening food supplies and livelihoods.
Smoke engulfed their cities. Did it make their children sick?
Mothers fear children's chronic illnesses are linked to bushfire smoke during pregnancy. Sun 31 May 2026 at 5:16am. Six years after Black Summer bushfires, parents and doctors face an unsettling question: What does bushfire smoke do to babies in the womb? This story is a collaboration between the ABC's climate team and climate media organisation Grist. They never thought the fires would reach them.