Modeling Synthetic Data Contamination

Related Articles from SNS

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

arXiv:2606.05168v1. Announce Type: new. Abstract: Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with...

arXiv CS 5d ago

Latent Performance Profiling of Large Language Models

Announce Type: replace. Abstract: Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes...

arXiv CS 9d ago

Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps

Announce Type: replace-cross. Abstract: We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density....

arXiv CS 1d ago

An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

Announce Type: new. Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985...

arXiv CS 2d ago

Regenerative farms lost three times less yield in France's droughts. Here's why

Regenerative farming could save enough wheat during drought to produce 130 million baguettes, according to a new French study. Faced with skyrocketing costs, supply shortages and extreme weather, Europe’s farmers are in crisis. With a hot summer looming, fuelled by human-caused climate change, drought is likely to take grip on the continent, further threatening food supplies and livelihoods.

Euronews 6d ago

Smoke engulfed their cities. Did it make their children sick?

Mothers fear children's chronic illnesses are linked to bushfire smoke during pregnancy. Sun 31 May 2026 at 5:16am. Six years after Black Summer bushfires, parents and doctors face an unsettling question: What does bushfire smoke do to babies in the womb? This story is a collaboration between the ABC's climate team and climate media organisation Grist. They never thought the fires would reach them.

ABC Australia 10d ago