Science
Beyond Individual Personas: Aligning Synthetic Dialogue to Population-Level Behavior Distributions
Key Points
arXiv:2606.07893v1 Announce Type: new Abstract: Synthetic dialogue corpora are increasingly used as proxies for target dialogue data, yet persona-grounded generators optimize individual conversations rather than corpus composition, yielding locally plausible dialogues with distorted population-level behavior mixes. We introduce GroupPersona, a framework that aligns synthetic dialogue corpora to the behavior distribution of a reference corpus. GroupPersona turns population statistics into...
arXiv:2606.07893v1 Announce Type: new
Abstract: Synthetic dialogue corpora are increasingly used as proxies for target dialogue data, yet persona-grounded generators optimize individual conversations rather than corpus composition, yielding locally plausible dialogues with distorted population-level behavior mixes. We introduce GroupPersona, a framework that aligns synthetic dialogue corpora to the behavior distribution of a reference corpus. GroupPersona turns population statistics into generation controls: it separates each dialogue's core behavioral signature from predictable side effects, and uses the resulting behavioral groups to condition user agents on the interaction patterns that define the reference population. We evaluate GroupPersona on four corpora crossing two dialogue sources, assistant-style and Reddit-derived, with two construction variants: structure-preserving and variation-enhanced. GroupPersona lowers Jensen-Shannon divergence between synthetic and reference distributions over 12 behavior attributes from 0.234 to 0.177 relative to the strongest average baseline, a 24.4% reduction, and is best or tied-best on all four corpora while preserving structural alignment. It also achieves the closest calibration to reference-conversation quality scores, reducing mean absolute deviation from the reference-conversation profile to 0.63 versus 0.91 for the next-best baseline.