Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument
arXiv:2606.01643v1 Announce Type: new
Abstract: Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language
text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and
back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially
while the underlying generator fails to faithfully represent the sign language gestures. In this work we
propose to evaluate the generated motion at three independent levels: ($\tau_1$) initial-pose conditioning, ($\tau_2$)
output diversity, and ($\tau_3$) target faithfulness. We compute these as pairwise-distance ratios using latent
representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign
dataset, including a re-implementation of Neural Sign Actors (NSA), and show that $\tau_3$ faithfulness is never
attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show
that on the isolated-gloss dataset ASL3DWord favorable $\tau_3$ can be attained, which isolates the size of the
sentence-level paired dataset as the bottleneck.
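
To illustrate how a distributional score and a pairing-aware ratio can disagree, here is a minimal Python sketch under stated assumptions. The ratio below (mean distance from each generated latent to its own target, divided by the mean distance to other samples' targets) is only one plausible reading of a "pairwise-distance ratio" and is not claimed to be the paper's definition of the faithfulness level. The function names are hypothetical, and random vectors stand in for MoAE latents.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to the rows of x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def matched_to_mismatched_ratio(gen, tgt):
    """ASSUMED ratio (not the paper's definition): mean distance of each
    generated latent to its own target over mean distance to other targets.
    Values near 0 suggest faithful pairing; values near 1 suggest none."""
    d = np.linalg.norm(gen[:, None, :] - tgt[None, :, :], axis=-1)  # (n, n)
    n = d.shape[0]
    matched = np.trace(d) / n
    mismatched = (d.sum() - np.trace(d)) / (n * (n - 1))
    return float(matched / mismatched)


# Toy latents: random vectors standing in for frozen-autoencoder encodings.
rng = np.random.default_rng(0)
targets = rng.normal(size=(200, 8))

# A faithful generator: each output is a slightly noisy copy of its target.
faithful = targets + 0.1 * rng.normal(size=targets.shape)

# A collapsed generator: the right distribution, but the wrong pairing.
collapsed = targets[rng.permutation(len(targets))]

fd_collapsed = frechet_distance(collapsed, targets)
ratio_faithful = matched_to_mismatched_ratio(faithful, targets)
ratio_collapsed = matched_to_mismatched_ratio(collapsed, targets)

print(f"Frechet distance (collapsed): {fd_collapsed:.6f}")
print(f"ratio (faithful):  {ratio_faithful:.3f}")
print(f"ratio (collapsed): {ratio_collapsed:.3f}")
```

In this toy setup the collapsed generator reuses the target set with shuffled pairing, so the two fitted Gaussians coincide and the Fréchet distance is close to zero. The ratio still stays close to 1, because matched targets are no closer than mismatched ones.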