Science
Disentangling RNA evolution and thermodynamics in genomic language models
Key Points
Genomic language models (gLMs) trained only on large-scale nucleic acid sequence data appear to capture signals of RNA structure, yet how they do so remains unclear. Using the categorical Jacobian (CJ), a model-agnostic operation for querying pairwise dependencies, we systematically compared three flagship gLMs: RNA-FM, Evo 2, and gLM2. We found that CJ signals recover base pairs supported by evolutionary covariation analyses, consistent with findings in protein language models. Surprisingly, CJ also recovers base pairs that lack evolutionary support but are predicted by biophysical nearest-neighbor models. Is it possible that gLMs have "learned" RNA thermodynamics? We noticed that nearest-neighbor RNA folding models often predict reflected structures when given reversed sequences, consistent with these models' modular and grammar-like nature. We leveraged this observation to create a simple "mirror test", which we found gLMs routinely fail, indicating that they have not learned generalizable biophysics-based rules for RNA structure. Nevertheless, their apparent thermodynamic signal potentially confounds the interpretation of gLM pairwise dependencies as evidence of evolutionary conservation. We therefore introduce a method that uses synthetic sequences as a control for detecting significant learned signal. Our results demonstrate that gLMs can mimic thermodynamics through learned sequence context rather than general physical principles, but that solutions exist for disentangling these patterns in language models.
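To make the idea behind the "mirror test" concrete, here is a minimal sketch of how agreement between a structure predicted for a sequence and the structure predicted for its reversal might be scored. It is not the authors' implementation. The function names (`parse_pairs`, `mirror_pairs`, `mirror_agreement`), the use of dot-bracket strings as input, and the "fraction of forward pairs recovered" score are all assumptions made for illustration. How CJ maps are turned into discrete base pairs is not described here, so the sketch starts from pairs that have already been called.

```python
def parse_pairs(dot_bracket):
    """Return the set of (i, j) base pairs, with i < j, from a dot-bracket string."""
    stack, pairs = [], set()
    for idx, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def mirror_pairs(pairs, length):
    """Reflect base pairs: position i in a sequence maps to length - 1 - i in its reversal."""
    return {(length - 1 - j, length - 1 - i) for i, j in pairs}


def mirror_agreement(forward_pairs, reversed_pairs, length):
    """Fraction of forward pairs that reappear, after reflection, in the
    pairs predicted for the reversed sequence (hypothetical scoring choice)."""
    if not forward_pairs:
        return 0.0
    reflected = mirror_pairs(reversed_pairs, length)
    return len(forward_pairs & reflected) / len(forward_pairs)


# Toy inputs (made up for illustration, not data from the study).
forward = parse_pairs("((((....)))).")    # pairs called on the original sequence
symmetric = parse_pairs(".((((....))))")  # a predictor that reflects the structure
no_signal = set()                         # a predictor that finds nothing on the reversal

print(mirror_agreement(forward, symmetric, 13))
print(mirror_agreement(forward, no_signal, 13))
```

Under this scoring, a predictor that reflects its structure when the sequence is reversed, as nearest-neighbor folding models often do, would score near 1, while a predictor relying on learned sequence context would be expected to score much lower on reversed inputs.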