Phase transition in large language models and the criticality of natural languages
arXiv:2406.05335v3
Abstract: Generation of text and speech in natural languages can be modeled as a stochastic process. This idea dates back to the seminal work of Markov and, later, to that of Shannon, and it also underlies the recent development of large language models (LLMs). The stochastic processes corresponding to natural languages should be distinct from those that generate nonlinguistic sequences. One of the features that discriminate between linguistic and nonlinguistic sequences is power-law behavior, which is universally observed across different languages. In statistical physics, such behavior suggests that natural languages are critical: they lie near a phase transition point in a parametrized space of stochastic processes. However, testing this conjecture is not straightforward. A phase transition, even if it exists, cannot be directly observed in real-world natural languages because they do not have any controllable parameters. Here, we use LLMs as controllable effective models of natural languages. Through statistical analyses of texts generated by LLMs, we find that, when a parameter analogous to physical temperature is varied, LLMs undergo a phase transition. The transition separates a low-temperature phase, in which generated texts show complex repetitive structures, from a high-temperature phase, in which LLMs generate incomprehensible texts. At the critical point between these phases, generated texts display power-law behavior similar to that of natural languages and most closely resemble natural languages as measured by a standard metric in natural language processing. These findings strongly suggest that natural languages are indeed critical.
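The abstract does not say how the temperature parameter enters the sampling procedure. As a minimal sketch, assuming the common convention in which next-token logits are divided by a temperature T before the softmax, the following illustrates how T controls the spread of the sampling distribution. The function names and the example logits are hypothetical and are not taken from the paper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities p_i proportional to exp(logit_i / T).

    Assumes the usual softmax-temperature convention; the paper's
    exact procedure is not given in the abstract.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]
for T in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: entropy={entropy(probs):.3f}")
```

At low T the distribution concentrates on the highest-logit token, and at high T it approaches uniform. This matches the qualitative picture of a repetitive low-temperature phase and an incomprehensible high-temperature phase, although a toy four-token distribution of this kind says nothing about where a critical point lies.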