TaxoFormer: Hierarchical Transformer for Predicting the Full Taxonomic Lineage of Protein Sequences

bioRxiv, Tuesday 09 June 2026, 00:00 UTC. By Parsa, M., Azimian, K., Wei, K. Y.

Key Points

Predicting labels in massive, hierarchically structured output spaces is a core challenge in machine learning. In this work, we use the problem of predicting the full taxonomic lineage of a protein from its sequence as a case study for this challenge. We introduce TaxoFormer, an architecture whose primary contribution is a structured tokenization scheme that losslessly represents the entire NCBI phylogenetic tree, a graph with over 1.3 million nodes, using a compact vocabulary of just 15,000 tokens. By coupling a pre-trained ESM-2 model with an autoregressive decoder and training with a standard cross-entropy objective, we test the hypothesis that a simple generative objective is sufficient to learn complex, latent structure when the output space is explicitly modeled. We show that this approach is highly effective: on a dataset of 188 million proteins, the model not only achieves accurate lineage prediction but also implicitly learns a continuous, phylogenetically structured latent space. This work provides a scalable, alignment-free method for taxonomic annotation and demonstrates that explicitly modeling the structure of a complex output space is a powerful mechanism for learning meaningful representations.
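The abstract does not describe how the tokenization works, so the following is only a minimal sketch of one way a tree can be encoded losslessly with a vocabulary far smaller than its node count. It assumes a hypothetical scheme in which each token is a node's index among its siblings, so the vocabulary is bounded by the largest branching factor rather than by the number of nodes. The toy taxonomy and the names `PARENT`, `encode`, `decode`, and `vocab_size` are illustrative assumptions and are not taken from TaxoFormer.

```python
from collections import defaultdict

# Toy taxonomy (child -> parent). Illustrative only, not the NCBI tree.
PARENT = {
    "Bacteria": "root",
    "Eukaryota": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Chordata": "Eukaryota",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
    "Homo": "Chordata",
}


def build_children(parent):
    """Invert the child -> parent map, sorting siblings for a stable order."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    for kids in children.values():
        kids.sort()
    return children


CHILDREN = build_children(PARENT)


def encode(node):
    """Encode a node as its root-to-node path of sibling indices."""
    tokens = []
    while node != "root":
        par = PARENT[node]
        tokens.append(CHILDREN[par].index(node))
        node = par
    return tokens[::-1]


def decode(tokens):
    """Recover the full lineage (root excluded) from sibling indices."""
    node = "root"
    lineage = []
    for t in tokens:
        node = CHILDREN[node][t]
        lineage.append(node)
    return lineage


def vocab_size():
    """Tokens needed under this scheme: the largest number of siblings."""
    return max(len(kids) for kids in CHILDREN.values())


if __name__ == "__main__":
    path = encode("Salmonella")
    print(path, decode(path), vocab_size(), len(PARENT))
```

In this toy tree, nine nodes are covered by two distinct tokens, and every lineage decodes back exactly. An autoregressive decoder trained with cross-entropy would then predict such a path one token at a time, from the root down. A scheme for a real taxonomy would likely need to handle very wide nodes differently, which this sketch does not attempt.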


Originally published by bioRxiv.
