ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models

bioRxiv, Monday 08 June 2026, 00:00 UTC. By Garibbo, M., Boxo Corominas, G., Stocco, F., Illanes Vicioso, R., Middendorf, L., Ferruz, N.

Key Points

Generative protein language models (pLMs) enable exploration of vast sequence spaces for protein design, but reliably controlling generation toward desired functional families remains challenging. While protein generation has broadly followed trends in NLP, two directions remain underexplored: alignment methods that optimize model behavior toward design objectives, and prompting-based control at inference time without fine-tuning. We introduce ProtGPT3, an open-source family of protein language models spanning 112M to 10B parameters and integrated with the Hugging Face ecosystem. The suite includes both single-sequence models and models promptable with a multiple sequence alignment (MSA), enabling flexible conditioning for generation. Across model scales and control settings, we systematically compare supervised fine-tuning and few-shot prompting using homologous sequences. Analogous to how large language models (LLMs) are routinely aligned with user intent, we study post-training alignment in single-sequence models using sequence-complexity and structure-confidence metrics across the proteome. We find that alignment reduces low-complexity generations while preserving sequence diversity. Furthermore, we show that few-shot prompting is a competitive and more scalable alternative to supervised fine-tuning for controlled generation. In a low-data defluorinase case study, ProtGPT3-MSA achieved higher computational success rates than fine-tuned baselines and produced designs that were soluble and expressed following experimental validation. Finally, we explore the potential of inference-time compute in MSA models by introducing a homolog-based Feynman-Kac inference procedure for steering protein generation toward desired targets. We make our models publicly available at https://huggingface.co/collections/AI4PD/protgpt3-family.
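To make the notion of a "low-complexity generation" concrete, the sketch below scores a sequence by the Shannon entropy of its residue composition over sliding windows. This is only an illustration: the abstract does not say which sequence-complexity metric the authors use, and the function names, the window length of 12, and the 2.2-bit threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def min_window_entropy(sequence: str, window: int = 12) -> float:
    """Lowest entropy over all sliding windows of the given length.

    Sequences shorter than or equal to the window are scored as a whole.
    """
    if len(sequence) <= window:
        return shannon_entropy(sequence)
    return min(
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    )


def is_low_complexity(sequence: str, window: int = 12,
                      threshold: float = 2.2) -> bool:
    """Flag a sequence if any window falls below the entropy threshold.

    The window and threshold are illustrative assumptions, not values
    taken from the ProtGPT3 paper.
    """
    return min_window_entropy(sequence, window) < threshold


if __name__ == "__main__":
    repetitive = "MKT" + "Q" * 20 + "LLV"   # contains a poly-Q stretch
    diverse = "ACDEFGHIKLMNPQRSTVWY"        # each residue appears once
    print(is_low_complexity(repetitive))
    print(is_low_complexity(diverse))
```

A filter of this kind could be applied to a batch of generated sequences to estimate the fraction of low-complexity outputs before and after alignment; the metric actually reported in the paper may differ.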


Originally published by bioRxiv.
