Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
arXiv:2604.26498v3
Abstract: The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint-level compound-label records. The benchmark contains 78 endpoint-and-split entries evaluated under random, Murcko scaffold and structure-separated 5-fold cross-validation protocols, representing increasing chemical generalization difficulty. Across 156 task-and-metric comparisons, classical machine learning (ML) provides the largest share of best-performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (GNNs, 21.8%) and LLM-based structure-activity relationship (SAR) baselines (1.9%). Classical ML dominates random-split interpolation and remains the largest winner family overall. GNNs and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
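To make the paired bootstrap idea concrete, the following is a minimal sketch of how a difference between two models might be checked on a shared test set. It is not the authors' procedure: the function names `paired_bootstrap_diff` and `accuracy`, the use of accuracy as the metric, the percentile interval, and the resample count are all assumptions chosen for illustration. The key point it shows is the pairing: each resample draws the same compound indices for both models, so the interval reflects the difference between them on identical compounds.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of matching labels (illustrative metric, an assumption)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_diff(y_true, pred_a, pred_b, metric, n_boot=2000, seed=0):
    """Percentile bootstrap interval for metric(A) - metric(B).

    Hypothetical helper: resamples test compounds with replacement and
    applies the SAME resampled indices to both models (the pairing).
    Returns (observed_difference, lower_bound, upper_bound) for an
    approximate 95% interval.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(yt, a) - metric(yt, b))
    diffs.sort()
    lower = diffs[int(0.025 * (n_boot - 1))]
    upper = diffs[int(0.975 * (n_boot - 1))]
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    return observed, lower, upper


if __name__ == "__main__":
    # Toy labels and predictions, invented purely for demonstration.
    toy_rng = random.Random(42)
    labels = [toy_rng.randint(0, 1) for _ in range(200)]
    model_a = [y if toy_rng.random() < 0.82 else 1 - y for y in labels]
    model_b = [y if toy_rng.random() < 0.80 else 1 - y for y in labels]
    obs, lo, hi = paired_bootstrap_diff(labels, model_a, model_b, accuracy)
    print(f"difference = {obs:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

If the resulting interval contains zero, the observed gap between the two models is within resampling noise on that test set, which is the sense in which a small numerical difference should not be read as a decisive victory.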