Technology
HNSW-MS: Hierarchical Graph Indexing Enables Accurate Real-Time Mass Spectral Similarity Search at Repository Scale
Key Points
Spectral similarity search is the foundation of mass spectrometry-based metabolomics, underpinning library matching, molecular network construction, and repository searches such as MASST. Until recently, datasets were small enough that exhaustive pairwise comparison remained tractable. This is no longer true. Public repositories such as GNPS now exceed one billion spectra, and the emerging paradigm of reverse metabolomics (placing experimental spectra into the context of all existing public data to drive annotation and discovery) demands search at a scale where linear, sequential comparison is no longer viable. We introduce HNSW-MS, which implements Hierarchical Navigable Small World (HNSW) graph indexing natively for mass spectral similarity, operating directly on raw GC-MS and LC-MS/MS spectra without preprocessing or embedding, thereby maximizing reproducibility. Validated on a set of 8.4 million MS/MS spectra, HNSW-MS achieves up to 560-fold acceleration over linear scan while maintaining top-1 recall above 90%, with perfect recall achievable at moderate parameter settings. This acceleration removes the search bottleneck at repository scale, enabling near real-time spectral querying against the entirety of public metabolomics data.
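To make the two quantities in this summary concrete, the following is a minimal sketch of the exhaustive linear-scan baseline and of how top-1 recall could be computed for an approximate index. It is an illustration under stated assumptions, not the HNSW-MS implementation. The function names (`cosine`, `linear_scan_top1`, `top1_recall`), the m/z tolerance of 0.01, the simplified tolerance-matched cosine score, and the toy peak lists are all hypothetical. The similarity measure actually used by HNSW-MS may differ.

```python
import math


def cosine(a, b, tol=0.01):
    """Simplified cosine score between two peak lists.

    Each spectrum is a list of (mz, intensity) tuples sorted by m/z.
    Peaks are paired in a single merge pass when their m/z values
    differ by at most `tol` (an assumed tolerance).
    """
    i = j = 0
    dot = 0.0
    while i < len(a) and j < len(b):
        diff = a[i][0] - b[j][0]
        if abs(diff) <= tol:
            dot += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    norm_a = math.sqrt(sum(x * x for _, x in a))
    norm_b = math.sqrt(sum(x * x for _, x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_scan_top1(query, library, tol=0.01):
    """Exhaustive baseline: score the query against every library spectrum."""
    scores = [cosine(query, spectrum, tol) for spectrum in library]
    best = max(range(len(library)), key=scores.__getitem__)
    return best, scores[best]


def top1_recall(approx_hits, exact_hits):
    """Fraction of queries whose approximate top hit equals the exact top hit."""
    agree = sum(a == e for a, e in zip(approx_hits, exact_hits))
    return agree / len(exact_hits)


# Toy library of three spectra (hypothetical values, for illustration only).
library = [
    [(50.0, 10.0), (77.04, 100.0), (105.03, 40.0)],
    [(57.07, 100.0), (71.09, 60.0), (85.10, 30.0)],
    [(77.04, 80.0), (105.03, 100.0), (122.04, 20.0)],
]
query = [(77.04, 95.0), (105.03, 45.0)]

hit, score = linear_scan_top1(query, library)

# Hypothetical top hits for four queries: exact (linear scan) versus an
# approximate index that misses the last one.
exact_hits = [0, 1, 2, 2]
approx_hits = [0, 1, 2, 0]
recall = top1_recall(approx_hits, exact_hits)
```

The linear scan performs one similarity computation per library spectrum for every query, so its cost grows in direct proportion to repository size. A graph index aims to reach the same top hit after scoring only a small fraction of the library, and top-1 recall measures how often it succeeds.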