Mapping Chemical Diversity: Descriptor-Guided Clustering of Natural Products in the COCONUT Database

Natural products represent a major source of bioactive compounds for drug discovery, yet their exploration remains challenging due to extensive structural complexity and scaffold diversity. Using the COCONUT database, we developed a cluster-oriented framework to systematically map and characterize the natural product chemical space through feature engineering, molecular clustering, and representative-based analysis. Descriptor selection identified a greedy maximum coverage strategy with a 0.35-0.85 correlation threshold range and 20 descriptors as the optimal feature set, enriched in physicochemical and graph-topological properties. Comparative evaluation of clustering approaches identified UMAP-HDBSCAN as the best-performing pipeline, generating 1,683 clusters with silhouette scores of 0.42 before and 0.24 after noise reassignment. Cluster profiling revealed a highly heterogeneous scaffold landscape, with 67.56% of clusters exhibiting low scaffold dominance and only 15.21% representing highly scaffold-dominated regions, supporting a chemical space composed largely of interconnected transitional clusters. Descriptor analyses showed that natural product clusters were generally enriched in saturated, low-aromaticity chemotypes with moderate lipophilicity and constrained molecular flexibility. Representative-based analyses demonstrated that central representatives (medoid and centroid-closest molecules) closely captured cluster-average properties, whereas diverse representatives better reflected structural breadth, findings further supported through descriptor-based and docking-based validation. Collectively, the results reinforce the natural product chemical space as a continuous yet structured manifold and provide a representative-guided framework for its efficient exploration in drug discovery applications. The complete data can be accessed at: https://github.com/shrek-28/DescriptorClusteringNPSpace
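
The distinction between the two kinds of central representative named above (medoid and centroid-closest molecule) can be illustrated with a minimal sketch. It assumes each molecule is already encoded as a numeric descriptor vector and that Euclidean distance is used; the abstract does not state the distance metric, and the function names and toy coordinates below are hypothetical, not taken from the authors' repository.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def centroid_closest(points):
    """Index of the cluster member nearest to the cluster centroid."""
    c = centroid(points)
    return min(range(len(points)), key=lambda i: dist(points[i], c))


def medoid(points):
    """Index of the member with the smallest summed distance to all members."""
    return min(
        range(len(points)),
        key=lambda i: sum(dist(points[i], q) for q in points),
    )


# Hypothetical toy cluster in a 2-D descriptor space (illustrative values only).
# The last point is an outlier that pulls the centroid away from the dense core.
cluster = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.6],
    [4.0, 4.0],
]

medoid_index = medoid(cluster)
centroid_closest_index = centroid_closest(cluster)
print("medoid:", cluster[medoid_index])
print("centroid-closest:", cluster[centroid_closest_index])
```

In this toy case the two definitions select different members: the outlier shifts the centroid, so the centroid-closest member moves toward it, while the medoid stays inside the dense core. On compact clusters the two usually coincide or lie close together.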
Originally published by bioRxiv.