Home Science samsampleX: Distribution-aware downsampling for...
Science

samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data

Key Points

High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools apply uniform sampling of sequencing reads, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs.

High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools apply uniform sampling of sequencing reads, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments, outperforming uniform downsampling methods at preserving depth variability across genomic regions such as the HLA locus and hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering enhanced flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
NGS (ORG) Python (ORG) HLA (ORG)
Originally published by bioRxiv Read original →