Estimating the number of sequencing errors in microbial diversity studies / Di Cecco, Davide; Tancredi, Andrea. - In: ENVIRONMENTAL AND ECOLOGICAL STATISTICS. - ISSN 1573-3009. - (2024). [10.1007/s10651-024-00614-w]
Estimating the number of sequencing errors in microbial diversity studies
Andrea Tancredi
2024
Abstract
Species diversity analysis of microbial communities is an important tool for assessing an ecosystem’s health. The advent of high-throughput genome sequencing techniques has made it possible to process an unprecedented number of RNA sequences. However, many studies report the presence of a significant number of fictitious rare species in datasets generated using these techniques. These species are the product of errors that can occur at any step of the sequence analysis pipeline. The overcount of rare species (especially singletons) affects the estimation of the total number of species, and of the diversity of the community as measured by Shannon’s index. To avoid overestimating these quantities, it is crucial to model the source of error. In this work, we present a new model that treats spurious singletons as false-negative record linkage errors, and compare it with another approach where spurious singletons are considered for deletion. We discuss the two inferential approaches both with an application to real data and on theoretical grounds. We demonstrate that, while Shannon’s index can differ significantly under the two models, the two models yield equivalent estimates of the total number of species.