Estimating the number of sequencing errors in microbial diversity studies / Di Cecco, Davide; Tancredi, Andrea. - In: ENVIRONMENTAL AND ECOLOGICAL STATISTICS. - ISSN 1573-3009. - (2024). [10.1007/s10651-024-00614-w]

Estimating the number of sequencing errors in microbial diversity studies

Davide Di Cecco; Andrea Tancredi
2024

Abstract

Species diversity analysis of microbial communities is an important tool for assessing an ecosystem's health. The advent of high-throughput genome sequencing techniques has made it possible to process an unprecedented number of RNA sequences. However, many studies report the presence of a significant number of fictitious rare species in datasets generated using these techniques. These species are the product of errors that can occur at any step of the sequence analysis pipeline. The overcount of rare species (especially singletons) affects the estimation of the total number of species, and of the diversity of the community as measured by Shannon’s index. To avoid overestimating these quantities, it is crucial to model the source of error. In this work, we present a new model that treats spurious singletons as false-negative record linkage errors, and compare it with another approach where spurious singletons are considered for deletion. We discuss the two inferential approaches both with an application to real data and on theoretical grounds. We demonstrate that, while Shannon’s index can differ significantly under the two models, the estimate of the total number of species is equivalent.
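As a rough illustration of why spurious singletons matter, the sketch below computes Shannon's index, H = -sum(p_i ln p_i), on a small made-up abundance vector before and after some reads are misrecorded as singleton species. The counts, the function name `shannon_index`, and the error scenario are assumptions chosen for illustration only; this is not the model proposed in the paper.

```python
import math

def shannon_index(counts):
    """Shannon's index H = -sum(p_i * ln p_i) over species with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical abundance counts for four true species (made-up numbers).
true_counts = [50, 30, 15, 5]

# Same 100 reads, but 10 reads of the most abundant species are assumed to be
# misread, each one being recorded as its own spurious singleton "species".
observed_counts = [40, 30, 15, 5] + [1] * 10

h_true = shannon_index(true_counts)
h_observed = shannon_index(observed_counts)

print(f"true species: {len(true_counts)}, H = {h_true:.3f}")
print(f"observed species: {len(observed_counts)}, H = {h_observed:.3f}")
```

In this toy setting both the observed number of species and the observed Shannon's index exceed their true values, which is the kind of overestimation the abstract refers to.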
Approximate Bayesian Computation; Linkage errors; Microbial diversity; Sequencing errors
01 Journal publication::01a Journal article
Files attached to this item

Tancredi_Estimating-number_2024.pdf

Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.2 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1709378