Unraveling the complexity of the human transcriptome: analysis and integration of high-throughput data

Le Pera, Loredana

BACKGROUND: A clear outcome of the last decade of genomics research is the realization of how little we know about the complexity of the human transcriptome. Researchers first found that there were far fewer protein-coding genes than expected (<25,000), only to discover later that there were far more non-protein-coding transcripts than expected (~30,000). Transcriptome diversity is also due to a wider occurrence of alternative splicing than previously suspected: it is now clear that about 95% of human multi-exon genes undergo alternative splicing events. At the same time, a number of studies from others and from our group have shown that a significant fraction of all generated transcripts, or isoforms, are unlikely to be translated into functional protein products. Although the emerging complex picture of transcriptomes is exciting, and genome-scale information is freely accessible from public repositories, approaches that permit a comprehensive characterization and efficient validation of the data still need to be developed. In this context, the progress and contributions of next-generation sequencing techniques have been crucial. By generating high-throughput experimental data and processing them with bioinformatics analyses, it is potentially possible to investigate and compare the behaviour of both protein-coding and non-protein-coding elements in specific biological and biomedical problems. Both the large-scale data generated by international consortium projects and the high-throughput data produced in specific experiments challenge us to translate this massive information into meaningful biological insights.
AIM: My thesis focuses on the identification and characterization of functional protein-coding and non-protein-coding products of the human transcriptome, by integrating various computational methods and biological data.
The first part of the study aims at defining a strategy to assess the protein-coding potential of alternative splicing products, by analyzing all genome-scale transcripts available in public repositories. Sequence and structure features of known proteins are investigated and combined using bioinformatics approaches. This study attempts to assess the ability of different criteria to detect most of the actually translated isoforms and can be of great help in estimating the real size of the human proteome (a piece of information that is still missing). The second part of the study aims at detecting functional protein-coding and non-protein-coding transcripts by analyzing high-throughput RNA-sequencing expression data in specific biomedical problems. The analysis investigates mRNA expression profiles, taking into account all the possible isoforms, and microRNA expression profiles. In order to assess a putative key regulatory role of microRNAs in the biological processes under study, a procedure is developed to identify enriched microRNA/mRNA-target associations.
RESULTS: The first analysis focused on human protein-coding genes having at least one protein isoform unequivocally identified by mass-spectrometry peptide data (Positive dataset) and at least one other isoform with no evidence of translation (Unknown dataset). A number of sequence and structural features of typical functional proteins were used to compare the properties of the two isoform datasets. We found that Positive isoforms are predicted to be structurally more plausible than Unknown isoforms; functional domains are more often truncated in Unknown isoforms than in Positive ones; and functional features such as active sites are rarely disrupted in Positive isoforms. Combining the presence of non-truncated functional domains with an assessment of the plausibility of the modelled structure, we estimate that about 45% of the Unknown isoforms are non-plausible protein-coding transcripts.
To capture further differences between functional and non-functional transcripts, the following features were also investigated: selection pressure, conservation among multiple species, and a stringent comparison between human and mouse. The second analysis aimed at detecting all the putative functional transcripts involved in biological processes investigated by RNA sequencing. Specialized bioinformatics procedures were set up to analyze both mRNA and microRNA expression profiles and were applied to specific biomedical problems in collaboration with experimental groups. Additionally, by integrating microRNA/target-gene prediction algorithms, the method allows the automatic identification of differentially expressed microRNAs predicted to bind mRNAs that are inversely regulated in the same experiment. This combined analysis of microRNA and target-mRNA expression data is useful to isolate putative regulatory circuits involved in the investigated biological processes.
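
The pairing of differentially expressed microRNAs with inversely regulated predicted targets can be illustrated with a minimal sketch. It is not the procedure developed in the thesis: the function names, the log2 fold-change threshold, the identifiers (`miR-A`, `GENE1`, ...) and the use of a hypergeometric test to score enrichment are all assumptions made for illustration only.

```python
from math import comb


def inverse_pairs(mirna_lfc, mrna_lfc, predicted_targets, min_abs_lfc=1.0):
    """Return sorted (microRNA, mRNA) pairs that are both differentially
    expressed, regulated in opposite directions, and linked by a predicted
    microRNA/target interaction.

    mirna_lfc, mrna_lfc: dicts mapping identifier -> log2 fold change.
    predicted_targets:   dict mapping microRNA -> set of predicted target genes.
    min_abs_lfc:         hypothetical cutoff for "differentially expressed".
    """
    pairs = []
    for mirna, m_lfc in mirna_lfc.items():
        if abs(m_lfc) < min_abs_lfc:
            continue  # microRNA not differentially expressed
        for gene in predicted_targets.get(mirna, ()):
            g_lfc = mrna_lfc.get(gene)
            if g_lfc is None or abs(g_lfc) < min_abs_lfc:
                continue  # target not measured or not differentially expressed
            if m_lfc * g_lfc < 0:  # opposite signs: inversely regulated
                pairs.append((mirna, gene))
    return sorted(pairs)


def hypergeom_pvalue(k, K, n, N):
    """Upper-tail probability P(X >= k) for a hypergeometric variable:
    N genes in total, K of them predicted targets of the microRNA,
    n inversely regulated genes drawn, k of which are predicted targets."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Toy data (entirely made up for the example).
mirna_lfc = {"miR-A": 2.0, "miR-B": -0.2}
mrna_lfc = {"GENE1": -1.5, "GENE2": 1.8, "GENE3": -2.1, "GENE4": 0.1}
predicted_targets = {"miR-A": {"GENE1", "GENE2", "GENE4"}, "miR-B": {"GENE3"}}

pairs = inverse_pairs(mirna_lfc, mrna_lfc, predicted_targets)
```

With this toy input, only `miR-A` passes the threshold, and only `GENE1` is both a predicted target and regulated in the opposite direction. A small upper-tail probability from `hypergeom_pvalue` would suggest that a microRNA's predicted targets are over-represented among the inversely regulated genes.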

Unraveling the complexity of the human transcriptome: analysis and integration of high-throughput data / LE PERA, Loredana. - (2013 Feb 27).