This talk was delivered as a guest lecture for the Bioinformatics module CS3110/CS5110 for the Computer Science courses at Royal Holloway.
With a focus on next-generation RNA-Seq, the talk first recaps basic molecular biology definitions relevant to genomics and transcriptomics. It then gives an insight into the complexity of fully quantifying a transcriptome. For instance, H. sapiens has ~3.72x10^13 cells, roughly 300,000 RNA molecules per cell and an average gene size of ~28 kilobase pairs, so a full representation of the human transcriptome comprises approximately 3.1x10^23 (28,000 x 300,000 x 3.72x10^13) RNA bases. Sequencing of the transcriptome (and genome) therefore generates large amounts of very complicated raw experimental sequencing data, typically described as "big data".
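The back-of-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic as stated in the abstract; the variable names are illustrative, and treating every RNA molecule as being as long as an average gene is a deliberate simplification (processed transcripts are generally shorter than the genes they come from).

```python
# Figures quoted in the abstract (approximate values).
cells_in_human_body = 3.72e13      # ~3.72 x 10^13 cells
rna_molecules_per_cell = 300_000   # RNA molecules per cell
average_gene_size_bp = 28_000      # ~28 kilobase pairs

# Simplifying assumption: each RNA molecule spans one average-sized gene.
total_rna_bases = (
    cells_in_human_body * rna_molecules_per_cell * average_gene_size_bp
)

print(f"Estimated RNA bases: {total_rna_bases:.2e}")
```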
The talk then discusses Sanger sequencing and massively parallel Next-generation Sequencing (NGS). The process of miniaturisation and massive parallelisation is depicted and explained for NGS technologies.
The most common task in transcriptomics, estimating gene expression, is discussed with respect to RNA-Seq, covering read distribution, the normalisation of expression estimates and the types of bias that can be introduced.
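As an illustration of what normalising expression estimates can involve, the sketch below computes transcripts per million (TPM), one widely used normalisation. The abstract does not say which method the talk covers, so the choice of TPM, the function name and the toy counts are all assumptions made for this example.

```python
def transcripts_per_million(read_counts, gene_lengths_bp):
    """Normalise raw read counts by gene length, then by library size.

    read_counts and gene_lengths_bp are dicts keyed by gene name
    (hypothetical inputs for illustration).
    """
    # Step 1: reads per kilobase, which corrects for gene length.
    reads_per_kb = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000)
        for gene in read_counts
    }
    # Step 2: scale so that the values sum to one million.
    total = sum(reads_per_kb.values())
    if total == 0:
        return {gene: 0.0 for gene in read_counts}
    return {gene: rate / total * 1e6 for gene, rate in reads_per_kb.items()}


# Toy data: both genes have the same raw count, but geneB is twice as long.
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 2000}
tpm = transcripts_per_million(counts, lengths)
print(tpm)
```

With equal raw counts, the longer gene receives the lower normalised value, which is the length bias that this kind of normalisation is meant to correct.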
We conclude by introducing two common transcriptomics file formats, GTF/GFF for genome annotation and SAM for aligned reads, and by briefly discussing the analysis of RNA-Seq "big data" using cluster computing and the cloud.
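Both formats are tab-delimited text, so a single record can be split into named fields with very little code. The following is a minimal sketch, assuming the standard nine-column GTF layout and the eleven mandatory SAM columns; the function names and the two example records are made up for illustration, and real pipelines would normally use a dedicated library.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_gtf_line(line):
    """Split one GTF annotation line into a dict of its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF fields")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: gene_id "G1"; transcript_id "T1";
    record["attributes"] = dict(
        re.findall(r'(\S+) "([^"]*)"', record["attributes"])
    )
    return record


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated SAM fields")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


# Made-up example records, not taken from any real data set.
gtf = parse_gtf_line(
    'chr1\texample\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1";'
)
sam = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(gtf["attributes"]["gene_id"], sam["rname"], sam["pos"])
```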
16 Mar 2017
Guest lecture for the Bioinformatics module CS3110/CS5110, Royal Holloway