The Analysis of High-Throughput Biological Datasets Utilising Distributed Computing

Jamie Alnasir

Department of Computer Science

Research output: Thesis › Doctoral Thesis

Abstract

This thesis explores the use of distributed computing to analyse high-throughput biological datasets, in particular sequencing data, whose volume is growing at an unprecedented rate. These large, complex datasets are routinely deposited in public archives such as the Sequence Read Archive (SRA); as of January 2017, the SRA alone contained over a petabyte of data.

We conduct a detailed literature review of the biochemical protocol steps used to prepare nucleic acid samples for sequencing, and describe the biases that these workflow steps can introduce at the molecular level.

Turning to sequencing metadata, we quantified the level of annotation of 29,958 experiments deposited in the SRA by searching the records for keywords associated with three protocol steps. We found that 7.10%, 5.84% and 7.57% of all records contained at least one keyword for the fragmentation, ligation and enrichment steps, respectively. Only 5.58% of all SRA records had annotation for all three steps.
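The keyword-based quantification described above can be illustrated with a minimal sketch. The keyword lists, the function names and the treatment of each record as a free-text protocol description are all assumptions made for illustration; the search terms actually used in the thesis are not given in this abstract.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical keyword lists, one per protocol step (not the thesis's own terms).
STEP_KEYWORDS: Dict[str, List[str]] = {
    "fragmentation": ["sonication", "nebulization", "fragment"],
    "ligation": ["ligation", "adapter", "ligase"],
    "enrichment": ["pcr", "amplification", "enrich"],
}


def annotated_steps(text: str, step_keywords: Dict[str, List[str]] = STEP_KEYWORDS) -> Set[str]:
    """Return the protocol steps for which the text contains at least one keyword."""
    lowered = text.lower()
    return {
        step
        for step, keywords in step_keywords.items()
        if any(keyword in lowered for keyword in keywords)
    }


def annotation_rates(
    records: Iterable[str],
    step_keywords: Dict[str, List[str]] = STEP_KEYWORDS,
) -> Dict[str, float]:
    """Percentage of records annotated for each step, and for all steps at once."""
    texts = list(records)
    counts = {step: 0 for step in step_keywords}
    all_steps = 0
    for text in texts:
        found = annotated_steps(text, step_keywords)
        for step in found:
            counts[step] += 1
        if len(found) == len(step_keywords):
            all_steps += 1
    total = len(texts)
    if total == 0:
        # No records: report zero rather than dividing by zero.
        rates = {step: 0.0 for step in step_keywords}
        rates["all_steps"] = 0.0
        return rates
    rates = {step: 100.0 * n / total for step, n in counts.items()}
    rates["all_steps"] = 100.0 * all_steps / total
    return rates
```

Substring matching of this kind is deliberately crude: it counts a record as annotated whenever any keyword appears, which is in the spirit of "at least one keyword" but says nothing about the quality of the annotation.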

To assess Hadoop for structural biology, we tested MapReduce for processing semi-structured data from the Protein Data Bank (PDB). We also compared Hadoop with a batch scheduler for executing molecular docking and structural analysis jobs, and found it to be competitive.

Finally, we develop an analysis system using MapReduce on Spark that quantifies sequence-specific deviations in the distribution of mapped RNA-Seq reads, and apply it to two organisms. The first analysis covers two transcriptomes of the fruit fly D. melanogaster, sequenced in the same lab and differing only by a mutation [gl60j] of the eye antennal disc. The second covers three samples from H. sapiens, prepared in a controlled way using in-vitro transcription (IVT), with different RNA-Seq preparatory protocols applied.
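As a rough illustration of the MapReduce pattern mentioned above, the sketch below counts mapped reads by their start position within an exon. It is not the thesis's pipeline: the record layout `(exon_id, start_offset)` and all function names are hypothetical, and plain Python stands in for Spark, where the same two stages would typically be expressed with an RDD's `map` and `reduceByKey`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (exon identifier, read start offset within the exon).
Read = Tuple[str, int]


def map_read(read: Read) -> Tuple[Read, int]:
    """Map stage: emit a (key, 1) pair for each mapped read."""
    return (read, 1)


def reduce_by_key(pairs: Iterable[Tuple[Read, int]]) -> Dict[Read, int]:
    """Reduce stage: sum the values that share a key."""
    totals: Dict[Read, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def start_position_counts(reads: Iterable[Read]) -> Dict[Read, int]:
    """Count reads per (exon, start offset) using the map and reduce stages."""
    return reduce_by_key(map(map_read, reads))


def exon_profile(counts: Dict[Read, int], exon_id: str, exon_length: int) -> List[int]:
    """Lay the counts for one exon out as a per-position read-start profile."""
    profile = [0] * exon_length
    for (eid, offset), n in counts.items():
        if eid == exon_id and 0 <= offset < exon_length:
            profile[offset] = n
    return profile
```

A per-position profile of this kind is one possible starting point for measuring how far the read distribution within an exon departs from uniformity; the deviation measure itself is left out because the abstract does not specify it.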

The wild-type D. melanogaster data indicate a variation due to motif GC content that is more significant than the variation due to exon GC content. The spread of correlations differs clearly between the two datasets, suggesting more variability than one would expect, which we show to be the result of sequencing errors. The H. sapiens IVT-plasmids sample, which was the control and had no ribosomal selection applied during RNA-Seq preparation, showed the least intra-exon deviation in the distribution of reads mapped to exons. We demonstrate that the dependence of intra-exon correlations on GC content appears to be due to mRNA selection methods, techniques that are routinely employed in RNA-Seq experiments.
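The GC content referred to above, whether of a short motif or of a whole exon, is conventionally the fraction of bases that are G or C. A minimal sketch of that calculation follows; the function name is an assumption, and the fraction is taken over the full sequence length, which may differ from how ambiguous bases are handled in the thesis.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C (case-insensitive).

    The fraction is taken over the full sequence length, so ambiguous
    bases such as N count towards the denominator. An empty sequence
    yields 0.0 rather than raising an error.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc = upper.count("G") + upper.count("C")
    return gc / len(upper)
```

The same function applies to a motif and to an exon sequence, so motif GC content and exon GC content can be computed consistently before either is compared with read-distribution statistics.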

Our system is extremely scalable and suitable for systematic study of large, high-throughput sequencing datasets.

Original language	English
Qualification	Ph.D.
Awarding Institution	Royal Holloway, University of London
Supervisors/Advisors	Hugh Shanahan (Supervisor), Gregory Gutin (Advisor)
Award date	1 Jun 2018
Publication status	Unpublished - 2018

Keywords

High-Throughput sequencing
Next-generation Sequencing
NGS
Transcriptomics
Genomics
MapReduce
Hadoop
Spark
Bioinformatics
Computational Biology
Batch-scheduled Computing
Cluster Computing
DNA-Sequencing
RNA-Sequencing
Nanopore-Sequencing
Metadata
Sequence Read Archive
High Performance Computing
Protein Databank
PDB
Nucleic Acid Sequencing
Bias
Workflows
Pipelines
HDFS
LSF
Openlava
Distributed Computing
IVT
In-vitro transcription
Drosophila melanogaster
Homo sapiens
PCR
Illumina
YARN
Apache
molecular docking
torsional angles
dihedral angles
Random Hexamer priming
GC content
mRNA selection
Read distribution
Vina
Structural Alignment
Sequence alignment
Sequencing workflows
sequence-specific bias

Access to Document

2018alnasirjphd.pdf (Other version, 5.27 MB; Licence: Not specified)

Cite this

@phdthesis{c523fb1939cf43ab92847a58d4d646fa,

title = "The Analysis of High-Throughput Biological Datasets Utilising Distributed Computing",

author = "Jamie Alnasir",

year = "2018",

language = "English",

school = "Royal Holloway, University of London",

}
