The Analysis of High-Throughput Biological Datasets Utilising Distributed Computing

Jamie Alnasir

Research output: Thesis (Doctoral Thesis)



This thesis explores the analysis of high-throughput biological datasets using distributed computing, with a focus on sequencing data, whose volume is increasing at an unprecedented rate. These large, complex datasets are routinely deposited in public archives such as the Sequence Read Archive (SRA) - as of January 2017, the SRA alone contained over a petabyte of data.

We conduct a detailed literature review of the biochemical protocol steps applied in preparing nucleic acid samples for sequencing, and describe the bias that these workflow steps can introduce at the molecular level.

Investigating sequencing metadata, we quantified the level of annotation of 29,958 experiments deposited in the SRA by searching for keywords associated with three protocol steps: fragmentation, ligation and enrichment. We found that 7.10%, 5.84% and 7.57% of all records had at least one keyword corresponding to fragmentation, ligation and enrichment, respectively. Only 5.58% of all SRA records had annotation for all three steps.
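The keyword-based quantification described above can be sketched roughly as follows. This is a minimal illustration only, not the procedure used in the thesis: the keyword lists, the function names and the assumption that each record is a free-text protocol description are all hypothetical.

```python
# Hypothetical keyword stems per protocol step. The search terms actually
# used in the thesis are not given in this abstract.
STEP_KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "shear"],
    "ligation": ["ligat", "adapter"],
    "enrichment": ["enrich", "pcr", "amplif"],
}


def annotated_steps(protocol_text):
    """Return the set of steps with at least one keyword in the text."""
    text = protocol_text.lower()
    return {
        step
        for step, keywords in STEP_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def annotation_summary(records):
    """Percentage of records annotated for each step and for all steps."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarise")
    counts = {step: 0 for step in STEP_KEYWORDS}
    all_steps = 0
    for text in records:
        steps = annotated_steps(text)
        for step in steps:
            counts[step] += 1
        if len(steps) == len(STEP_KEYWORDS):
            all_steps += 1
    summary = {step: 100.0 * n / total for step, n in counts.items()}
    summary["all_steps"] = 100.0 * all_steps / total
    return summary
```

Simple substring matching of this kind is easy to audit, but it can over-count (for example, "fragment" also matches "fragments were size selected"), so any real keyword list would need checking against a sample of records.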

In researching the use of Hadoop in structural biology, we tested MapReduce for processing semi-structured data in the Protein Data Bank (PDB). We also compared Hadoop with a batch scheduler for executing molecular docking and structural analysis jobs, and found it to be competitive.

Finally, we develop an analysis system using MapReduce on Spark that quantifies sequence-specific deviations in the distribution of mapped RNA-Seq reads. We apply this to analyse data from two organisms. First, two transcriptomes of the fruit fly D. melanogaster, sequenced in the same lab and differing only by a mutation [gl60j] of the eye antennal disc. Second, three samples from H. sapiens, prepared in a controlled way using in-vitro transcription (IVT), with different RNA-Seq preparatory protocols applied.
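To illustrate the map/reduce pattern that such a system relies on, the sketch below counts read-leading motifs in independent partitions and merges the partial counts, then groups them by motif GC content. It is written in plain Python rather than Spark so that it runs without a cluster, and it is not the thesis's implementation: the motif definition (the first K bases of a read), the value of K and all function names are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

K = 4  # assumed motif length, chosen only for illustration


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq) if seq else 0.0


def map_partition(reads):
    """Map step: count the leading K-mer of each read in one partition."""
    return Counter(read[:K].upper() for read in reads if len(read) >= K)


def merge_counts(left, right):
    """Reduce step: merge two partial motif counts."""
    return left + right


def motif_counts(partitions):
    """Apply the map step to every partition and reduce the results."""
    return reduce(merge_counts, map(map_partition, partitions), Counter())


def counts_by_gc(counts):
    """Group motif counts by the GC content of the motif."""
    grouped = Counter()
    for motif, n in counts.items():
        grouped[gc_content(motif)] += n
    return dict(grouped)
```

Because each partition is processed independently and the merge is associative, the same structure maps naturally onto a distributed engine, where partitions would live on different worker nodes.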

The wild-type D. melanogaster data indicate a variation due to motif GC content that is more significant than that due to exon GC content. There is a clear variation in the spread of correlations between the two data sets, suggesting more variability than one would expect, which we show to be the result of sequencing errors. The H. sapiens IVT-plasmids sample, which was the control and had no ribosomal selection applied in RNA-Seq preparation, showed the least intra-exon deviation in the distribution of mapped reads. Our analysis suggests that the dependence of intra-exon correlations on GC content is due to mRNA selection methods - techniques that are routinely employed in RNA-Seq experiments.

Our system is extremely scalable and suitable for systematic study of large, high-throughput sequencing datasets.
Original language: English
Awarding Institution: Royal Holloway, University of London
Supervisors/Advisors
  • Shanahan, Hugh, Supervisor
  • Gutin, Gregory, Advisor
Award date: 1 Jun 2018
Publication status: Unpublished - 2018


Keywords
  • High-Throughput sequencing
  • Next-generation Sequencing
  • NGS
  • Transcriptomics
  • Genomics
  • MapReduce
  • Hadoop
  • Spark
  • Bioinformatics
  • Computational Biology
  • Batch-scheduled Computing
  • Cluster Computing
  • DNA-Sequencing
  • RNA-Sequencing
  • Nanopore-Sequencing
  • Metadata
  • Sequence Read Archive
  • High Performance Computing
  • Protein Databank
  • PDB
  • Nucleic Acid Sequencing
  • Bias
  • Workflows
  • Pipelines
  • HDFS
  • LSF
  • Openlava
  • Distributed Computing
  • IVT
  • In-vitro transcription
  • Drosophila melanogaster
  • Homo sapiens
  • PCR
  • Illumina
  • YARN
  • Apache
  • molecular docking
  • torsional angles
  • dihedral angles
  • Random Hexamer priming
  • GC content
  • mRNA selection
  • Read distribution
  • Vina
  • Structural Alignment
  • Sequence alignment
  • Sequencing workflows
  • sequence-specific bias
