Transcriptomics: Leveraging a MapReduce algorithm and Python for gene-expression analysis on Apache Spark

Jamie Alnasir, Hugh Shanahan (Editor)

Research output: Contribution to conference › Abstract › peer-review


MatBio’16 (Mathematical Foundations in Bioinformatics), held at King's College London and organised by PhD students of the Algorithms & Bioinformatics group, covers combinatorics, statistics and general mathematical principles for solving problems in Bioinformatics and is co-sponsored by the London Mathematical Society.
This talk uses mathematical notation and diagrams to describe MapReduce and our implementation of it in Hercules, the algorithm we have devised to study k-mer motifs and uniformity of read coverage in Transcriptomics data.
RNA-Seq (RNA Sequencing) is a next-generation, high-throughput sequencing technology employed in the field of Transcriptomics. A common task in transcriptomics is identifying those transcripts whose expression abundance is altered by experimental conditions that differ between sets of samples. This task typically employs complex computational methods to quantify expression levels for the observed transcripts.
RNA-Seq enables researchers in the biomedical and basic science fields to study various aspects of the transcriptome, from alternative splicing isoforms and post-transcriptional modifications to mutations and gene expression. RNA-Seq and other next-generation sequencing techniques are prone to biases that may be introduced at a number of steps in a typical sequencing workflow, as well as by downstream computational methods. This issue has been covered in detail in the authors’ paper entitled “Investigation into the annotation of protocol sequencing steps in the sequence read archive”. Following on from this, we have devised a distributed MapReduce algorithm (Hercules) to analyse transcriptomics data; in particular, we are interested in examining non-uniform gene expression.
Data generated from transcriptomic studies tend to be large and complex, that is, big data. Delaney characterised such datasets as possessing volume, velocity and variety. To study biases in the data we therefore use distributed computing and employ big data analytics tools routinely used in industry (Apache Spark). A distributed programming paradigm (MapReduce) is central to the implementation of the Hercules algorithm. To address part of the bias issue we have devised a methodology for quantifying and assessing the non-uniform coverage of aligned transcriptomics reads within the exons of a transcriptome, given genome annotation and aligned-read data.
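The MapReduce idea behind per-exon coverage can be sketched in plain Python. This is only an illustration, not the Hercules implementation: the input layout (`READS`, `EXON_LENGTHS`), the function names and the choice of the coefficient of variation as a uniformity measure are all assumptions made for the example. On Apache Spark the same two steps would typically be expressed with the RDD operations `flatMap` and `reduceByKey`.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical input: reads already aligned and assigned to an exon.
# Each read is (exon_id, start offset within the exon, read length).
READS = [
    ("exon1", 0, 4),
    ("exon1", 2, 4),
    ("exon2", 0, 3),
    ("exon2", 0, 3),
]
# Hypothetical annotation: exon lengths in bases.
EXON_LENGTHS = {"exon1": 6, "exon2": 3}


def map_read(read):
    """Map step: emit ((exon_id, position), 1) for every base a read covers."""
    exon_id, start, length = read
    end = min(start + length, EXON_LENGTHS[exon_id])
    return [((exon_id, pos), 1) for pos in range(start, end)]


def reduce_by_key(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def coverage_cv(depths):
    """Coefficient of variation of per-base depth (assumed metric).

    0.0 means the coverage is perfectly uniform across the exon.
    """
    avg = mean(depths)
    return pstdev(depths) / avg if avg else 0.0


def exon_uniformity(reads):
    """Run map, then reduce, then score each exon's coverage profile."""
    pairs = [pair for read in reads for pair in map_read(read)]
    depth = reduce_by_key(pairs)
    result = {}
    for exon_id, length in EXON_LENGTHS.items():
        depths = [depth.get((exon_id, pos), 0) for pos in range(length)]
        result[exon_id] = coverage_cv(depths)
    return result


if __name__ == "__main__":
    for exon_id, cv in exon_uniformity(READS).items():
        print(exon_id, round(cv, 3))
```

In this toy data the two reads on `exon2` overlap completely, so its depth is flat and the score is zero, whereas the staggered reads on `exon1` give a non-zero score.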
We have applied Hercules to the analysis of the highly annotated Drosophila melanogaster transcriptome. We investigated how the GC content of particular motifs, the median GC content of a given exon and the random hexamer primer effects described by Hansen et al. influence uniformity of read coverage, which is crucial to the measurement of gene expression. We will show our preliminary results and discuss how we developed the analyses, along with the techniques and technologies employed.
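As a rough illustration of the sequence features mentioned above, the following Python sketch counts hexamers and computes GC content. The abstract does not say how Hercules defines these quantities, so the function names and, in particular, the reading of "median GC content" as the median over sliding k-mer windows are assumptions made for the example.

```python
from collections import Counter
from statistics import median


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def kmers(seq, k=6):
    """All overlapping k-mers of seq, left to right (hexamers by default)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, k=6):
    """Count how often each k-mer motif occurs in seq."""
    return Counter(kmers(seq.upper(), k))


def median_window_gc(seq, k=6):
    """Median GC fraction over sliding k-mer windows (assumed definition)."""
    windows = kmers(seq, k)
    if not windows:
        return gc_fraction(seq)
    return median([gc_fraction(window) for window in windows])


if __name__ == "__main__":
    exon = "ACGTACGTAC"  # made-up sequence, not real Drosophila data
    print(kmer_counts(exon).most_common(1))
    print(median_window_gc(exon))
```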
Original language: English
Publication status: Published - 20 Jul 2016
Event: MatBio '16 - King's College London, London, United Kingdom
Duration: 20 Jul 2016 → 20 Jul 2016


Conference: MatBio '16
Country/Territory: United Kingdom


  • mathematics
  • mapreduce
  • map-reduce
  • python
  • bioinformatics
  • algorithms
  • distributed algorithms
  • transcriptomics
  • drosophila
