Transcriptomics on Spark Workshop – Introducing Hercules – an Apache Spark MapReduce algorithm for quantifying non-uniform gene expression

Jamie Alnasir, Hugh Shanahan

Research output: Contribution to conference › Abstract › peer-review


CloudTech’16 covers cloud technologies, architecture and applications, including distributed computing and data centres, cloud infrastructure and its security, end-user services, and big data and its applications. This workshop talk focuses on the application of Apache Spark to transcriptomics, an important area of research with a diverse range of applications across a number of fields. The workshop concentrates on how we have applied Apache Spark to analyse transcriptomics data and describes, step by step, a distributed algorithm we have devised called Hercules.

A cell's transcriptome is defined as the sum total of all the messenger RNA molecules expressed from the genes of an organism. It is highly dynamic and in a constant state of flux as a result of intra- and extra-cellular stimuli as well as disease pathology. RNA-Seq (RNA sequencing) is a next-generation, high-throughput sequencing technology that enables researchers in the biomedical and basic science fields to study various aspects of the transcriptome, from alternative splicing isoforms and post-transcriptional modifications to mutations and gene expression. Such studies rely on a common task in transcriptomics: identifying those transcripts whose expression abundance is altered by experimental conditions that differ between sets of samples. This task typically employs complex computational methods to quantify expression levels for the observed transcripts.

RNA-Seq and other next-generation sequencing techniques are prone to biases that may be introduced at a number of steps in a typical sequencing workflow, as well as by downstream computational methods. This issue is covered in detail in the authors’ paper “Investigation into the annotation of protocol sequencing steps in the sequence read archive”.

In addition to the problems posed by potential biases in the data, data generated from transcriptomic studies tend to be large and complex, that is, big data. Laney characterised such datasets as possessing volume, velocity and variety. We use distributed computing and employ big data analytics tools routinely used in industry (Apache Spark) to address this challenge. Hercules therefore applies a distributed programming paradigm (MapReduce), which is central to its implementation. To address part of the bias issue, we have devised a methodology for quantifying and quality-assessing non-uniform coverage of aligned transcriptomics reads within the exons of a transcriptome, given genome annotation and aligned read data. We have applied Hercules to the analysis of the highly annotated Drosophila melanogaster transcriptome; we will show our preliminary results and outline the future leads we will pursue in developing the analyses.
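To make the MapReduce idea concrete, the following is a minimal sketch of how read coverage within exons could be counted and summarised. It is not the Hercules implementation: it runs in plain Python rather than on Apache Spark, and the toy exon and read data, the binning of each exon into `N_BINS` segments, and the use of the coefficient of variation as a non-uniformity score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy annotation: (exon_id, chromosome, start, end), end exclusive.
EXONS = [("exonA", "2L", 100, 200), ("exonB", "2L", 300, 400)]

# Hypothetical aligned reads, reduced to (chromosome, start position).
READS = [("2L", 105), ("2L", 110), ("2L", 195),
         ("2L", 310), ("2L", 350), ("2L", 390),
         ("2L", 250)]  # the last read falls outside every exon

N_BINS = 4  # assumed number of equal-width segments per exon


def map_read(read):
    """Map step: emit ((exon_id, bin_index), 1) for each exon containing the read."""
    chrom, pos = read
    for exon_id, exon_chrom, start, end in EXONS:
        if chrom == exon_chrom and start <= pos < end:
            bin_index = (pos - start) * N_BINS // (end - start)
            yield (exon_id, bin_index), 1


def reduce_by_key(pairs):
    """Reduce step: sum the emitted counts for each (exon_id, bin_index) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def coverage_profiles(counts):
    """Arrange the reduced counts into one list of bin counts per exon."""
    profiles = {exon_id: [0] * N_BINS for exon_id, _, _, _ in EXONS}
    for (exon_id, bin_index), n in counts.items():
        profiles[exon_id][bin_index] = n
    return profiles


def coefficient_of_variation(profile):
    """Assumed non-uniformity score: population std dev divided by the mean."""
    m = mean(profile)
    return pstdev(profile) / m if m else 0.0


mapped = [pair for read in READS for pair in map_read(read)]
counts = reduce_by_key(mapped)
profiles = coverage_profiles(counts)

for exon_id, profile in profiles.items():
    print(exon_id, profile, round(coefficient_of_variation(profile), 3))
```

In a Spark setting, the map and reduce functions above would correspond loosely to a `flatMap` followed by a `reduceByKey` over a distributed collection of reads; a higher score here simply indicates that reads are spread less evenly across the exon.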

CloudTech’16 is an IEEE event involving the British Council, EMC2, Vrije Universiteit Brussel and Université de Mons.
Original language: English
Publication status: Published - 25 May 2016
Event: CloudTech'16 - Marrakech, Morocco
Duration: 24 May 2016 to 25 May 2016




  • transcriptomics
  • bigdata
  • mapreduce
  • spark
  • distributed algorithms
  • distributed computing
  • bioinformatics
  • drosophila
