Transcriptomics: Quantifying Non-Uniform Read Distribution Using MapReduce

Jamie Alnasir, Hugh Shanahan

Research output: Contribution to journal › Article › peer-review


RNA-seq is a high-throughput next-generation sequencing technique for estimating the concentration of all transcripts in a transcriptome. The method involves complex preparatory and post-processing steps that can introduce bias, and it produces a large amount of data [7, 19]. Two important challenges in processing RNA-seq data are therefore handling this volume of data and quantifying the bias in public RNA-seq datasets. We describe a novel analysis method, based on sequence motif correlations, that employs MapReduce on Apache Spark to quantify bias in next-generation sequencing (NGS) data at the deep exon level. Our implementation is designed specifically for processing large datasets and allows for scalability and deployment on cloud service providers offering MapReduce. In investigating wild-type and mutant samples of D. melanogaster, we found that motifs with runs of Gs (or their complement) exhibit low motif-pair correlations in comparison with other motif pairs. This is independent of the mean exon GC content in the wild-type data, but there is a mild dependence in the mutant data. Hence, whilst both datasets show the same trends, there is significant variation between the two samples.
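The abstract does not give implementation details, so the following is only a minimal sketch of the general MapReduce pattern applied to motif counting, not the authors' method. It uses plain Python rather than Apache Spark, assumes a motif length of 4 (suggested by the "4-mers" keyword), and runs on made-up reads; the function names `map_read`, `reduce_counts` and `count_motifs` are hypothetical.

```python
from collections import Counter
from functools import reduce

K = 4  # assumed motif length, taken from the "4-mers" keyword


def map_read(read, k=K):
    """Map step: emit a (motif, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(acc, pair):
    """Reduce step: add each emitted count to the running total per motif."""
    motif, n = pair
    acc[motif] += n
    return acc


def count_motifs(reads, k=K):
    """Run the map step over all reads, then fold the pairs into one Counter."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return reduce(reduce_counts, pairs, Counter())


# Hypothetical toy reads, not data from the paper.
reads = ["ACGTACGT", "GGGGACGT"]
counts = count_motifs(reads)
print(counts.most_common(3))
```

On a cluster, the same two steps would typically be expressed with a framework's own map and reduce-by-key operations, so that the per-read work is distributed across workers. Motif-pair correlations, which the abstract analyses, would need a further step beyond these raw counts.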
Original language: English
Number of pages: 20
Journal: International Journal for the Foundations of Computer Science
Issue number: 8
Publication status: Published - 27 Dec 2018


  • rna-seq
  • mapreduce
  • transcriptomics
  • 4-mers
  • motif analysis
  • drosophila
