Transcriptomics: Leveraging a MapReduce algorithm and Python for gene-expression analysis on Apache Spark

Jamie Alnasir, Hugh Shanahan (Editor)

Research output: Contribution to conference › Abstract › peer-review

Abstract

MatBio’16 (Mathematical Foundations in Bioinformatics), held at King's College London and organised by PhD students of the Algorithms & Bioinformatics group, covers combinatorics, statistics and general mathematical principles for solving problems in bioinformatics, and is co-sponsored by the London Mathematical Society.

This talk uses mathematical notation and diagrams to describe MapReduce and our implementation of it in Hercules, the algorithm we have devised to study k-mer motifs and uniformity of read coverage in transcriptomics data.
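As a rough illustration of the MapReduce pattern applied to k-mer motifs, the sketch below counts k-mers across a handful of reads in plain Python. It is not the Hercules implementation: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the toy reads are assumptions made for illustration, and a framework such as Apache Spark would distribute the map, shuffle and reduce stages across workers rather than run them in one process.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window of a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework's shuffle stage would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: reduce(lambda a, b: a + b, values)
            for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in sequence over an iterable of reads."""
    emitted = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(emitted))


# Toy reads (hypothetical, not real sequencing data).
reads = ["ACGTAC", "GTACGT"]
kmer_counts = count_kmers(reads, 3)
print(kmer_counts)
```

Because each read is mapped independently and the reduction is a simple sum, the same logic can in principle be partitioned across many machines without changing the result.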

RNA-Seq (RNA sequencing) is a next-generation, high-throughput sequencing technology employed in the field of transcriptomics. A common task in transcriptomics is identifying transcripts whose expression abundance is altered by experimental conditions that differ between sets of samples; this typically requires complex computational methods to quantify expression levels for the observed transcripts.
RNA-Seq enables researchers in the biomedical and basic sciences to study various aspects of the transcriptome, from alternative splicing isoforms and post-transcriptional modifications to mutations and gene expression. RNA-Seq and other next-generation sequencing techniques are prone to biases that may be introduced at a number of steps in a typical sequencing workflow, as well as by downstream computational methods. This issue is covered in detail in the authors’ paper “Investigation into the annotation of protocol sequencing steps in the sequence read archive”. Following on from this, we have devised a distributed MapReduce algorithm (Hercules) to analyse transcriptomics data; in particular, we are interested in examining non-uniform gene expression.

Data generated from transcriptomic studies tend to be large and complex, that is, big data. Delaney characterised such datasets as possessing volume, velocity and variety. To study biases in the data we therefore use distributed computing and employ big data analytics tools routinely used in industry (Apache Spark). The distributed programming paradigm MapReduce is central to the implementation of the Hercules algorithm. To address part of the bias issue, we have devised a methodology for quantifying and quality-assessing non-uniform coverage of aligned transcriptomics reads within the exons of a transcriptome, given genome annotation and aligned read data.
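One way to make "non-uniform coverage within an exon" concrete is to compute the per-base read depth across the exon and summarise its spread. The sketch below uses the coefficient of variation as the summary; this choice of metric, the half-open coordinate convention and the names `exon_coverage` and `coefficient_of_variation` are assumptions for illustration, since the abstract does not state which measure Hercules uses.

```python
from statistics import mean, pstdev


def exon_coverage(exon_start, exon_end, reads):
    """Per-base read depth over the half-open exon interval [exon_start, exon_end).

    `reads` is an iterable of (read_start, read_end) half-open intervals
    on the same reference sequence as the exon.
    """
    depth = [0] * (exon_end - exon_start)
    for read_start, read_end in reads:
        # Clip each read to the exon before counting the bases it covers.
        lo = max(read_start, exon_start)
        hi = min(read_end, exon_end)
        for pos in range(lo, hi):
            depth[pos - exon_start] += 1
    return depth


def coefficient_of_variation(depth):
    """Standard deviation of depth divided by its mean; 0.0 means perfectly uniform."""
    if not depth:
        return float("nan")
    mu = mean(depth)
    return pstdev(depth) / mu if mu > 0 else float("nan")


# Hypothetical exon of 10 bases with two different read layouts.
uniform_cv = coefficient_of_variation(exon_coverage(100, 110, [(100, 110), (95, 120)]))
skewed_cv = coefficient_of_variation(exon_coverage(100, 110, [(100, 105)]))
print(uniform_cv, skewed_cv)
```

In a MapReduce setting the per-exon depth vectors would typically be built by keying aligned reads on the exon they overlap, so that each exon can be summarised independently in the reduce step.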

We have applied Hercules to the analysis of the highly annotated Drosophila melanogaster transcriptome. We investigated how the GC content of particular motifs, the median GC content of a given exon and the random hexamer primer effects described by Hansen et al. influence uniformity of read coverage, which is crucial to the measurement of gene expression. We will show our preliminary results and discuss how we developed the analyses and the techniques and technologies employed.
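The GC-content quantities mentioned above can be sketched as follows. GC content is taken here as the fraction of G and C bases in a sequence; reading "median GC content of a given exon" as the median over all length-k windows of the exon is an assumption, as are the function names and the toy sequence.

```python
from statistics import median


def gc_content(seq):
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def median_window_gc(exon_seq, k):
    """Median GC fraction over all length-k windows of an exon sequence.

    Assumes len(exon_seq) >= k so that at least one window exists.
    """
    windows = [exon_seq[i:i + k] for i in range(len(exon_seq) - k + 1)]
    return median(gc_content(w) for w in windows)


# Hypothetical six-base sequence, far shorter than a real exon.
motif_gc = gc_content("GGCATA")
exon_median_gc = median_window_gc("GGCATA", 2)
print(motif_gc, exon_median_gc)
```

Values like these could then be compared against a per-exon coverage uniformity measure to look for an association between base composition and coverage.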

Original language	English
Publication status	Published - 20 Jul 2016
Event	MatBio '16 - King's College London, London, United Kingdom
Duration	20 Jul 2016 → 20 Jul 2016

Conference

Conference	MatBio '16
Country/Territory	United Kingdom
City	London
Period	20/07/16 → 20/07/16

Keywords

mathematics
mapreduce
map-reduce
python
bioinformatics
algorithms
distributed algorithms
transcriptomics
drosophila
