Transcriptomics on Spark Workshop – Introducing Hercules – an Apache Spark MapReduce algorithm for quantifying non-uniform gene expression

Jamie Alnasir; Hugh Shanahan


Research output: Contribution to conference › Abstract › peer-review

Abstract

CloudTech’16 covers cloud technologies, architecture and applications, including distributed computing and data centres, cloud infrastructure and its security, end-user services, and big data and its applications. This workshop talk focuses on the application of Apache Spark to transcriptomics, an important area of research with a diverse range of applications across a number of fields. The workshop concentrates on how we have applied Apache Spark to analyse transcriptomics data and describes, step by step, a distributed algorithm we have devised called Hercules.

A cell's transcriptome is defined as the sum total of all the messenger RNA molecules expressed from the genes of an organism. It is highly dynamic and in a constant state of flux as a result of intra- and extra-cellular stimuli as well as disease pathology. RNA-Seq (RNA sequencing) is a next-generation, high-throughput sequencing technology that enables researchers in the biomedical and basic science fields to study various aspects of the transcriptome, from alternative splicing isoforms and post-transcriptional modifications to mutations and gene expression. Such studies rely on a common task in transcriptomics: identifying those transcripts whose expression abundance is altered by experimental conditions that differ between sets of samples. This task typically employs complex computational methods to quantify expression levels for the observed transcripts.

RNA-Seq and other next-generation sequencing techniques are prone to biases that may be introduced at a number of steps in a typical sequencing workflow, as well as by downstream computational methods. This issue is covered in detail in the authors’ paper “Investigation into the annotation of protocol sequencing steps in the sequence read archive”.

In addition to the problems posed by potential biases, the data generated by transcriptomic studies tend to be large and complex, that is, big data. Laney characterised such datasets as possessing volume, velocity and variety. To address this challenge we use distributed computing and a big data analytics tool routinely used in industry, Apache Spark; the distributed programming paradigm MapReduce is therefore central to the implementation of Hercules. To address part of the bias issue, we have devised a methodology for quantifying and quality-assessing non-uniform coverage of aligned transcriptomics reads within the exons of a transcriptome, given genome annotation and aligned-read data. We have applied Hercules to the analysis of the highly annotated Drosophila melanogaster transcriptome; we will show our preliminary results and outline the future leads we will pursue in developing the analyses.
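
The abstract does not spell out the individual steps of Hercules, so the following is only a minimal illustrative sketch of how a MapReduce-style computation of within-exon read coverage could look, not the authors' implementation. Everything in it is an assumption made for illustration: the toy exon table, the toy read start positions, the fixed number of bins per exon, and the use of the coefficient of variation as a non-uniformity score. It is written in plain Python so it runs without a cluster; in Apache Spark the map step would typically correspond to an RDD `flatMap` and the reduce step to `reduceByKey`.

```python
from collections import defaultdict

# Hypothetical toy inputs (not from the abstract).
# Exon annotation: exon_id -> (chromosome, start, end), half-open interval.
EXONS = {
    "exonA": ("chr2L", 100, 140),
    "exonB": ("chr2L", 200, 240),
}
# Aligned reads reduced to (chromosome, start position).
READ_STARTS = [
    ("chr2L", 100), ("chr2L", 105), ("chr2L", 112), ("chr2L", 139),
    ("chr2L", 200), ("chr2L", 210), ("chr2L", 220), ("chr2L", 230),
    ("chr2L", 150),  # falls outside every exon, so it emits nothing
]
N_BINS = 4  # assumed number of equal-width bins per exon


def map_read(read, exons, n_bins):
    """Map step: emit ((exon_id, bin_index), 1) for each exon containing the read start."""
    chrom, start = read
    for exon_id, (e_chrom, e_start, e_end) in exons.items():
        if chrom == e_chrom and e_start <= start < e_end:
            bin_index = (start - e_start) * n_bins // (e_end - e_start)
            yield (exon_id, bin_index), 1


def reduce_by_key(pairs):
    """Reduce step: sum the emitted values for each (exon_id, bin_index) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def coverage_profile(counts, exon_id, n_bins):
    """Collect the per-bin counts of one exon into an ordered list."""
    return [counts.get((exon_id, b), 0) for b in range(n_bins)]


def coefficient_of_variation(profile):
    """Assumed non-uniformity score: population std / mean (0.0 means uniform)."""
    mean = sum(profile) / len(profile)
    if mean == 0:
        return 0.0
    variance = sum((x - mean) ** 2 for x in profile) / len(profile)
    return variance ** 0.5 / mean


def analyse(reads, exons, n_bins):
    """Run map and reduce, then score each exon's coverage profile."""
    pairs = (pair for read in reads for pair in map_read(read, exons, n_bins))
    counts = reduce_by_key(pairs)
    result = {}
    for exon_id in exons:
        profile = coverage_profile(counts, exon_id, n_bins)
        result[exon_id] = (profile, coefficient_of_variation(profile))
    return result


if __name__ == "__main__":
    for exon_id, (profile, cv) in analyse(READ_STARTS, EXONS, N_BINS).items():
        print(exon_id, profile, round(cv, 3))
```

Keying the emitted pairs on (exon, bin) keeps the reduce step a plain associative sum, which is what lets such a computation be partitioned across workers. The linear scan over exons in the map step is only suitable for toy data; a real annotation would need an interval index.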

CloudTech’16 is an IEEE event involving the British Council, EMC2, Vrije Universiteit Brussel and Université de Mons.

Original language	English
Publication status	Published - 25 May 2016
Event	CloudTech'16 - Morocco, Marrakech, Morocco Duration: 24 May 2016 → 25 May 2016

Conference

Conference	CloudTech'16
Country/Territory	Morocco
City	Marrakech
Period	24/05/16 → 25/05/16

Keywords

transcriptomics
bigdata
mapreduce
spark
distributed algorithms
distributed computing
bioinformatics
drosophila

Access to Document

http://www.macc.ma/cloudtech16/Abstract_Transcriptomics-on-Spark-Workshop.doc
Licence: Not specified
