A Novel Method to Detect Bias in Short Read NGS Data

Jamie Alnasir; Hugh Shanahan

doi:10.1515/jib-2017-0025

Research output: Contribution to journal › Article › peer-review

Abstract

Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. There is a clear variation in the spread of correlations between the two data sets, suggesting more variability in these data sets than one would expect.

Original language	English
Pages (from-to)	1-9
Number of pages	9
Journal	Journal of Integrative Bioinformatics
Volume	14
Issue number	3
Early online date	23 Sept 2017
DOIs	https://doi.org/10.1515/jib-2017-0025
Publication status	Published - 2017

Keywords

NGS
BIAS
Spark
Hadoop

Access to Document

10.1515/jib-2017-0025 (Licence: CC BY-NC-ND)

Cite this

@article{5e1de062cf9648b3b98f2cbc7ecaf0ab,
title = "A Novel Method to Detect Bias in Short Read NGS Data",
abstract = "Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. There is a clear variation in the spread of correlations between the two data sets, suggesting more variability in these data sets than one would expect.",
keywords = "NGS, BIAS, Spark, Hadoop",
author = "Jamie Alnasir and Hugh Shanahan",
year = "2017",
doi = "10.1515/jib-2017-0025",
language = "English",
volume = "14",
pages = "1--9",
journal = "Journal of Integrative Bioinformatics",
issn = "1613-4516",
publisher = "Informationsmanagement in der Biotechnologie e.V. (IMBio e.V.)",
number = "3",
}

TY  - JOUR
T1  - A Novel Method to Detect Bias in Short Read NGS Data
AU  - Alnasir, Jamie
AU  - Shanahan, Hugh
PY  - 2017
Y1  - 2017
N2  - Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. There is a clear variation in the spread of correlations between the two data sets, suggesting more variability in these data sets than one would expect.
AB  - Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. There is a clear variation in the spread of correlations between the two data sets, suggesting more variability in these data sets than one would expect.
KW  - NGS
KW  - BIAS
KW  - Spark
KW  - Hadoop
U2  - 10.1515/jib-2017-0025
DO  - 10.1515/jib-2017-0025
M3  - Article
SN  - 1613-4516
VL  - 14
SP  - 1
EP  - 9
JO  - Journal of Integrative Bioinformatics
JF  - Journal of Integrative Bioinformatics
IS  - 3
ER  -