Intra-exon motif correlations as a proxy measure for mean per-tile sequence quality data in RNA-Seq

Jamie J. Alnasir, Hugh P. Shanahan

Research output: Contribution to journal › Article › peer-review

Abstract

Given the wide variability in the quality of NGS data submitted to public repositories, it is essential to
identify methods that can perform quality control on these datasets when additional quality control
data, such as mean tile data, are missing. This is particularly important because such datasets are
routinely deposited in public archives that now store data at an unprecedented scale. In this paper,
we show that correlations between the counts of reads corresponding to pairs of motifs separated by
specific distances on individual exons track the mean tile data in the datasets we analysed, and can
therefore be used as a proxy when mean tile data are not available.
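The idea of correlating read counts at motif pairs can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of per-position read-start counts as the count measure, and the use of a plain Pearson coefficient are all assumptions made for illustration, since the abstract does not specify them.

```python
from math import sqrt


def motif_positions(seq, motif):
    """Return every 0-based start index of motif in seq (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # one variable is constant, so the correlation is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def motif_pair_correlation(exons, motif_a, motif_b, distance):
    """Correlate read counts at motif_a with counts at motif_b `distance` bases on.

    exons: list of (sequence, counts) tuples, where counts[i] is the number of
    reads starting at position i of that exon (an assumed count measure).
    Pairs are only formed within a single exon, never across exons.
    """
    xs, ys = [], []
    for seq, counts in exons:
        b_starts = set(motif_positions(seq, motif_b))
        for i in motif_positions(seq, motif_a):
            j = i + distance
            if j in b_starts:
                xs.append(counts[i])
                ys.append(counts[j])
    return pearson(xs, ys)


# Toy exon (hypothetical data): "AC" occurs at 0, 5, 10 and "GT" at 2, 7, 12.
toy_seq = "ACGTAACGTTACGT"
toy_counts = [1, 0, 2, 0, 0, 2, 0, 4, 0, 0, 3, 0, 6, 0]
r = motif_pair_correlation([(toy_seq, toy_counts)], "AC", "GT", 2)
```

In this toy case the counts at the two motifs rise together, so `r` is close to 1. Under the paper's premise, a markedly lower value across many exons would be the kind of signal that prompts a closer look at the sequencing run.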
As test datasets we use the H. sapiens IVT (in vitro transcribed) dataset of Lahens et al. and a
D. melanogaster dataset comprising wild-type and mutant samples from Aerts et al.
The intra-exon motif correlations as a function of both GC content parameters are much higher
in the IVT-Plasmids RNA-Seq sample (the control, which underwent no mRNA selection) than in the
other RNA-Seq samples, which did undergo mRNA selection: either ribosomal depletion (IVT-Only) or
PolyA selection (IVT-polyA, wild-type, and mutant). The equivalent correlations are considerably
degraded in the mutant samples from the D. melanogaster dataset. This matches the available mean tile
data that have been gathered for these datasets. We observe that extremely low correlations are
indicative of bias of technical origin, such as flowcell errors.

Original language: English
Pages (from-to): 131-148
Number of pages: 18
Journal: Journal of Computational Biology
Volume: 30
Issue number: 2
Early online date: 7 Feb 2023
Publication status: Published - Feb 2023

Keywords

  • rna-seq