Investigation into the annotation of protocol sequencing steps in the sequence read archive

Jamie Alnasir, Hugh Shanahan

Research output: Contribution to journal › Article › peer-review

The workflow for producing high-throughput sequencing data from nucleic acid samples is complex. Preparing samples for next-generation sequencing involves a series of protocol steps. The bias introduced by several of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be quantified.
We examined the experimental metadata of the Sequence Read Archive (SRA), a public repository, to ascertain the level of annotation of important sequencing steps in submissions to the database. We ran SQL queries against the SRAdb SQLite database generated by the Bioconductor consortium, searching for keywords that commonly occur in key preparatory protocol steps (fragmentation, ligation and enrichment). Partitioned over studies, 7.10%, 5.84% and 7.57% of all records, respectively, had at least one keyword corresponding to one of the three protocol steps. Only 4.06% of all records, partitioned over studies, had keywords for all three protocol steps (5.58% of all SRA records).
The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on this data will have a source of bias that at present cannot be quantified.
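
The keyword search described above could be sketched roughly as follows. This is a minimal illustration, not the authors' actual queries: the keyword stems, the table name `experiment`, and the column names `library_construction_protocol` and `study_accession` are assumptions that should be checked against the schema of the SRAdb SQLite file in use, and the demo rows are made up. It reports, for a given set of protocol steps, the percentage of records and the percentage of studies whose protocol text matches at least one keyword for every step, which is one possible reading of "partitioned over studies".

```python
import sqlite3

# Hypothetical keyword stems per protocol step (not the paper's actual lists).
KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "shear"],
    "ligation": ["ligat"],
    "enrichment": ["enrich", "amplif"],
}

# Assumed table and column names; verify against the real SRAdb schema.
TABLE = "experiment"
TEXT_COL = "library_construction_protocol"
STUDY_COL = "study_accession"


def like_clause(stems):
    """Return a parenthesised OR of LIKE tests plus its bound parameters."""
    sql = " OR ".join(f"{TEXT_COL} LIKE ?" for _ in stems)
    return f"({sql})", [f"%{s}%" for s in stems]


def percent_annotated(conn, steps):
    """Percent of records, and of studies, matching every step in `steps`."""
    clauses, params = [], []
    for step in steps:
        sql, bound = like_clause(KEYWORDS[step])
        clauses.append(sql)
        params.extend(bound)
    where = " AND ".join(clauses)
    counts = f"SELECT COUNT(*), COUNT(DISTINCT {STUDY_COL}) FROM {TABLE}"
    total_rec, total_std = conn.execute(counts).fetchone()
    # Rows with a NULL protocol text never satisfy LIKE, so they are not counted.
    hit_rec, hit_std = conn.execute(f"{counts} WHERE {where}", params).fetchone()
    return 100.0 * hit_rec / total_rec, 100.0 * hit_std / total_std


def build_demo_db():
    """Create a tiny in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} "
        f"(experiment_accession TEXT, {STUDY_COL} TEXT, {TEXT_COL} TEXT)"
    )
    rows = [
        ("EXP1", "STUDY_A", "Genomic DNA was fragmented by sonication; "
                            "adapters were ligated and the library enriched by PCR."),
        ("EXP2", "STUDY_A", None),
        ("EXP3", "STUDY_B", "DNA sheared with a Covaris instrument."),
        ("EXP4", "STUDY_C", None),
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for step in KEYWORDS:
        print(step, percent_annotated(conn, [step]))
    print("all three steps", percent_annotated(conn, list(KEYWORDS)))
```

To run the same functions against a downloaded SRAdb file, `build_demo_db()` would be replaced by `sqlite3.connect(path_to_file)`. SQLite's `LIKE` is case-insensitive for ASCII text by default, so the stems match regardless of capitalisation.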
Original language: English
Pages (from-to): 1-11
Number of pages: 11
Issue number: 1
Early online date: 9 May 2015
Publication status: Published - Dec 2015


  • Annotation
  • Sequencing
  • Next-generation
  • Ligation
  • Fragmentation
  • Enrichment
  • Protocol
  • Metadata
  • Experiment
