Abstract
Background
Producing high-throughput sequencing data from nucleic acid samples involves a complex workflow, with a series of protocol steps in the preparation of samples for next-generation sequencing. The bias introduced by several of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, has yet to be quantified.
Results
We examined the experimental metadata of the Sequence Read Archive (SRA), a public repository, to ascertain how well important sequencing steps are annotated in submissions to the database. We ran SQL queries against the SRAdb SQLite database generated by the Bioconductor consortium, searching for keywords that commonly occur in descriptions of three key preparatory protocol steps (fragmentation, ligation and enrichment). Partitioned over studies, 7.10%, 5.84% and 7.57% of all records, respectively, had at least one keyword corresponding to the respective step. Only 4.06% of all records, partitioned over studies, had keywords for all three protocol steps (5.58% of all SRA records).
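The keyword search described above can be sketched in a few lines of Python using the standard `sqlite3` module. This is a minimal illustration, not the authors' actual queries: the keyword lists are hypothetical (the abstract does not enumerate them), the table name `experiment` and column name `library_construction_protocol` are assumptions that should be checked against the real SRAdb schema, and the sketch counts over records only, without the per-study partitioning mentioned in the text. It runs against a small in-memory toy table rather than the real database.

```python
import sqlite3

# Hypothetical keyword lists; the abstract does not state which keywords were used.
KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "nebuliz", "shear"],
    "ligation": ["ligat", "adapter", "adaptor"],
    "enrichment": ["enrich", "pcr", "amplif"],
}


def step_condition(column, keywords):
    """Return a parameterised SQL condition that matches any of the keywords."""
    clause = " OR ".join(f"{column} LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    return f"({clause})", params


def annotation_rates(conn, table="experiment",
                     column="library_construction_protocol"):
    """Percentage of records mentioning each step, and all three steps.

    The table and column names are assumptions, not confirmed by the source.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        return {}
    rates, clauses, all_params = {}, [], []
    for step, keywords in KEYWORDS.items():
        clause, params = step_condition(column, keywords)
        n = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {clause}", params
        ).fetchone()[0]
        rates[step] = 100.0 * n / total
        clauses.append(clause)
        all_params.extend(params)
    # A record counts as fully annotated only if every step matches.
    n_all = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE " + " AND ".join(clauses),
        all_params,
    ).fetchone()[0]
    rates["all_three"] = 100.0 * n_all / total
    return rates


def demo_connection():
    """Build an in-memory toy table standing in for the real metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment (experiment_accession TEXT, "
        "study_accession TEXT, library_construction_protocol TEXT)"
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?)",
        [
            ("X1", "S1", "DNA was sonicated, adapters ligated, then PCR amplified"),
            ("X2", "S1", "Fragmented by nebulization"),
            ("X3", "S2", None),  # no protocol text at all
            ("X4", "S2", "standard protocol"),
        ],
    )
    return conn


if __name__ == "__main__":
    for step, pct in annotation_rates(demo_connection()).items():
        print(f"{step}: {pct:.2f}%")
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, and records with a `NULL` protocol field simply fail to match, so unannotated records count toward the denominator only.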
Conclusions
The current level of annotation in the SRA inhibits systematic studies of the bias introduced by these protocol steps. As a consequence, meta-analyses and comparative studies based on these data will carry a source of bias that cannot at present be quantified.
Original language | English |
---|---|
Pages (from-to) | 1-11 |
Number of pages | 11 |
Journal | GigaScience |
Volume | 4 |
Issue number | 1 |
Early online date | 9 May 2015 |
Publication status | Published - Dec 2015 |
Keywords
- Annotation
- Sequencing
- Next-generation
- Ligation
- Fragmentation
- Enrichment
- Protocol
- Metadata
- Experiment