The application of Hadoop in structural bioinformatics

Jamie Alnasir; Hugh Shanahan

doi:10.1093/bib/bby106

Research output: Contribution to journal › Article › peer-review

Abstract

This paper reviews the use of the Hadoop platform in Structural Bioinformatics applications. Specifically, we review a number of Hadoop implementations of high-throughput analyses, e.g. ligand-protein docking and structural alignment, and their scalability in comparison with other batch schedulers and MPI. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. Scalability relative to other batch schedulers is variable, particularly as direct comparisons on the same platform are generally not available. We note there is some evidence that MPI implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and of configuring a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and AWS) increases, and approaches such as the Workflow Definition Language are taken up.
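
The pattern the abstract describes, calling an existing executable from a MapReduce job rather than rewriting the algorithm, can be illustrated with a small mapper in the style of Hadoop Streaming, which passes input records on standard input and collects tab-separated key/value lines from standard output. This is only a sketch under stated assumptions and is not taken from any of the reviewed implementations: the function name `map_records`, the one-record-per-line input layout, and the idea that the wrapped executable takes the record as its last argument are all hypothetical.

```python
import subprocess
import sys
from typing import Iterable, Iterator, List


def map_records(lines: Iterable[str], command: List[str]) -> Iterator[str]:
    """Run `command <record>` once per input record.

    Yields one 'record<TAB>output' line per successful run, which is the
    key/value layout a streaming-style mapper writes to standard output.
    """
    for line in lines:
        record = line.strip()
        if not record:
            continue  # skip blank input lines
        # Call the unmodified external executable; the algorithm itself
        # is not reimplemented in the mapper.
        result = subprocess.run(
            command + [record],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode != 0:
            # Log the failure and move on to the next record.
            print(f"failed: {record}", file=sys.stderr)
            continue
        yield f"{record}\t{result.stdout.strip()}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the executable and its fixed arguments are given on the
    # command line, and the records to process arrive on standard input.
    for output_line in map_records(sys.stdin, sys.argv[1:]):
        print(output_line)
```

Saved as a script, the sketch would be given the name of the wrapped program as its arguments, so the same mapper can front a docking or alignment executable without changes to either. Any aggregation of the per-record outputs would be left to a reducer.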

Original language	English
Article number	bby106
Pages (from-to)	1-10
Number of pages	10
Journal	Briefings in Bioinformatics
DOIs	https://doi.org/10.1093/bib/bby106
Publication status	Published - 20 Nov 2018

Keywords

Structural Bioinformatics
Hadoop
Cloud computing

Cite this

@article{11c3e02673a443a2853e08759e387ba6,

title = "The application of Hadoop in structural bioinformatics",

abstract = "The paper reviews the use of the Hadoop platform in Structural Bioinformatics applications. Specifically we review a number of implementations using Hadoop of high-throughput analyses, e.g. ligand-protein docking and structural alignment, and their scalability in comparison with other batch schedulers and MPI. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits a variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. We note there is some evidence that MPI implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop e.g. Spark improve, usage of cloud platforms (e.g. Azure and AWS) increases and approaches such as the Workflow Definition Language are taken up.",

keywords = "Structural Bioinformatics, Hadoop, Cloud computing",

author = "Jamie Alnasir and Hugh Shanahan",

year = "2018",

month = nov,

day = "20",

doi = "10.1093/bib/bby106",

language = "English",

pages = "1--10",

journal = "Briefings in Bioinformatics",

issn = "1477-4054",

publisher = "Oxford University Press",

}

TY - JOUR

T1 - The application of Hadoop in structural bioinformatics

AU - Alnasir, Jamie

AU - Shanahan, Hugh

PY - 2018/11/20

Y1 - 2018/11/20

N2 - The paper reviews the use of the Hadoop platform in Structural Bioinformatics applications. Specifically we review a number of implementations using Hadoop of high-throughput analyses, e.g. ligand-protein docking and structural alignment, and their scalability in comparison with other batch schedulers and MPI. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits a variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. We note there is some evidence that MPI implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop e.g. Spark improve, usage of cloud platforms (e.g. Azure and AWS) increases and approaches such as the Workflow Definition Language are taken up.

AB - The paper reviews the use of the Hadoop platform in Structural Bioinformatics applications. Specifically we review a number of implementations using Hadoop of high-throughput analyses, e.g. ligand-protein docking and structural alignment, and their scalability in comparison with other batch schedulers and MPI. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits a variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. We note there is some evidence that MPI implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop e.g. Spark improve, usage of cloud platforms (e.g. Azure and AWS) increases and approaches such as the Workflow Definition Language are taken up.

KW - Structural Bioinformatics

KW - Hadoop

KW - Cloud computing

U2 - 10.1093/bib/bby106

DO - 10.1093/bib/bby106

M3 - Article

SN - 1477-4054

SP - 1

EP - 10

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

M1 - bby106

ER -