Improving I/O Efficiency in Hadoop-Based Massive Data Analysis Programs

Kyong Ha Lee, Woo Lam Kang, Young Kyoon Suh

Research output: Contribution to journal › Article › peer-review


Abstract

Apache Hadoop has been a popular parallel processing tool in the era of big data. While practitioners have rewritten many conventional analysis algorithms to adapt them to Hadoop, inefficient I/O in Hadoop-based programs has been repeatedly reported in the literature. In this article, we address the problem of I/O inefficiency in Hadoop-based massive data analysis by introducing our efficient modification of Hadoop. We first incorporate a columnar data layout into the conventional Hadoop framework without any modification of the Hadoop internals. We also provide Hadoop with indexing capability to save a large amount of I/O while processing not only selection predicates but also star-join queries, which are common in many analysis tasks.
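The abstract's central claim is that a columnar data layout reduces I/O when a query touches only a subset of columns. The following is a minimal illustrative sketch, not the authors' implementation: it compares the bytes that must be scanned to aggregate one field under a row-oriented layout (the whole file) versus a column-oriented layout (one column file). The field names `id`, `name`, and `score` are hypothetical.

```python
# Illustrative sketch of why a columnar layout saves I/O
# (assumed toy schema; not the paper's actual storage format).
import csv
import io

rows = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(1000)]

# Row-oriented layout: every record stores all fields together.
row_buf = io.StringIO()
writer = csv.DictWriter(row_buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)

# Column-oriented layout: one buffer (file) per column.
col_bufs = {
    col: io.StringIO("\n".join(str(r[col]) for r in rows))
    for col in ["id", "name", "score"]
}

# Summing "score" under a row layout forces a scan of the entire file;
# under a columnar layout only the "score" column is read.
row_bytes = len(row_buf.getvalue())
col_bytes = len(col_bufs["score"].getvalue())
print(f"row layout scans {row_bytes} bytes, columnar scans {col_bytes} bytes")
```

The same principle underlies columnar formats used with Hadoop in practice: a selection predicate or star-join probe on a few columns skips the bytes of every column it does not reference.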

Original language: English
Article number: 2682085
Journal: Scientific Programming
Volume: 2018
State: Published - 2018

