Data Skew Handling in Heterogeneous Hadoop Cluster
Published: 2017
Author(s) Name: Abhash Visoriya, Deepak Barade, Sunita Varma
Author(s) Affiliation: M.E. Scholar, Department of Computer Engineering, SGSITS Indore, Madhya Pradesh, India.
Abstract
MapReduce has been widely accepted as a distributed processing model
for the big data generated by data-intensive applications. Hadoop is
an open-source implementation that uses MapReduce as its programming
model for processing big data. Although many programming tools exist
for data processing, most of them are not suitable for big data.
The current Hadoop implementation assumes that all nodes are
homogeneous, i.e., that every node has the same computation power. In
practice, however, the cluster environment may be heterogeneous, and
the performance of MapReduce can be improved by taking this
heterogeneity into account.
The second issue when processing data with the MapReduce framework is
data skew, the uneven distribution of data among tasks. When data skew
arises, the tasks assigned the skewed data take much longer to
complete than the other tasks, which degrades the performance of the
overall system.
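To make the skew problem concrete (this is only an illustrative sketch, not the strategy proposed in this paper), the hypothetical Hadoop partitioner below shows one common way skew appears and is worked around: default hash partitioning sends every record with the same key to a single reduce task, so a very frequent "hot" key overloads one reducer while the rest finish early. The class name, the HOT_KEY value, and the round-robin scattering are illustrative assumptions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative skew-aware partitioner (hypothetical example, not the
 * method of this paper). Records of one known hot key are scattered
 * round-robin over all reduce tasks instead of landing on a single
 * reducer; all other keys keep the usual hash-based placement.
 */
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    private static final String HOT_KEY = "the"; // hypothetical hot key
    private int rotation = 0;

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().equals(HOT_KEY)) {
            // Spread the hot key's records over every reducer.
            rotation = (rotation + 1) % numReduceTasks;
            return rotation;
        }
        // Ordinary keys: stable hash placement, as in the default partitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Note that scattering a key across reducers changes the reduce semantics for that key, so a job using such a partitioner would need a second aggregation pass to merge the partial results; the sketch is meant only to show how uneven key frequencies translate into uneven task load.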
In this paper we focus on how data should be placed across the nodes
and how it should be processed, taking the data skew problem into
consideration, so that maximum performance is obtained from the given
resources. Our data handling strategy distributes and processes the
data in such a way that each node delivers its maximum performance,
which in turn increases the overall performance of MapReduce.
Keywords: Data Skew, HDFS, Hadoop, MapReduce, Heterogeneous