Data Skew Handling in Heterogeneous Hadoop Cluster
Published: 2017
Author(s) Name: Abhash Visoriya, Deepak Barade, Sunita Varma
Author(s) Affiliation: M.E. Scholar, Department of Computer Engineering, SGSITS Indore, Madhya Pradesh, India.
Abstract
MapReduce has been widely accepted as a distributed processing model
for the big data generated by data-intensive applications. Hadoop is
an open-source implementation that uses MapReduce as its programming
model for processing big data. Although many programming tools exist
for data processing, most of them are not suitable for big data.
The current Hadoop implementation assumes that all nodes are
homogeneous, i.e., that every node has the same computation power. In
practice, however, the cluster environment may be heterogeneous, and
the performance of MapReduce can be improved by taking this
heterogeneity into account.
The second issue when processing data with the MapReduce framework is
data skew, the uneven distribution of data among tasks. When data skew
arises, the tasks assigned the skewed data take much longer to
complete than the other tasks, which degrades the performance of the
overall system.
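To make the skew problem concrete (this is only an illustrative sketch, not the strategy proposed in this paper), the hypothetical Hadoop partitioner below shows one common way skew appears and is worked around: default hash partitioning sends every record with the same key to a single reduce task, so a very frequent "hot" key overloads one reducer while the rest finish early. The class name, the HOT_KEY value, and the round-robin scattering are illustrative assumptions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative skew-aware partitioner (hypothetical example, not the
 * method of this paper). Records of one known hot key are scattered
 * round-robin over all reduce tasks instead of landing on a single
 * reducer; all other keys keep the usual hash-based placement.
 */
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    private static final String HOT_KEY = "the"; // hypothetical hot key
    private int rotation = 0;

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().equals(HOT_KEY)) {
            // Spread the hot key's records over every reducer.
            rotation = (rotation + 1) % numReduceTasks;
            return rotation;
        }
        // Ordinary keys: stable hash placement, as in the default partitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Note that scattering a key across reducers changes the reduce semantics for that key, so a job using such a partitioner would need a second aggregation pass to merge the partial results; the sketch is meant only to show how uneven key frequencies translate into uneven task load.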
In this paper we focus on how data should be placed across the nodes
and how it should be processed, taking the data skew problem into
consideration, so that maximum performance is obtained from the given
resources. Our data handling strategy distributes and processes the
data in such a way that each node delivers its maximum performance,
which in turn increases the overall performance of MapReduce.
Keywords: Data Skew, HDFS, Hadoop, MapReduce, Heterogeneous