Comparative Analysis of MapReduce, Hive and Pig
Over the last few decades, technology has advanced at an ever faster rate, and this growth has led to the generation of enormous volumes of data. Processing such large amounts of data is a serious problem for organizations, which makes data analysis crucial for business intelligence. Much of the data they produce is raw, unstructured, or semi-structured and cannot be stored in a conventional data warehouse. Traditional data models cannot store and process data at this scale, so we need systems that are more flexible, scalable, fault-tolerant, compatible, and cheap. The Apache Software Foundation provides such a framework: the Hadoop platform, which is designed specifically for large data sets. Hadoop uses the MapReduce model to process data across a cluster. MapReduce was designed for the development of large-scale, distributed, fault-tolerant data processing applications, and such applications are used by governments and industrial organizations. In MapReduce, jobs are written as a map function and a reduce function, while the Hadoop framework handles all the details of parallelizing the work: it schedules the various parts of a job on different nodes in the cluster, monitors their progress, and recovers from failures. This paper describes the working of Hive and Pig and shows how Pig queries are more powerful and take less time to perform map and reduce tasks.
Features of MapReduce
1. Simplicity of Development for Various Applications: the idea behind the MapReduce framework is really simple: no socket programming, no threading or special synchronization logic, and no special techniques for dealing with large amounts of data. The Hadoop architecture takes care of all of this. The developer's main job is to use functional programming concepts to build data processing applications that operate on one record at a time. The map function operates on a record (which may be a single line, a single block of data, or a single file) and produces intermediate key-value pairs. The reduce function operates on these intermediate key-value pairs, processing all values that share the same key together, and produces the final output.
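The record-to-pairs-to-groups flow described above can be sketched in a few lines of plain Python. This is a minimal, single-machine simulation of the word-count example, not Hadoop API code: `map_phase` and `reduce_phase` are hypothetical names standing in for the user's map and reduce functions, and the `sorted`/`groupby` step stands in for the shuffle that the framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: for each input record, emit intermediate (key, value) pairs.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort and group the intermediate pairs by key
    # (in Hadoop this is done by the framework, not the developer).
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        # Reduce: process all values that share a key together.
        yield (key, sum(value for _, value in group))

records = ["big data", "big cluster"]
print(dict(reduce_phase(map_phase(records))))
# {'big': 2, 'cluster': 1, 'data': 1}
```

In a real Hadoop job the map and reduce functions run as separate tasks on different nodes, but the contract they must satisfy is exactly the one shown here.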
2. Scalability: Since the tasks run in parallel on different machines in a cluster without communicating with each other explicitly or sharing state, additional machines can easily be added to the cluster, and applications immediately take advantage of the additional hardware.
3. Automatic Parallelization and Distribution of Work: Developers focus only on the map and reduce functions that process the records. The splitting of a job into multiple tasks and the distribution of those tasks among the nodes of the cluster are responsibilities handled by the Hadoop framework.
4. Fault Tolerance: The framework takes care of any failure that may occur while data is being processed across the many nodes of a cluster.