Print

The ins and outs of MapReduce



Email
October 19, 2011 —  (Page 1 of 3)
MapReduce is a cost-effective way to process “Big Data” consisting of terabytes, petabytes or even exabytes of data, although there are also good reasons to use MapReduce in much smaller applications.

MapReduce is essentially a framework for processing widely disparate data quickly without needing to set up a schema or structure beforehand. The need for MapReduce is being driven by the growing variety of information sources and data formats, including both unstructured data (e.g. freeform text, images and videos) and semi-structured data (e.g. log files, click streams and sensor data). It is also driven by the need for new processing and analytic paradigms that deliver more cost-effective solutions.

MapReduce was popularized by Google as a means to deal effectively with Big Data, and it is now part of an open-source solution called Hadoop.

A brief history of Hadoop
Hadoop was created in 2004 by Doug Cutting (who named the project after his son’s toy elephant). The project was inspired by Google’s MapReduce and Google File System papers that outlined the framework and distributed file system necessary for processing the enormous, distributed data sets involved in searching the Web. In 2006, Hadoop became a subproject of Lucene (a popular text-search library) at the Apache Software Foundation, and then its own top-level Apache project in 2008.

Hadoop provides an effective way to capture, organize, store, search, share, analyze and visualize disparate data sources (unstructured, semi-structured, etc.) across a large cluster of computers, and it is capable of scaling from tens to thousands of commodity servers, each offering local computation and storage resources.

The map and reduce functions
MapReduce is a software framework that includes a programming model for writing applications capable of processing vast amounts of distributed data in parallel on large clusters of servers. The “Map” function normally has a master node that reads the input file or files, partitions the data set into smaller subsets, and distributes the processing to worker nodes. The worker nodes can, in turn, further distribute the processing, creating a hierarchical tree structure. As such, the worker nodes can all work on much smaller portions of the problem in parallel.



Related Search Term(s): Hadoop, MapReduce

Pages 1 2 3 


Share this link: http://sdt.bz/36028
 
Most Read Latest News Blog Resources

Add comment


Name*
Email*  
Country     


  • Comment
Loading




close
NEXT ARTICLE
Hadoop hits milestone 1.0 release
HBase and cluster managements tools are highlights of the new offering Read More...
 
 
 
 
News on Monday
more>>
SharePoint Tech Report
more>>


   

 
 

Download Current Issue
MAY 2012 PDF ISSUE

Need Back Issues?
DOWNLOAD HERE

Want to subscribe?


 
blogs tab
Creation
To write better software, cultivate your ability to be creative.
05/19/2012 07:40 PM EST

Slick...but who needs it?
compilr.com is a well-designed site and the folks behind it seem to have their heart in the right place. But...who needs it?
05/16/2012 12:45 PM EST

How to be a better software developer
Want to be a better developer? You won't get there by mastering an interesting language or learning a new set of APIs.
05/14/2012 12:18 PM EST

Wooing Galatea
Do yourself a favor and check out Galatea 2.2, a wonderful book by novelist Richard Powers.
05/12/2012 07:05 PM EST

The world as story
An artificial-intelligence system at Carnegie Mellon seeks to understand the world by making statements about it.
05/10/2012 06:39 AM EST

The Rise of the Brogrammer, or the Rise of the Sexist Programmer?
Women in Silicon Valley get vocal about sexist ads and campaigns that contribute to a tense work environment.
05/09/2012 03:14 PM EST

 

Events calendar tab
5/23/2012 to 5/24/2012
Chicago
IEG

6/3/2012 to 6/7/2012
Orlando
IBM Rational

6/10/2012 to 6/15/2012
Las Vegas
SQE

6/10/2012 to 6/15/2012
Las Vegas
SQE

6/11/2012 to 6/14/2012
Bellevue, Wash.
AMD