
Code Watch: Map/Reduce, turn by turn



Larry O'Brien
December 18, 2012 (Page 1 of 2)
You've undoubtedly heard of Apache Hadoop, a framework for distributed processing, and the Map/Reduce pattern it has helped make famous. But "Map and Reduce" is a general pattern, not a framework-specific technology. "Map" means "Do some data processing on every element in your collection"; "Reduce" means "Walk over your collection of data (such as that produced by the 'Map' step) and summarize or coalesce the results." For instance, "Map" all the photos you took on your vacation by taking a quick look at each one and marking it as blurry or sharp. "Reduce" them by deleting the blurry ones.
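To make the photo analogy concrete, here is a minimal Python sketch of the two steps. The filenames, the sharpness scores, and the 0.5 cutoff are all made up for illustration, and the built-in `map` and `functools.reduce` stand in for whatever a real framework would provide.

```python
from functools import reduce

# Hypothetical data: (filename, sharpness score between 0.0 and 1.0).
photos = [
    ("beach.jpg", 0.91),
    ("sunset.jpg", 0.34),
    ("hotel.jpg", 0.78),
    ("bigfoot.jpg", 0.41),
]

BLUR_THRESHOLD = 0.5  # assumed cutoff, chosen only for illustration


def tag_photo(photo):
    """Map step: a quick, independent judgment about one photo."""
    name, sharpness = photo
    return {"name": name, "blurry": sharpness < BLUR_THRESHOLD}


def keep_sharp(kept, tagged):
    """Reduce step: fold the tagged photos into a list of keepers."""
    return kept if tagged["blurry"] else kept + [tagged["name"]]


tagged_photos = list(map(tag_photo, photos))
sharp_photos = reduce(keep_sharp, tagged_photos, [])
print(sharp_photos)  # ['beach.jpg', 'hotel.jpg']
```

Note that `tag_photo` looks at exactly one photo and knows nothing about the others, while `keep_sharp` is the only place where a decision about the collection as a whole is made.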

Two things are important. First, make the Map data processing as discrete and rapid as possible. In the case of triaging photos, don't look at the first of 2,000 photos and decide whether to put it in the photo album you share with your friends; just decide whether it's blurry or not. Second, keep the "Reduce" step separate from the Map step. Maybe it turns out that the only photo you have of Bigfoot is a little blurry; value decisions are often hard to make without the context of the entire calculation.

This implies another non-obvious aspect of the Map/Reduce approach: Map/Reduce is really Map and Reduce, and then Map some more and Reduce some more, and then save that and return later to Map a little more, and so on. With digital photography, success comes from rating and tagging your media efficiently and consistently, and then working from those Map/Reduced datasets for different projects (a "Highlights of Our Trip" album versus an "A Glimpse of Bigfoot" album).
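As a rough illustration of reusing one tagged dataset for different projects, the sketch below runs two separate reductions over the same Map output. The records, ratings, and tag names are hypothetical; the point is only that the blurry Bigfoot photo survives because blurriness was recorded in the Map pass rather than acted on there.

```python
# Hypothetical output of an earlier Map pass: each photo already rated and tagged.
tagged_photos = [
    {"name": "beach.jpg", "rating": 5, "blurry": False, "tags": {"scenery"}},
    {"name": "sunset.jpg", "rating": 2, "blurry": True, "tags": {"scenery"}},
    {"name": "trail.jpg", "rating": 4, "blurry": False, "tags": {"forest"}},
    {"name": "bigfoot.jpg", "rating": 3, "blurry": True, "tags": {"forest", "bigfoot"}},
]


def highlights(photos, min_rating=4):
    """One Reduce: sharp, highly rated photos for the trip album."""
    return [p["name"] for p in photos if not p["blurry"] and p["rating"] >= min_rating]


def with_tag(photos, tag):
    """Another Reduce over the same data: everything carrying a tag, blurry or not."""
    return [p["name"] for p in photos if tag in p["tags"]]


trip_album = highlights(tagged_photos)
bigfoot_album = with_tag(tagged_photos, "bigfoot")
print(trip_album)     # ['beach.jpg', 'trail.jpg']
print(bigfoot_album)  # ['bigfoot.jpg']
```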

The same principle holds true with Big Data: Even if you have a hunch about the ultimate answer you're trying to derive, it is more likely to emerge from incremental steps. Although you may ultimately rerun your entire calculation from scratch, avoid doing so during the development stage: reprocessing raw data over and over again, rather than working from an already-Map/Reduced dataset, is infuriating and wasteful. (On the other hand, you must bear in mind the reductions you've already applied, and avoid trying to re-derive something you've already discarded.)
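One simple way to avoid reprocessing raw data during development is to save the Map output and reload it on later runs. The sketch below assumes a small JSON file is an acceptable cache; `expensive_map`, `load_or_build`, and the file name are hypothetical names invented for this example, and a real Big Data job would use its framework's own storage.

```python
import json
import os
import tempfile

# Hypothetical raw input; in practice this would be the large, slow-to-read dataset.
raw_photos = [("beach.jpg", 0.91), ("sunset.jpg", 0.34), ("bigfoot.jpg", 0.41)]

map_runs = 0  # counts how many times the expensive Map pass actually runs


def expensive_map(photos):
    """Stand-in for a slow Map pass over the raw data."""
    global map_runs
    map_runs += 1
    return [{"name": name, "blurry": sharpness < 0.5} for name, sharpness in photos]


def load_or_build(cache_path, photos):
    """Reuse the saved Map output if it exists; otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    tagged = expensive_map(photos)
    with open(cache_path, "w") as f:
        json.dump(tagged, f)
    return tagged


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "tagged_photos.json")
    first = load_or_build(cache, raw_photos)   # runs the Map pass
    second = load_or_build(cache, raw_photos)  # reads the cached result instead

print(map_runs)  # 1
```

The cached records keep only the name and the blurry flag; the original sharpness score is gone. That is the caveat in the parenthetical above: anything the earlier pass discarded cannot be recovered from the saved dataset.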



