Friday, January 2, 2015

MapReduce, I just understood it

From this link [1] (by IBM) I understood the essence of MapReduce: first break a massive workload down into distributed, isolated chunks and conquer them in parallel, then have the results from each chunk collectively handled by a single process to produce the final result.
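Here is a minimal sketch of that idea in Python. The word-count task is my own illustration, not taken from the IBM article: each chunk of text is mapped to partial counts in parallel, and a single reduce step merges the intermediate results into the final answer.

from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    # Map phase: count words in one isolated chunk of the workload.
    return Counter(chunk.split())

def reduce_counts(partial_counts):
    # Reduce phase: a single process merges the intermediate results.
    total = Counter()
    for c in partial_counts:
        total += c
    return total

if __name__ == "__main__":
    chunks = ["to be or not to be", "be quick", "not now"]
    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)   # conquer the chunks in parallel
    print(reduce_counts(partials))               # final result from one process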

MapReduce is not a newly invented idea; rather, it is the first time this approach has been abstracted and defined as a programming paradigm: deal with a problem by distributing the workload and then combining the intermediate results into the final result.

Indeed, it is not a new invention; we can easily find the pattern in real life. Besides the census example described in the link, a school ranking follows the same pattern: to produce the ranking list, results are first collected from each class, then aggregated at the department level, and finally combined at the school level.
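The ranking example could be sketched like this; the data and the class/department grouping below are made up purely for illustration. Each class produces its own ranked list, the class lists are merged at the department level, and the department rankings are merged once more at the school level.

import heapq

def rank(scores):
    # Rank one group of students, highest score first.
    return sorted(scores, key=lambda s: s[1], reverse=True)

def merge_rankings(rankings):
    # Combine already-ranked lists from the level below into one ranking.
    return list(heapq.merge(*rankings, key=lambda s: s[1], reverse=True))

# Hypothetical data: (student, score) pairs per class, grouped by department.
departments = {
    "science": [[("ann", 92), ("bob", 75)], [("cho", 88)]],
    "arts":    [[("dee", 81), ("eli", 95)]],
}

# Class level -> department level -> school level, reusing the same merge step.
dept_rankings = [merge_rankings([rank(c) for c in classes])
                 for classes in departments.values()]
school_ranking = merge_rankings(dept_rankings)
print(school_ranking)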

However, once MapReduce is identified as a programming paradigm for scaling processing across a clustered environment, the components of the paradigm become clear, and we can apply the pattern in many ways to achieve scalability. It is just a matter of spotting where a task can be broken down into parallel processes and how the partial results can be rolled up into the final answer.

Now, given that clustered computing is the predominant strategy for ramping up performance in large-scale processing, I understand why MapReduce is considered so powerful and has gained wide popularity in solution architectures.

1. What is MapReduce?
