From this link [1] (by IBM) I understood the essence of MapReduce: first break a massive workload down into distributed, isolated chunks and conquer them in parallel; the results from each chunk are then collectively handled by a single process to generate the final result.
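To make that flow concrete, here is a minimal sketch in Python of the classic word-count task. It is my own illustration, not from the article: the map_chunk and reduce_pairs names, the sample chunks, and the use of multiprocessing.Pool are all assumptions made for the example.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_chunk(chunk):
    # Map: process one isolated chunk, emitting (word, 1) pairs.
    return [(word, 1) for word in chunk.split()]

def reduce_pairs(pairs):
    # Reduce: a single process collects every intermediate pair
    # and combines them into the final counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    with Pool() as pool:
        mapped = pool.map(map_chunk, chunks)  # conquer the chunks in parallel
    result = reduce_pairs(p for part in mapped for p in part)
    print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Each chunk is mapped in its own worker process, and a single reduce step then stacks up the intermediate pairs into the final answer, exactly the shape described above.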
MapReduce is not a newly invented idea. Rather, this is the first time the pattern has been abstracted and defined as a programming paradigm: a way to tackle a problem by distributing workloads and then stacking up the intermediate results to achieve the final result.
Indeed, it is not a new invention; we can easily find the pattern in real life. In addition to the census example described in the link, a school ranking follows the same pattern: to produce the ranking list, the school first collects results from individual classes, then aggregates them at the department level, and finally at the school level.
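A hypothetical sketch of that ranking analogy, with rank_class playing the map role and merge_rankings playing the combining role (all names and scores are made up for illustration):

```python
import heapq

def rank_class(scores):
    # Each class ranks its own students independently (the isolated chunk).
    return sorted(scores, key=lambda s: s[1], reverse=True)

def merge_rankings(rankings):
    # Merge already-sorted class lists into a department (or school) list.
    return list(heapq.merge(*rankings, key=lambda s: s[1], reverse=True))

class_a = rank_class([("amy", 92), ("bob", 78)])
class_b = rank_class([("cal", 88), ("dee", 95)])
department = merge_rankings([class_a, class_b])
school = merge_rankings([department])  # the same merge step repeats at school level
print(school)  # [('dee', 95), ('amy', 92), ('cal', 88), ('bob', 78)]
```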
However, once MapReduce is identified as a programming paradigm for scaling processing across a clustered environment, the components of the paradigm become clear. We can now apply the pattern in many ways to achieve scalability; it is just a matter of seeing where a task can be broken down into parallel processes and how the intermediate results can be combined into the final answer to the problem.
Now, given that clustered computing is the predominant strategy for ramping up performance in large-scale processing, I understand why MapReduce is considered so powerful and has gained such wide popularity in solution architectures.
[1] What is MapReduce? (IBM)