Pages

Saturday, 14 June 2014

Map Reduce Concept



Let us consider an example


  • Suppose we have a file containing some text.
  • We want to find out how many words in that file have a specific length.
  • Say we want to know the number of words of length 6.
  • Following the MapReduce concept, we map each word to a function that calculates its length.
  • Words of the same length are grouped together.
  • These groups are then passed on to the reduce functions. Reduce functions can also work in parallel.
  • Each such parallel reduce function receives one length as its key and produces the count of words having that length.
  • In our example, one reduce function would output that there are 3 words of length 6.
  • Similarly, other reduce functions may output the number of words of length 7, 8, or any other length.


Why use the Hadoop Framework?

Let us take an example where several files are to be processed on a single machine.


  • Assume several files are being processed and one of them is large enough to dominate the run, eating up processing time and leaving the other files unprocessed.
  • Hence we decide to divide these files into chunks of equal size. This ensures that all the files get the same amount of processing time.
  • On a single machine, we can achieve some parallelism by allotting each chunk to a thread.
  • Now imagine a larger data set; such a huge data set may challenge the processing capacity of a single machine, and this is where the Hadoop framework comes into the picture.
  • Although parallelising the work is feasible, it is tough to implement by hand.
  • The Hadoop framework helps us achieve distributed processing of huge data sets by spreading work across several commodity machines in a cluster.
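The single-machine, chunk-per-thread idea can be sketched as follows (a toy illustration, assuming small in-memory "files" of text; a real framework like Hadoop handles the chunking and scheduling for us):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 16  # characters per chunk; a toy value for illustration

def split_into_chunks(data, size=CHUNK_SIZE):
    # Divide the input into equal-sized chunks so no single
    # large file dominates the processing time
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    # A stand-in workload: count characters in the chunk
    return len(chunk)

files = ["short file", "a much longer file that would otherwise dominate the run"]
chunks = [c for f in files for c in split_into_chunks(f)]

# Each chunk is handed to a thread from the pool, so the
# chunks of both files are processed side by side
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(process_chunk, chunks))

print(total)  # same answer as processing the files whole
```

Once the data outgrows one machine's threads, the same split-then-combine shape is what Hadoop distributes across a cluster.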

The Hadoop framework supports a programming model to achieve this parallelism, and that programming model is MapReduce. MapReduce helps analyse large-scale data, provided enough machines are deployed.