hadoopapache-crunch

How does Apache Crunch PTable collectValues work internally


I was going through some documentations related to HDFS architecture and Apache crunch PTable. Based on my understandings, when we generate PTable the data is internally stored across the Data nodes in HDFS.

This means, if I have PTable with <K1,V1>,<K2,V2>,<K1,V3>,<K3,V4>,<K2,V5> and two Data nodes D1 and D2 in HDFS. Let's say each data node has a capacity to hold 3 pairs. So D1 will hold <K1,V1>,<K2,V2>,<K1,V3> and D2 will hold <K3,V4>,<K2,V5>.

If I do collectValues on this PTable, I am internally running another map-reduce job to get these values from PTable and generate pairs of <K,Collection<V>>. So at the end I will have, <K1,Collection<V1,V3>>, <K2,Collection<V2,V5>> and <K3,Collection<V4>>. And again these pairs will be distributed to different data nodes.

Now, I have this doubt that how will the Collection values (V1,V3 of K1) be stored in the generated PTable? Will this data be distributed across the nodes too, i.e., will

or, V1 and V3 will stored in one node only.

If all the collection values for a key are stored in one node (not-distributed), then for large data sets, won't the processing on the collected values of each key become slow?


Solution

  • All values of the same key will be in one node. This is the concept of map reduce in general - and not of crunch. The reasoning is that you want all your items in one place - this is the localization you want to achieve.