hadoopmapreduce

What exactly does Data Locality mean in Hadoop?


Data locality as defined by many Hadoop tutorial sites (i.e. https://techvidvan.com/tutorials/data-locality-in-hadoop-mapreduce/) states that: "Data locality in Hadoop is the process of moving the computation close to where the actual data resides instead of moving large data to computation. This minimizes overall network congestion."

I can understand having the node where the data resides process the computation for those data, instead of moving data around, would be efficient. However, what does it mean by "moving the computation close to where the actual data resides"? Does this mean that if the data sits in a server in Germany, it is better to use the server in France to do the computation on those data instead of using the server in Singapore to do the computation since France is closer to Germany than Singapore?


Solution

  • I +1 Dennis Jaheruddin's answer, and just wanted to add -- you can actually see different locality levels in MR when you check job counters, in Job History UI for example.

    enter image description here

    HDFS and YARN are rack-aware so its not just binary same-or-other node: in the above screen, Data-local means the task was running local to the machine that contained actual data; Rack-local -- that the data wasn't local to the node running the task and needed to be copied, but was still on the same rack; and finally the Other local case -- where the data wasn't available local, nor on the same rack, so it had to be copied over two switches to the node that run the computation.