linux hadoop mapreduce benchmarking cpu-time

Which should I use for benchmarking tasks in Hadoop: user+sys time, or the total CPU time spent from the Hadoop job counter?


One answer states: "same job running over the same data but on one 20 node cluster, then a 200 node cluster. Overall, the same amount of CPU time will be used on both clusters". Can someone explain this?

I've used the time command to measure real time. Sometimes I get more CPU time (Hadoop counter) than real time, and sometimes less. I know that real time measures the wall-clock time elapsed, and that it can be greater or less than user+sys.

I still don't understand what the total CPU time in Hadoop measures. Regarding the time command, this answer says that user+sys is the right choice for benchmarks.

  1. Since the total CPU time taken by a process is user+sys, it should match the total CPU time in the Hadoop job counter. But I'm getting different results.
  2. Which time should I use when benchmarking tasks in Hadoop: user+sys, or the total CPU time spent (Hadoop counter)?

Note: the Apache Hive benchmark uses real time, but real time can also be affected by other processes, so I cannot rely on it.


Solution

  • "same job running over the same data but on one 20 node cluster, then a 200 node cluster. Overall, the same amount of CPU time will be used on both clusters"

    This means that if the job takes N hours on a 20-node cluster and M hours on a 200-node cluster, then 20 * N should be roughly equal to 200 * M. The total amount of work is the same; it is only spread across more machines.

    Real time should be your choice, but as you said above, it can vary from run to run, so you should run the job at least 3 times and take the average as the final result.
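
The two points above can be sketched in a few lines of Python. This is only an illustration under idealized assumptions: the cluster sizes come from the quote, but the 10-hour runtime and the three wall-clock measurements are made-up numbers, the helper names are hypothetical, and perfect scaling (no scheduling or shuffle overhead) is assumed.

```python
from statistics import mean, stdev


def node_hours(nodes: int, wall_hours: float) -> float:
    """Aggregate machine time: number of nodes times wall-clock hours."""
    return nodes * wall_hours


def ideal_wall_hours(total_node_hours: float, nodes: int) -> float:
    """Wall-clock time if the same total work is spread evenly over `nodes`."""
    return total_node_hours / nodes


# Hypothetical: the job takes N = 10 hours on the 20-node cluster.
n_hours_on_20 = 10.0
total = node_hours(20, n_hours_on_20)            # 20 * N
m_hours_on_200 = ideal_wall_hours(total, 200)    # M such that 200 * M == 20 * N

# Hypothetical wall-clock times (seconds) from three runs of the same job.
runs_seconds = [412.0, 398.0, 405.0]
avg = mean(runs_seconds)       # value to report as the benchmark result
spread = stdev(runs_seconds)   # rough indication of run-to-run noise

print(f"total work: {total} node-hours, ideal time on 200 nodes: {m_hours_on_200} h")
print(f"average real time: {avg} s (stdev {spread:.1f} s)")
```

Reporting the spread next to the average makes it easier to see whether a difference between two configurations is larger than the run-to-run noise.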