In an answer, one statement says that "same job running over the same data but on one 20 node cluster, then a 200 node cluster. Overall, the same amount of CPU time will be used on both clusters". Can someone explain this?
I've used the `time` command to measure real time. Sometimes I got more CPU time (Hadoop counter) than actual real time, or vice versa. I know that real time measures the actual clock time elapsed and that it can be greater or less than user+sys.
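To illustrate what I mean, here is a minimal Python sketch of the measurement; the `sleep` command is just a placeholder for the actual job submission:

```python
import os
import subprocess
import time

# Minimal sketch: measure wall-clock ("real") time and user+sys CPU time
# of a child process. `sleep 2` is only a stand-in for the real command.
cmd = ["sleep", "2"]

before = os.times()          # cumulative CPU times, incl. finished children
start = time.monotonic()
subprocess.run(cmd, check=True)
real = time.monotonic() - start
after = os.times()

user = after.children_user - before.children_user
sys_ = after.children_system - before.children_system

print(f"real     = {real:.2f}s")
print(f"user+sys = {user + sys_:.2f}s")
# A mostly idle process gives user+sys well below real; a multi-threaded,
# CPU-bound process can report user+sys greater than real.
```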
I'm still not getting what total CPU time measures in Hadoop. Regarding the `time` command, this answer says it is good to go with user+sys for benchmarks:
total cpu time taken by process = user+sys
If so, it should be the same as the total CPU time from the Hadoop job counter, but I'm getting different results. Note: in the Apache Hive benchmark they considered real time, but that can also be affected by other processes, so I cannot rely on real time.
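To make sure I'm comparing like with like, here is the unit conversion I do before comparing the counter (CPU_MILLISECONDS, which Hadoop reports in milliseconds) with user+sys from `time`; the numbers below are hypothetical:

```python
# Hypothetical numbers, only to show the unit conversion before comparing.
cpu_milliseconds = 1_234_567   # CPU_MILLISECONDS job counter (milliseconds)
user_plus_sys = 95.3           # user + sys from `time`, in seconds

print(f"Hadoop counter : {cpu_milliseconds / 1000.0:.1f} s")
print(f"time user+sys  : {user_plus_sys:.1f} s")
```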
same job running over the same data but on one 20 node cluster, then a 200 node cluster. Overall, the same amount of CPU time will be used on both clusters
This means that if a job takes N hours on a 20-node cluster and M hours on a 200-node cluster, then 20 * N should be equal to M * 200.
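As a toy sanity check of that arithmetic (the total CPU budget below is a made-up number, and the claim assumes the job keeps every node equally busy):

```python
# Toy check: if the total CPU work is fixed, nodes * elapsed_hours is constant.
total_cpu_hours = 400.0            # hypothetical amount of CPU the job needs

n = total_cpu_hours / 20           # elapsed hours on the 20-node cluster  -> 20.0
m = total_cpu_hours / 200          # elapsed hours on the 200-node cluster -> 2.0

assert abs(20 * n - m * 200) < 1e-9   # 20 * N == M * 200
print(f"N = {n} h on 20 nodes, M = {m} h on 200 nodes")
```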
Real time should be your choice, but as you said above, this value may vary from run to run, so you should run the job at least 3 times and take the average as the final result.
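A small sketch of that advice, using a placeholder command in place of the actual job submission:

```python
import statistics
import subprocess
import time

# Run the same job a few times and average the wall-clock (real) time.
# Replace the placeholder command with your own job submission.
cmd = ["sleep", "1"]
runs = 3

real_times = []
for _ in range(runs):
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    real_times.append(time.monotonic() - start)

print("real times:", [f"{t:.2f}s" for t in real_times])
print(f"average over {runs} runs: {statistics.mean(real_times):.2f}s")
```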