[SOLVED] Hive query shows few reducers killed but query is still running. Will the output be proper?

Hive query shows few reducers killed but query is still running. Will the output be proper?

I have a complex query with multiple left outer joins running for the last 1 hour in Amazon AWS EMR. But few reducers are shown as Failed and Killed.

My question is why do some reducers get killed? Will the final output be proper?

Solution

Usually each container has 3 attempts before final fail (configurable, as @rbyndoor mentioned). If one attempt has failed, it is being restarted until the number of attempts reaches limit, and if it is failed, the whole vertex is failed, all other tasks being killed.

Rare failures of some task attempts is not so critical issue, especially when running on EMR cluster with spot nodes, which can be removed during execution, causing failures and partial restarts of some vertices.

In most cases the reason of failures you can find in tracker logs.

And of course this is not the reason to switch to the deprecated MR. Try to find what is the root cause and fix it.

In some marginal cases when even if the job with some failed attempts succeeded, the data produced may be partially corrupted. For example when using some non-deterministic function in the distribute by clause. Like rand(). In this case restarted container may try to copy data produced by previous step (mapper), and the spot node with mapper results is already removed. In such case some previous step containers are restarted, but the data produced may be different because of non-deterministic nature of rand function.

About killed tasks.

Mappers or reducers can be killed because of many reasons. First of all when one of the containers has failed completely, all other tasks running are being killed. If speculative execution is switched on, duplicated tasks are killed, if the task is not responding for a long time, etc. This is quite normal and usually is not an indicator that something is wrong. If the whole job has failed or you have many attempts failures, you need to inspect failed tasks logs to find the reason, not killed ones.