pysparkpalantir-foundryfoundry-code-repositoriesfoundry-python-transform

Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?


I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver / increasing my risk of driver OOM?


Solution

  • This will not materialize your entire dataset to the driver, nor will it necessarily increase your risk of OOM. (It forces the evaluation of the incoming DataFrame, so if that evaluation means you will OOM then that will be recognized at the point you .count(), but the .count() itself didn't cause this, it only made you realize it).

    What this will do, however, is halt the execution of your job from the point you make the call to. .count(). This is because this value must be known to the driver before it can proceed with any of the rest of your work, so it's not a particularly efficient use of Spark / distributed compute. Use .count() only when necessary, i.e. when making choices about partition counts or other such dynamic sizing operations.