Currently trying to write code to check for data quality of a 7 gb data file. I tried googling exactly but to no avail. Initially, the purpose of the code is to check how many are nulls/NaNs and later on to join it with another datafile and compare the quality between each. We are expecting the second is the more reliable but I would like to later on automate the whole process. I was wondering if there is someone here willing to share their data quality python code using Dask. Thank you
I would suggest the following approach: