pyspark, pandas-profiling

Does ydata-profiling work in a Spark environment?


I need to analyze a huge table with approximately 7 million rows and 20 columns. I can read the data into a dataframe without using Spark, but I don't have enough memory for the computation.

Does anyone know if the package can work in a distributed Spark environment?

I read the docs at https://ydata-profiling.ydata.ai/docs/master/pages/integrations/pyspark.html, but I can't tell whether the package merely reads data from a Spark DataFrame or runs the computation entirely on Spark. In the former case, it wouldn't solve my memory issue, and since I need to compute correlations I can't use the "minimal" option.


Solution

  • ydata-profiling does work with Spark.

    You only need to provide a PySpark DataFrame as input. Have a look at their Databricks example: https://github.com/ydataai/ydata-profiling/tree/master/examples/integrations/databricks
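
    For illustration, here is a minimal sketch of what that looks like, assuming ydata-profiling 4.x (the version that added Spark DataFrame support) and PySpark installed; the input path is hypothetical:

    ```python
    from pyspark.sql import SparkSession
    from ydata_profiling import ProfileReport

    # Start (or reuse) a Spark session; on Databricks one already exists.
    spark = SparkSession.builder.appName("profiling").getOrCreate()

    # Hypothetical input path -- replace with your own table or source.
    df = spark.read.csv("my_table.csv", header=True, inferSchema=True)

    # ProfileReport accepts the Spark DataFrame directly, so the
    # statistics are computed on the cluster rather than in local memory.
    report = ProfileReport(df, title="Profiling Report")
    report.to_file("report.html")
    ```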