Tags: python, python-3.x, pandas, apache-zeppelin, pandas-profiling

NotImplementedError when calling pandas_profiling.ProfileReport.to_widgets() inside Apache Zeppelin


I'm trying to use the pandas_profiling package to automagically describe some data frames from inside Apache Zeppelin.

The code I'm running is:

%pyspark

import sys
print(sys.version_info)

import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)

from pandas_profiling import ProfileReport

df = spark.sql("SELECT * FROM database.table")

profile = ProfileReport(df, title="Report: table")

profile.to_widgets()

My result is:

sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
numpy:  1.19.5
pandas:  1.1.5
pandas_profiling:  3.1.0


Fail to execute line 19: profile.to_widgets()
Traceback (most recent call last):
  File "/tmp/1662648724242-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 19, in <module>
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 414, in to_widgets
    display(self.widgets)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 197, in widgets
    self._widgets = self._render_widgets()
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 315, in _render_widgets
    report = self.report
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 166, in description_set
    self._sample,
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/describe.py", line 56, in describe
    check_dataframe(df)
  File "/usr/local/lib/python3.6/site-packages/multimethod/__init__.py", line 209, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/dataframe.py", line 10, in check_dataframe
    raise NotImplementedError()
NotImplementedError

Any way to work around this? Any hope of working around it from inside Zeppelin?


Solution

  • The NotImplementedError is raised from check_dataframe: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/dataframe.py#L10. check_dataframe uses multimethod to dispatch on the types of its arguments, and in v3.1.0 the only registered implementation is for Pandas DataFrames: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/pandas/dataframe_pandas.py#L11. In your snippet you are passing a Spark dataframe (the result of spark.sql(...)), for which there doesn't appear to be any registered implementation, so the call falls through to the base function that raises. If you convert the Spark dataframe to a Pandas dataframe with the toPandas method, the Pandas implementation of check_dataframe should be dispatched instead. Keep in mind that toPandas collects the whole result onto the driver, so this only suits data that fits in memory there:

    %pyspark
    
    import sys
    print(sys.version_info)
    
    import numpy as np
    print("numpy: ", np.__version__)
    import pandas as pd
    print("pandas: ", pd.__version__)
    import pandas_profiling as pp
    print("pandas_profiling: ", pp.__version__)
    
    from pandas_profiling import ProfileReport
    
    df = spark.sql("SELECT * FROM database.table").toPandas() 
    
    profile = ProfileReport(df, title="Report: table")
    
    profile.to_widgets()
    

    Alternatively, you can try to register your own function for checking Spark dataframes, e.g.:

    from pandas_profiling.model.dataframe import check_dataframe
    from pyspark.sql import DataFrame as SparkDataFrame

    # Register an implementation that multimethod dispatches to for Spark dataframes
    @check_dataframe.register
    def spark_check_dataframe(df: SparkDataFrame):
        # do something here, or leave it as a no-op
        pass
    
    

    but downstream functions in the reporting logic may not be (and are likely not) compatible with Spark dataframes.
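
    To make the dispatch behaviour concrete, here is a minimal sketch that does not use pandas_profiling or multimethod at all. It swaps in functools.singledispatch from the standard library as a stand-in for multimethod, and PandasLikeFrame, SparkLikeFrame and outcome are hypothetical names invented for the illustration. It only shows the general pattern: an unregistered type falls through to the base function, and registering an implementation changes which function gets called.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


class SparkLikeFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""


@singledispatch
def check_dataframe(df):
    # Base function: reached only when no implementation is registered
    # for the type of `df`.
    raise NotImplementedError(f"no check registered for {type(df).__name__}")


@check_dataframe.register
def _(df: PandasLikeFrame):
    # Implementation selected for the pandas-like type.
    return "pandas check ran"


def outcome(df):
    """Call check_dataframe and report what happened as a string."""
    try:
        return check_dataframe(df)
    except NotImplementedError:
        return "NotImplementedError"


# Nothing is registered for the Spark-like type yet, so the base function runs.
before = outcome(SparkLikeFrame())


@check_dataframe.register
def _(df: SparkLikeFrame):
    # Registering an implementation stops the fall-through to the base function.
    return "spark check ran"


after = outcome(SparkLikeFrame())
print(before, "->", after)
```

    Registering a function only gets you past this first check; the sketch says nothing about whether the rest of the report pipeline can handle the new type, which is the caveat above.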

    Another alternative, if you want to keep working with Spark dataframes because of the scale of the data or your familiarity with the API, is spark-df-profiling, which is based on pandas-profiling but built to handle Spark dataframes.