I'm trying to use the pandas_profiling package to automagically describe some data frames from inside Apache Zeppelin.
The code I'm running is:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport
df = spark.sql("SELECT * FROM database.table")
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
My result is:
sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
numpy: 1.19.5
pandas: 1.1.5
pandas_profiling: 3.1.0
Fail to execute line 19: profile.to_widgets()
Traceback (most recent call last):
File "/tmp/1662648724242-0/zeppelin_python.py", line 158, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 19, in <module>
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 414, in to_widgets
display(self.widgets)
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 197, in widgets
self._widgets = self._render_widgets()
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 315, in _render_widgets
report = self.report
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 179, in report
self._report = get_report_structure(self.config, self.description_set)
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/profile_report.py", line 166, in description_set
self._sample,
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/describe.py", line 56, in describe
check_dataframe(df)
File "/usr/local/lib/python3.6/site-packages/multimethod/__init__.py", line 209, in __call__
return self[tuple(map(self.get_type, args))](*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/pandas_profiling/model/dataframe.py", line 10, in check_dataframe
raise NotImplementedError()
NotImplementedError
Any way to work around this? Any hope of working around it from inside Zeppelin?
The NotImplementedError is raised from check_dataframe: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/dataframe.py#L10. check_dataframe uses multimethod to dispatch on argument types, and in v3.1.0 the only implementation registered is for pandas DataFrames: https://github.com/ydataai/pandas-profiling/blob/v3.1.0/src/pandas_profiling/model/pandas/dataframe_pandas.py#L11. In your snippet you are passing a Spark DataFrame (the result of spark.sql(...)), for which there doesn't appear to be any registered implementation, so the call falls through to the base function that raises. If you convert the Spark DataFrame to a pandas DataFrame with the toPandas method (which collects the full result onto the driver, so the data has to fit in driver memory), the correct check_dataframe implementation should be called:
%pyspark
import sys
print(sys.version_info)
import numpy as np
print("numpy: ", np.__version__)
import pandas as pd
print("pandas: ", pd.__version__)
import pandas_profiling as pp
print("pandas_profiling: ", pp.__version__)
from pandas_profiling import ProfileReport
df = spark.sql("SELECT * FROM database.table").toPandas()
profile = ProfileReport(df, title="Report: table")
profile.to_widgets()
Alternatively, you can try registering your own function for checking Spark dataframes, e.g.:
from pandas_profiling.model.dataframe import check_dataframe
from pyspark.sql import DataFrame as SparkDataFrame
@check_dataframe.register
def spark_check_dataframe(df: SparkDataFrame):
    # do something here or just make it a `pass`
    pass
but downstream functions in the reporting logic may not be (and are likely not) compatible with Spark dataframes.
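To make the dispatch behaviour concrete, here is a minimal sketch of the same pattern. It uses the standard library's functools.singledispatch as a stand-in for multimethod, and PandasLikeFrame / SparkLikeFrame are hypothetical placeholder classes, not the real pandas or PySpark types. It only illustrates why an unregistered type hits the base function and raises, and how registering a handler changes that; it is not pandas-profiling's actual code.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


class SparkLikeFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""


@singledispatch
def check_dataframe(df):
    # Base function: reached when no handler is registered for type(df).
    raise NotImplementedError()


@check_dataframe.register(PandasLikeFrame)
def _check_pandas_like(df):
    # Handler chosen when the argument is a PandasLikeFrame.
    return "pandas handler"


def try_check(df):
    # Helper that reports which path the dispatch took.
    try:
        return check_dataframe(df)
    except NotImplementedError:
        return "NotImplementedError"


# No handler for SparkLikeFrame yet, so the base function raises.
before = try_check(SparkLikeFrame())


@check_dataframe.register(SparkLikeFrame)
def _check_spark_like(df):
    # Newly registered handler for the Spark-like type.
    return "spark handler"


# The same call now resolves to the registered handler.
after = try_check(SparkLikeFrame())
```

Registering a handler only gets you past this one check; the rest of the report pipeline would still need its own Spark-aware implementations.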
Another alternative, if you want to keep working with Spark dataframes because of the scale of the data or your comfort with the API, is spark-df-profiling, which is based on pandas-profiling but built to handle Spark dataframes.