
Configure Sweetviz to analyze an object-type column without conversion


Consider the following short dataframe example:

import pandas as pd

df = pd.DataFrame({'column1': [2, 4, 8, 0],
                   'column2': [2, 0, 0, 0],
                   'column3': ["test", 2, 1, 8]})

df.dtypes shows that the datatypes of the columns are:

column1     int64
column2     int64
column3    object

column3 has dtype object because it holds values of mixed types (one string and three integers).
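To see exactly what the object column contains, the Python type of each value can be counted. This is only a small inspection sketch using plain pandas; the name `type_counts` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column1': [2, 4, 8, 0],
                   'column2': [2, 0, 0, 0],
                   'column3': ["test", 2, 1, 8]})

# Map every value in column3 to the name of its Python type, then count them.
type_counts = df['column3'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

For the sample data this reports three int values and one str value, which is the mix that keeps the column at dtype object.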

Now I would like to run Sweetviz over this sample dataset to generate a report on the columns and their data:

import sweetviz as sv
report = sv.analyze(df)
report.show_notebook()

The problem is that Sweetviz seems to notice that column3 holds mostly numbers even though its dtype is object. Instead of generating the report, it gives the following suggestion:

     Convert series [column3] to a numerical value (if makes sense):
     One way to do this is:
     df['column3'] = pd.to_numeric(df['column3'], errors='coerce')

Unfortunately, for my use case this isn't an option: I want the report to also highlight wrongly used columns in my data, so I want the column treated as object even though only a small fraction of its values are not numbers.
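To make the objection concrete, here is a small pandas-only sketch of what the suggested conversion does to the sample data. The names `coerced` and `lost` are illustrative and not part of Sweetviz.

```python
import pandas as pd

df = pd.DataFrame({'column1': [2, 4, 8, 0],
                   'column2': [2, 0, 0, 0],
                   'column3': ["test", 2, 1, 8]})

# The conversion Sweetviz suggests: values that cannot be parsed become NaN.
coerced = pd.to_numeric(df['column3'], errors='coerce')

# Values that were present originally but are NaN after coercion,
# i.e. exactly the "wrongly used" entries the report should surface.
lost = df.loc[coerced.isna() & df['column3'].notna(), 'column3']
print(lost)
```

After coercion the string "test" is replaced by NaN, so the offending entry would no longer be visible in a report built from the converted column.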

I have played around with the parameters that sweetviz provides:

feature_config = sv.FeatureConfig(force_text=['column3'])
# Pass the config to analyze(); without feat_cfg it is never used
report = sv.analyze(df, feat_cfg=feature_config)
report.show_notebook()

With this config I would expect Sweetviz to treat column3 as text and bypass its own type detection.

Unfortunately, I get the same suggestion to convert the column to numeric, which would turn the string values into NaN.

I also tried the other possible FeatureConfig parameters for column3: skip, force_cat and force_num. force_cat and force_num don't help at all and lead to the same result. skip leaves column3 out of the report, which is not a solution either.

Is there any way to force Sweetviz to leave the object-typed column3 as it is and analyze it? Can someone confirm that checking the data types of column values is a feature of Sweetviz?


Solution

  • object is ambiguous: an object column can, for example, contain nothing but integers. Sweetviz appears to run some "smart" checks to validate or infer dtypes.

    I would suggest converting the column explicitly to category:

    import sweetviz as sv
    
    report = sv.analyze(df.astype({'column3': 'category'}))
    report.show_notebook()
    

    or to string:

    import sweetviz as sv
    
    report = sv.analyze(df.astype({'column3': 'str'}))
    report.show_notebook()
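The effect of the two conversions can be checked with pandas alone before handing the frame to Sweetviz. This is a minimal sketch; `as_str`, `as_cat`, `all_strings` and `is_categorical` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column1': [2, 4, 8, 0],
                   'column2': [2, 0, 0, 0],
                   'column3': ["test", 2, 1, 8]})

# astype returns a new frame, so the original df keeps its object column.
as_str = df.astype({'column3': 'str'})
as_cat = df.astype({'column3': 'category'})

# After the string conversion every value is a Python str ("2", not 2).
all_strings = as_str['column3'].map(lambda v: isinstance(v, str)).all()

# After the category conversion the column has a categorical dtype,
# with each distinct original value kept as its own category.
is_categorical = isinstance(as_cat['column3'].dtype, pd.CategoricalDtype)

print(as_str['column3'].tolist())
print(as_cat['column3'].cat.categories.tolist())
```

Either way no value is dropped or replaced by NaN, so the non-numeric entry stays visible. With the string conversion the numbers become text as well, which is worth keeping in mind when reading the resulting report.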