pandas sensitive-data

Pandas profiling sensitive data - looking to randomize sample data in profile report


I am working with sensitive data. Sample data in the profile report shows the first 5 rows from the dataset. If you are looking at a profile report for columns with first_name, last_name, and SSN, you can stitch together 5 people's PII.

I was able to suppress the Sample Data tab with:

 profile = ProfileReport(df, title="Profiling Report", samples={"head": 0, "tail": 0})

However, when you click More details the sample data (first 5 rows) is still displayed.

I was then able to suppress additional data in the report with:

 df.profile_report(sensitive=True)

This is swinging the pendulum too far in the other direction. The distribution of values and other key output is being masked.

Is there a way to simply have the sample data be 5 records selected at random?

Thank you!!!


Solution

  • No, there isn't. As per their documentation, the sample data has only two sections: first and last records. You can configure how many records are shown, but not how they are selected (the sections are called First and Last).

    I would recommend asking for that feature if that's something that matters for what you are developing.
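In the meantime, one possible workaround is to shuffle the rows with plain pandas before building the report, so that whatever ends up in the "First" section is effectively a random draw. The sketch below is only that, a sketch: the helper name `shuffle_rows` and the toy data are made up for illustration, and the profiling call is left as a comment copied from the question. Note that each displayed row is still a real, intact record, and anything in the report that depends on row order would change.

```python
import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Return a copy of df with rows in random order and a fresh 0..n-1 index.

    Hypothetical helper, not part of pandas or the profiling library.
    """
    # frac=1 samples every row exactly once, i.e. a full shuffle.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data standing in for the real dataset (invented values).
df = pd.DataFrame(
    {
        "first_name": ["Ann", "Bob", "Cy", "Dee", "Eve", "Fay", "Gus", "Hal"],
        "last_name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    }
)

shuffled = shuffle_rows(df, seed=0)

# The profiling call stays as in the question; only the input order differs:
# profile = ProfileReport(shuffled, title="Profiling Report")
```

Per-column statistics such as value counts and histograms do not depend on row order, so they should come out the same as for the original frame.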