I am working with sensitive data. The sample data in the profile report shows the first 5 rows of the dataset. If you are looking at a profile report for columns such as first_name, last_name, and SSN, you can stitch together 5 people's PII.
I was able to suppress the Sample Data tab with:
profile = ProfileReport(df, title="Profiling Report", samples={"head": 0, "tail": 0})
However, when you click More details, the sample data (first 5 rows) is still displayed.
I was then able to suppress additional data in the report with:
df.profile_report(sensitive=True)
This swings the pendulum too far in the other direction: the distribution of values and other key output are masked.
Is there a way to simply have the sample data be 5 records selected at random?
Thank you!!!
No, there isn't. As per their documentation, there are only 2 sample sections: first and last records. You can configure how many records are shown, but not how they are selected (the sections are called First and Last).
I would recommend asking for that feature if that's something that matters for what you are developing.
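In the meantime, one possible workaround outside the library's own configuration is to shuffle the rows with pandas before building the report, so that whatever the report treats as the "first" rows is a random draw. This is only a sketch: shuffle_rows and the placeholder data are hypothetical names made up for illustration, and I have not verified how every section of the report picks its rows.

```python
import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Return a copy of df with rows in random order and a fresh 0..n-1 index."""
    # sample(frac=1) draws every row exactly once, in random order.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up placeholder data, for illustration only.
df = pd.DataFrame(
    {
        "first_name": ["Ann", "Bob", "Cy", "Dee", "Eli", "Fay"],
        "last_name": ["Ash", "Birch", "Cedar", "Dale", "Elm", "Fir"],
        "SSN": [f"000-00-000{i}" for i in range(1, 7)],
    }
)

shuffled = shuffle_rows(df, seed=42)

# Then build the report from the shuffled frame, importing ProfileReport
# the same way as in the question (not run here):
# profile = ProfileReport(shuffled, title="Profiling Report",
#                         samples={"head": 5, "tail": 0})
```

Keep in mind that the shuffled rows are still real records, so the 5 rows shown remain 5 real people's PII; this only changes which rows appear. Row order should not change the per-column statistics, since each row is kept intact.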