I often want to view a random sample of k rows from a DataFrame rather than just the head/tail, for which I would use df.sample(frac=1.0).iloc[:k].
When I chain .style onto this sample, the styler only sees the k selected rows, so the resulting colour-mapping is inaccurate because it only considers the sample.
How can I shuffle, sample, and style a DataFrame, whilst ensuring the styler uses all of the data?
import pandas as pd
import numpy as np
#Data for testing
df = pd.DataFrame({
'device_id': np.random.randint(200, 800, size=1000),
'normalised_score': np.random.uniform(0, 2, size=1000),
'severity_level': np.random.randint(-3, 4, size=1000),
})
#Inaccurate styling if I chain .style onto a sampled DataFrame:
df.sample(frac=1.0).iloc[:5].style.background_gradient(subset='severity_level', cmap='RdYlGn')
I am using a colourmap that roughly goes red-white-green over the range of severity_level (-3, -2, -1, 0, +1, +2, +3). A value of 0 should therefore display as white, but it gets coloured red in the sample below:
The colouring should consider all severity_level values, even though I only display a few randomly-selected rows.
You would need to pipe df into the styler first, and then chain on .hide, where you keep a random subset of k rows visible by hiding all the others: .hide(df.sample(frac=1.0).index[k:]).
.hide doesn't take lambda functions, so you can't shuffle before .style and then access the shuffled DataFrame later in the chain.
#... data from OP
k = 5  #Number of rows to display
(
df
.style
.background_gradient(subset='severity_level', cmap='RdYlGn')
#Shuffle and select k indices (by hiding rows coming after k)
.hide(df.sample(frac=1.0).index[k:])
)
The styler now uses all values of severity_level, irrespective of the sample displayed.