I often want to view a random sample of k rows from a DataFrame rather than just the head/tail, for which I would use df.sample(k).
When I chain .style onto this sample, the styler only sees the k selected rows, so the resulting colour-mapping is inaccurate: it is scaled to the sample rather than to the full data.
How can I shuffle, sample, and style a DataFrame, whilst ensuring the styler uses all of the data?
import pandas as pd
import numpy as np
# Data for testing
df = pd.DataFrame({
    'device_id': np.random.randint(200, 800, size=1000),
    'normalised_score': np.random.uniform(0, 2, size=1000),
    'severity_level': np.random.randint(-3, 4, size=1000),
})
# Inaccurate styling if I chain .style onto a sampled DataFrame:
df.sample(5).style.background_gradient(subset='severity_level', cmap='RdYlGn')
I am using a colourmap that roughly goes red-white-green over the range of severity_level (-3, -2, -1, 0, +1, +2, +3). A value of 0 should therefore display as white, but it gets coloured red in the sample below:
The colouring should consider all severity_level values, even though I only display a few randomly-selected rows.
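To make the cause concrete, here is a small sketch comparing the value range of the full column with the range of a 5-row sample. The seeds (default_rng(0), random_state=0) are only assumptions added for reproducibility and are not part of my real data; the point is that the sample's minimum and maximum can be narrower than the full range, and the gradient is scaled to whichever range the styler sees.

```python
import numpy as np
import pandas as pd

# Seeded test data (the seed is an assumption, used only for reproducibility)
rng = np.random.default_rng(0)
df = pd.DataFrame({'severity_level': rng.integers(-3, 4, size=1000)})

# A small random sample, as produced by df.sample(k)
sample = df.sample(5, random_state=0)

# Range over all rows versus range over the sampled rows only
full_range = (df['severity_level'].min(), df['severity_level'].max())
sample_range = (sample['severity_level'].min(), sample['severity_level'].max())

print('full range:  ', full_range)
print('sample range:', sample_range)
```

Whenever the sample range is narrower than the full range, the gradient endpoints are anchored to the sample's extremes, so a mid-range value such as 0 can end up at the red end.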
The .background_gradient function accepts vmin and vmax arguments that define the range for the gradient. When these parameters are left unspecified, the minimum and maximum values are taken from the data (or from gmap, if given), but it is also possible to specify them directly.
The appropriate gradient colours can therefore be achieved even when .sample is called on the DataFrame first, by passing the min/max values of the original DataFrame's 'severity_level' column to .background_gradient.
k = 5
(
    df
    .sample(n=k)
    .style
    .background_gradient(
        subset='severity_level',
        cmap='RdYlGn',
        # Take the gradient limits from the full DataFrame, not the sample
        vmin=df['severity_level'].min(),
        vmax=df['severity_level'].max()
    )
)
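If this pattern is needed repeatedly, it could be wrapped in a small helper. The sketch below is one possible way to do that; gradient_limits and style_sample are hypothetical names introduced here for illustration, not pandas functions, and the random_state parameter is an optional addition for reproducible samples.

```python
import pandas as pd


def gradient_limits(df, column):
    """Return (vmin, vmax) for `column`, computed over every row of `df`."""
    return df[column].min(), df[column].max()


def style_sample(df, column, k=5, cmap='RdYlGn', random_state=None):
    """Sample k rows, colouring `column` on the scale of the full DataFrame.

    Hypothetical helper: the limits are computed before sampling, so the
    styler uses the full data's range even though it only sees k rows.
    """
    vmin, vmax = gradient_limits(df, column)
    return (
        df
        .sample(n=k, random_state=random_state)
        .style
        .background_gradient(subset=column, cmap=cmap, vmin=vmin, vmax=vmax)
    )
```

With the test DataFrame from the question, style_sample(df, 'severity_level', k=5) would return a Styler for 5 random rows whose colours are scaled to the full column. Rendering the gradient requires matplotlib to be installed.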