pythonmatplotlibseabornhistogramkernel-density

Plotting weighted histograms with weighted KDE (kernel density estimate)


I want to plot two distributions of data as weighted histograms with weighted kernel density estimate (KDE) plots, side by side.

The data (length of DNA fragments, split by categorical variable regions) are integers in (0, 1e8) interval. I can plot the default, unweighted, histograms and KDE without a problem, using the python code below. The code plots histograms for the tiny example of the input data in testdata variable. See the unweighted (default) histograms below.

I want to produce a different plot, where the data in the histograms are weighted by length (= the X axis numeric variable). I used weights option (seaborn.histplot — seaborn documentation):

weights : vector or key in data
If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.

The histograms changed as expected (see weighted histograms plots below). But the KDE (kernel density estimate) lines did not change.

Question: How can I change the kernel density estimate (KDE) to reflect the fact that I am using weighted histograms?


Unweighted (default) histograms:

Unweighted (default) histograms


Weighted histograms:

Weighted histograms


Code with the minimal reproducible example:

import io
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def plot_restriction_digest(df, out_file_base, weights):
     
    # Prevent the python icon from showing in the dock when the script is
    # running:
    matplotlib.use('Agg')
    
    sns.set_theme(style='ticks')
    f, ax = plt.subplots(figsize=(7, 5))
    sns.despine(f)

    hist = sns.histplot(data=df,
                        x='length',
                        hue='regions',

                        weights=weights,
                        
                        # Normalize such that the total area of the histogram
                        # equals 1:
                        stat='density',
                        
                        # stat='count',
                        
                        # Make all histograms visible, otherwise 'captured'
                        # regions histogram is much smaller than 'all' regions
                        # one:
                        common_norm=False,
                        
                        # Default plots too many very thin bins, which are poorly
                        # visible in pdf format (OK in png). Note that 10 bins is
                        # too crude, and 1000 bins makes too many thin bins:
                        bins=100,
                        
                        # X axis log scale:
                        log_scale=True,
                        
                        # Compute a kernel density estimate to smooth the
                        # distribution and show on the plot as lines:
                        kde=True,
                        )
    sns.move_legend(hist, 'upper left')
    plt.savefig(f'{out_file_base}.pdf')
    return

testdata="""
1   all
1   all
2   all
2   all
2   all
3   all
4   captured
4   captured
5   captured
5   captured
5   captured
8   captured
"""

# Default histograms:
df = pd.read_csv(io.StringIO(testdata), sep='\s+', header=None, names='length regions'.split())
plot_restriction_digest(df, 'test_tiny', None)

# Weighted histograms:
df = pd.read_csv(io.StringIO(testdata), sep='\s+', header=None, names='length regions'.split())
plot_restriction_digest(df, 'test_tiny_weighted', 'length')

print('Done.')

Notes:

  1. The two distributions of data are DNA fragment lengths for two types of genomic regions: "all" and "captured", but this is irrelevant to this specific question.
  2. The minimal reproducible example illustrates the question. The real data frame has tens of millions of rows, so the histograms and KDE plots are much more smooth an meaningful. The actual data need the X axis to be log-transformed to better tell the two broad distributions apart.
  3. I am using these packages and versions:
Python 3.11.6

matplotlib-base           3.8.2           py311hfdba5f6_0    conda-forge
numpy                     1.26.3          py311h7125741_0    conda-forge
pandas                    2.2.0           py311hfbe21a1_0    conda-forge
seaborn                   0.13.1               hd8ed1ab_0    conda-forge
seaborn-base              0.13.1             pyhd8ed1ab_0    conda-forge

Solution

  • Histograms and kde plots with a very small sample size often aren't a good indicator for how things behave with more suitable sample sizes. In this case, the default bandwidth seems too wide for this particular dataset. The bandwidth can be scaled via the bw_adjust parameter of the kdeplot. When creating a histplot, parameters ("keywords") for the kde can be provided via the kde_kws dictionary.

    In my test, to both test weights and a log scale, I used a few x values, and gave each a weight of 1, except for two of them, giving them a really high weight.

    A log scale can be mimicked by taking the log of the x values. The axis will show the log values, which will be less intuitive than the original values. For small data sets and integer weights, np.repeat can create a dataset with every value repeated. This doesn't work for very large datasets due to memory constraints.

    To quickly test settings and behavior for the real dataset, waiting time can be reduced by taking a random sample (e.g. df.sample(10000)).

    The test code below indicates that both the weights and the log scale seem to work as expected. One difference is that the default bandwidth is different in both cases; a different bw_scale is used to compensate.

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'x': [1, 10, 20, 30, 40, 50, 60],
                       'w': [1, 10, 1, 1, 1, 10, 1]})
    fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(9, 6))
    
    sns.histplot(df, x='x', bins=10, weights='w', kde=True, log_scale=True, stat='density',
                 kde_kws={'bw_adjust': 0.3}, ax=ax1)
    ax1.set_xticks(df['x'].to_numpy(), df['x'].to_numpy())
    ax1.set_title('sns.histplot with log scale, weights and kde')
    
    sns.histplot(x=np.log(np.repeat(df['x'], df['w'])), bins=10, kde=True, stat='density',
                 kde_kws={'bw_adjust': 0.55}, ax=ax2)
    ax2.set_title('mimicing scale and weights')
    
    plt.tight_layout()
    plt.show()
    

    sns.histplot with weights, kde and log scale