statisticsgnuplotkernel-density

Gnuplot: Meaning of the second column of smooth kdensity


I am new to gnuplot. I am also new to the Kernel Density Resampling using a Gaussian supported by gnuplot using “smooth kdensity”. I played with the gnuplot demo script provided below. I am trying to understand what the meaning of the second column is of $kdensity1. If I print $kdensity1 I get these values:

(...)
140.76 13.1663 13.1663 i
146.032 12.5092 12.5092 i
151.304 11.6501 11.6501 i
156.575 10.6298 10.6298 i
161.847 9.5347 9.5347 i
167.119 8.48325 8.48325 i
172.391 7.56657 7.56657 i
177.662 6.80631 6.80631 i
(...)

The first column seems to be the computed Gaussian kernel over the sample values defined by the second column of $viol1 provided by the randomized expression. But I am trying to understand how the second column of $kdensity1 is computed since it defines the “density” or the spread of the violin plot. And it seems the 20.0 constant from the $2/20.0 computing is assumed. But this value must surely be different given another sample set that has a different range. Thus, how is column 2 of $kdensity1 computed (or what is its relationship) and how can I find the constant (20) for computing the spreading?

nsamp = 3000 
set print $viol1
do for [i=1:nsamp] {
    y = (i%4 == 0) ? 300. +  70.*invnorm(rand(0)) \
      : (i%4 == 1) ? 400. +  10.*invnorm(rand(0)) \
      :              120. +  40.*invnorm(rand(0))
    print sprintf(" 35.0 %8.5g", y)
}
unset print

set title "kdensity mirrored sideways to give a violin plot"

set table $kdensity1
plot $viol1 using 2:(1) smooth kdensity bandwidth 10. with filledcurves above y lt 9 title 'B'
unset table

set border 2
unset margins
unset xtics
set ytics nomirror rangelimited

set xrange [-1:5]
plot  $kdensity1 using (3 + $2/20.):1 with filledcurve x=3 lt 9 notitle, '' using (3 - $2/20.):1 with filledcurve x=3 lt 9 notitle


Solution

  • You are referring to this violin plot demo. Check on Wikipedia Kernel_(statistics) and from gnuplot help kdensity:

    The smooth kdensity option generates and plots a kernel density estimate using Gaussian kernels for the distribution from which a set of values was drawn. Values are taken from the first data column, optional weights are taken from the second column. A Gaussian is placed at the location of each point and the sum of all these Gaussians is plotted as a function. To obtain a normalized histogram, each weight should be 1/number-of-points.
    ...

    and have a look at the further minimized example.

    As I understand, for each datapoint there will be a Gaussian kernel with a given width and the area 1. And you sum up all Gaussians which will give you the shape of the curve in the second column of $kdensity.

    Concerning $2/20., I guess, this factor 20. in the example is just a scaling factor that the two violin plots do not collide in the graph.

    In order to find out a reasonable scaling factor automatically, you could do a stats to get the maximum of the second column of the datablock $kdensity. If you want to compare several violin plots you should take the maximum of all maxima as scaling factor.

    And the 3 + and 3 - in the plot command is simply an x-offset where your violin plot is mirrored.

    Script:

    ### violin plot
    reset session
    
    $Data <<EOD
    1.0
    2.0
    4.0
    1.0
    2.0
    2.0
    EOD
    
    N = |$Data|
    set table $kdensity
        plot $Data u 1:(1) smooth kdensity bandwidth 0.2
    unset table
    
    stats [*:*][*:*] $kdensity u 2
    print STATS_max, STATS_min
    
    set key noautotitle
    set xrange[1:5]
    set yrange[0:5]
    
    plot $kdensity u (3 + $2/STATS_max):1 w filledcurve x=3 lt 9, \
                '' u (3 - $2/STATS_max):1 w filledcurve x=3 lt 9
    ### end of script
    

    Actually, you could shorten the plot command to a single line:

    plot for [i=-1:1:2] $kdensity u (3 + i*$2/STATS_max):1 w filledcurve x=3 lt 9
    

    Result:

    enter image description here