I am new to gnuplot. I am also new to the Kernel Density Resampling using a Gaussian supported by gnuplot using “smooth kdensity”. I played with the gnuplot demo script provided below. I am trying to understand what the meaning of the second column is of $kdensity1. If I print $kdensity1 I get these values:
(...)
140.76 13.1663 13.1663 i
146.032 12.5092 12.5092 i
151.304 11.6501 11.6501 i
156.575 10.6298 10.6298 i
161.847 9.5347 9.5347 i
167.119 8.48325 8.48325 i
172.391 7.56657 7.56657 i
177.662 6.80631 6.80631 i
(...)
The first column seems to be the computed Gaussian kernel over the sample values defined by the second column of $viol1 provided by the randomized expression. But I am trying to understand how the second column of $kdensity1 is computed since it defines the “density” or the spread of the violin plot. And it seems the 20.0 constant from the $2/20.0 computing is assumed. But this value must surely be different given another sample set that has a different range. Thus, how is column 2 of $kdensity1 computed (or what is its relationship) and how can I find the constant (20) for computing the spreading?
nsamp = 3000
set print $viol1
do for [i=1:nsamp] {
y = (i%4 == 0) ? 300. + 70.*invnorm(rand(0)) \
: (i%4 == 1) ? 400. + 10.*invnorm(rand(0)) \
: 120. + 40.*invnorm(rand(0))
print sprintf(" 35.0 %8.5g", y)
}
unset print
set title "kdensity mirrored sideways to give a violin plot"
set table $kdensity1
plot $viol1 using 2:(1) smooth kdensity bandwidth 10. with filledcurves above y lt 9 title 'B'
unset table
set border 2
unset margins
unset xtics
set ytics nomirror rangelimited
set xrange [-1:5]
plot $kdensity1 using (3 + $2/20.):1 with filledcurve x=3 lt 9 notitle, '' using (3 - $2/20.):1 with filledcurve x=3 lt 9 notitle
You are referring to this violin plot demo. Check on Wikipedia Kernel_(statistics) and from gnuplot help kdensity
:
The smooth kdensity option generates and plots a kernel density estimate using Gaussian kernels for the distribution from which a set of values was drawn. Values are taken from the first data column, optional weights are taken from the second column. A Gaussian is placed at the location of each point and the sum of all these Gaussians is plotted as a function. To obtain a normalized histogram, each weight should be 1/number-of-points.
...
and have a look at the further minimized example.
$Data
requires only one column of datasmooth kdensity
the part (1)
is the weight of that datapoint, here constant value of 1
(because of the parentheses). As help kdensity
suggests, you can use (1/N)
instead, where N
is the number of points to get a normalized violin plot.with filledcurves above y lt 9 title 'B'
for plotting it to a table. This generates a unnecessary 3rd column which is identical to the 2nd column.As I understand, for each datapoint there will be a Gaussian kernel with a given width and the area 1. And you sum up all Gaussians which will give you the shape of the curve in the second column of $kdensity
.
Concerning $2/20.
, I guess, this factor 20.
in the example is just a scaling factor that the two violin plots do not collide in the graph.
In order to find out a reasonable scaling factor automatically, you could do a stats
to get the maximum of the second column of the datablock $kdensity
.
If you want to compare several violin plots you should take the maximum of all maxima as scaling factor.
And the 3 +
and 3 -
in the plot command is simply an x-offset where your violin plot is mirrored.
Script:
### violin plot
reset session
$Data <<EOD
1.0
2.0
4.0
1.0
2.0
2.0
EOD
N = |$Data|
set table $kdensity
plot $Data u 1:(1) smooth kdensity bandwidth 0.2
unset table
stats [*:*][*:*] $kdensity u 2
print STATS_max, STATS_min
set key noautotitle
set xrange[1:5]
set yrange[0:5]
plot $kdensity u (3 + $2/STATS_max):1 w filledcurve x=3 lt 9, \
'' u (3 - $2/STATS_max):1 w filledcurve x=3 lt 9
### end of script
Actually, you could shorten the plot command to a single line:
plot for [i=-1:1:2] $kdensity u (3 + i*$2/STATS_max):1 w filledcurve x=3 lt 9
Result: