I can't explain the behaviour of trim_mean() in scipy.stats.
I learned that the trimmed mean is the average of a series of numbers after discarding a given fraction from each end of the distribution.
In the following example, I get the result 6.1111:
from scipy.stats import trim_mean
data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
trim_percentage = 0.05 # Trim 5% from each end
result = trim_mean(sorted(data), trim_percentage)
print(f"result = {result}")
result = 6.111111111111111
However, I expected 1 and 30 to be cut out, because they fall below the 5th percentile and above the 95th percentile.
When I do it manually:
import numpy as np
data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p5, p95 = np.percentile(data, [5, 95])
print(f"The 5th percentile = {p5}\nThe 95th percentile = {p95}")
trim_average = np.mean(list(filter(lambda x: p5 < x < p95, data)))
print(f"trimmed average = {trim_average}")
I got this:
The 5th percentile = 1.4
The 95th percentile = 19.999999999999993
trimmed average = 3.4285714285714284
Does this mean that trim_mean() treats each number separately and assumes a uniform distribution? The proportiontocut parameter is documented as "Fraction to cut off of both tails of the distribution". But why does it behave as if the distribution were not considered?
The phrasing in the documentation should be more precise: it cuts a fraction of the observations in your sample. You have 9 values, and 5% of 9 values is 0.45 values. However, it can't cut off a fraction of a value. The documentation states that it
Slices off less if proportion results in a non-integer slice index
So in your case, zero values are cut from both ends before taking the mean.
import numpy as np
from scipy import stats
x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
np.mean(x) # 6.111111111111111
stats.trim_mean(x, 0.05) # 6.111111111111111
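Internally, trim_mean sorts the data and drops int(proportiontocut * n) observations from each end, where n is the sample size. A minimal sketch of that count-based trimming (not SciPy's exact code) is:

```python
import numpy as np

def trim_mean_sketch(data, proportiontocut):
    """Sketch of count-based trimming: drop int(p * n) values from each end."""
    a = np.sort(np.asarray(data, dtype=float))
    n = len(a)
    lowercut = int(proportiontocut * n)  # floored: int(0.05 * 9) == 0
    uppercut = n - lowercut
    return np.mean(a[lowercut:uppercut])

x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
print(trim_mean_sketch(x, 0.05))  # nothing is cut: plain mean, 6.111...
print(trim_mean_sketch(x, 0.12))  # int(0.12 * 9) == 1: one value cut per end
```

With proportiontocut = 0.12, one value is cut from each end of the sorted data, leaving [2, 2, 3, 4, 4, 4, 5], whose mean is 24/7 ≈ 3.4286.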
You can verify that the result changes when proportiontocut exceeds 1/len(x):
from scipy import stats
x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p = 1 / len(x)
eps = 1e-15
stats.trim_mean(x, p-eps) # 6.111111111111111
stats.trim_mean(x, p+eps) # 3.4285714285714284
This behavior appears to be consistent with the description of a trimmed mean on Wikipedia, at least:
This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points... For example, given a set of 8 points, trimming by 12.5% would discard the minimum and maximum value in the sample: the smallest and largest values, and would compute the mean of the remaining 6 points.
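The Wikipedia example can be checked directly with trim_mean: with 8 points, 12.5% per tail is exactly one observation, so the minimum and maximum are dropped (the data values here are made up for illustration):

```python
from scipy import stats

y = [3, 1, 4, 1, 5, 9, 2, 6]      # 8 made-up points
# int(0.125 * 8) == 1, so one value is cut from each tail:
# sorted y = [1, 1, 2, 3, 4, 5, 6, 9] -> mean of [1, 2, 3, 4, 5, 6]
print(stats.trim_mean(y, 0.125))  # 3.5
```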
SciPy does not have a function that trims based on percentiles (of which there are many conventions). For that, you'd need to write your own function, or perhaps there is such a function in another library.
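For example, a minimal percentile-based version (one of many possible conventions; here only values strictly between the given percentiles are kept, matching your manual computation) could look like:

```python
import numpy as np

def percentile_trimmed_mean(data, lower=5, upper=95):
    """Mean of values strictly between the lower and upper percentiles.

    A sketch under one convention; percentile-trimming conventions vary.
    """
    a = np.asarray(data, dtype=float)
    lo, hi = np.percentile(a, [lower, upper])
    kept = a[(a > lo) & (a < hi)]  # keep strictly-inside values
    return kept.mean()

x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
print(percentile_trimmed_mean(x))  # ~3.4286, same as the manual computation
```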
Please consider opening an issue about improving the documentation.