I am applying Tukey's outlier detection algorithm to a data set of prices.
I need it calculated by group (another variable in the same data set), which works fine with aggregate
until I need two more means: one using only the data between the 5th percentile and the median, and one using only the data between the median and the 95th percentile.
As far as I know, a mean trimmed symmetrically, dropping the upper and lower 5% (10% in total) of the data before averaging, is computed with aggregate(doc$x, by=list(doc$group), FUN=mean, trim = 0.05).
I don't know how to do the next step: calculating the lower and upper means with the median as the dividing point, while still leaving out the lowest and highest 5%. This is what I tried:
medlow <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim =c(0.05,0.5))
medup <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim =c(0.5,0.95))
medtrunc <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim = 0.05)
I expected one value per group, but instead I get this error:
Error in mean.default(X[[i]], ...) : 'trim' must be numeric of length one.
First, I think you are using aggregate and trim the wrong way. The message 'trim' must be numeric of length one means that trim accepts a single fraction, which is removed from both the lower and the upper tail of the distribution:
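# Example data; the scores are random, so your numbers will differ
df <- data.frame(
  gender = c(
    "male", "male", "male", "male", "female", "female", "female", "female"
  ),
  score = rnorm(8, 10, 2)
)
# Mean per gender after dropping 10% of the values from each tail
aggregate(score ~ gender, data = df, mean, trim = 0.1)
  gender     score
1 female 11.385263
2   male  9.954465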
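To make the symmetric behaviour of trim concrete, here is a minimal sketch. The vector x holds made-up values used only for illustration. It compares mean(x, trim = 0.1) with a manual version that sorts the data and drops the same number of values from each end, and then reproduces the error from the question.

```r
# Made-up values for illustration only (not from the question's data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# mean() drops floor(n * trim) observations from EACH end of the sorted data
k <- floor(length(x) * 0.1)                        # one value per tail here
manual <- mean(sort(x)[(k + 1):(length(x) - k)])   # mean of 2, 3, ..., 9

trimmed <- mean(x, trim = 0.1)
print(all.equal(manual, trimmed))

# A length-two trim is rejected, which is the error shown in the question
msg <- tryCatch(
  mean(x, trim = c(0.05, 0.5)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

So trim cannot express an asymmetric range such as "5th percentile to median" on its own; the data have to be split or filtered first.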
To split at the median and compute a trimmed mean for each half, add a new variable MedianSplit that marks each row as at or below the overall median ("lower") or above it ("upper"), for example with a simple for loop:
# Overall median of the scores, computed once
med <- median(df$score)
df$MedianSplit <- NA_character_
for (i in seq_len(nrow(df))) {
  if (df$score[i] <= med) {
    df$MedianSplit[i] <- "lower"
  } else {
    df$MedianSplit[i] <- "upper"
  }
}
df
  gender     score MedianSplit
1   male  7.062605       lower
2   male  9.373052       upper
3   male  6.592681       lower
4   male  7.298971       lower
5 female  7.795813       lower
6 female  7.800914       upper
7 female 12.431028       upper
8 female 10.661753       upper
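Note that the loop splits at the median of all scores. Since the question asks for a calculation by group, a per-group split may be closer to what is needed. A minimal sketch of that variant, assuming the same column names and using made-up scores, relies on ave() to compute the median within each group:

```r
# Made-up scores for illustration only
df <- data.frame(
  gender = rep(c("male", "female"), each = 4),
  score  = c(7.1, 9.4, 6.6, 7.3, 7.8, 7.9, 12.4, 10.7)
)

# ave() applies median() within each gender and returns a vector
# of the same length as df$score, aligned row by row
group_median <- ave(df$score, df$gender, FUN = median)

# Vectorised split: no explicit loop needed
df$MedianSplit <- ifelse(df$score <= group_median, "lower", "upper")
print(df)
```

With this version every group gets its own "lower" and "upper" half, instead of some groups falling mostly on one side of the overall median.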
Then, use aggregate to compute the trimmed means. For the data at or below the median (i.e., [0, 0.5]):
aggregate(
  score ~ gender,
  data = df[which(df$MedianSplit == "lower"), ],
  mean, trim = 0.05
)
  gender    score
1 female 7.795813
2   male 6.984752
And for the data above the median (i.e., (0.5, 1]):
aggregate(
  score ~ gender,
  data = df[which(df$MedianSplit == "upper"), ],
  mean, trim = 0.05
)
  gender     score
1 female 10.297898
2   male  9.373052
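One caveat: trim = 0.05 applied to each half removes 5% from both ends of that half, and with only a few values per half nothing is removed at all, because floor(n * 0.05) is 0. If the goal is literally "the mean of the values between the 5th percentile and the median" within each group, one possible alternative is to filter by quantiles inside the function passed to aggregate. The sketch below is an assumption about what is wanted, not the only way to do it; mean_between, p_lo and p_hi are hypothetical names, and the data are simulated.

```r
# Hypothetical helper: mean of the values of x lying between two of its quantiles
mean_between <- function(x, p_lo, p_hi) {
  q <- quantile(x, probs = c(p_lo, p_hi), names = FALSE, na.rm = TRUE)
  mean(x[x >= q[1] & x <= q[2]], na.rm = TRUE)
}

# Quick check on 1:100: values between the 5th percentile and the median
print(mean_between(1:100, 0.05, 0.50))
print(mean_between(1:100, 0.50, 0.95))

# Simulated prices in two groups (illustration only)
set.seed(1)
df <- data.frame(
  group = rep(c("a", "b"), each = 50),
  price = c(rnorm(50, 10, 2), rnorm(50, 20, 3))
)

# Extra arguments after FUN are passed on to mean_between for each group
medlow <- aggregate(price ~ group, data = df, FUN = mean_between,
                    p_lo = 0.05, p_hi = 0.50)
medup  <- aggregate(price ~ group, data = df, FUN = mean_between,
                    p_lo = 0.50, p_hi = 0.95)
print(medlow)
print(medup)
```

Because the quantiles are computed inside the function, they are taken per group automatically, and no MedianSplit column is needed.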