I'm trying to find the diameter at 50% of the total particles. So I was able to use this code to find the diameter size at 50% of the total particles. So the first column is the diameter size, the next column is the number of particles for each diameter size. The last column is the sum which is the sum of particles as the diameter size decreases. In this example, the total particles is equal to 81394 and the 50% of particles will equal 40697. So I want to find the diameter bin that 40697 particles are in. Example Link
smps<-read_csv("C:/Users/missl/Desktop/smps.csv")
transpose <- t(smps)
transpose <- as.data.frame(transpose)
rev_data_frame <- rev(transpose)
rev_data_frame <- t(rev_data_frame)
smps_1 <- as.data.frame(rev_data_frame)
smps_1$sum <- cumsum(smps_1[, 2])
smps_2 <- smps_1[which.min(abs((smps_1[51,3]/2)> smps_1$sum)),]
# dput(smps_1)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501,
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112,
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6,
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91,
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), Particle = c(6.3743,
5.1308, 12.0985, 5.4748, 9.823, 21.8273, 37.5763, 51.2838, 59.5292,
110.409, 158.339, 186.517, 273.547, 358.379, 454.531, 583.11,
757.914, 859.94, 970.185, 1142.2, 1335.7, 1427, 1552.13, 1651.4,
1679.82, 1606.73, 1411.52, 1305.51, 1266.84, 1309.77, 1477.73,
1938.36, 2333.89, 2710.21, 3080.09, 3663.15, 4158.74, 4379.57,
4005.18, 3426.03, 3795.27, 4143.3, 4466.91, 3890.09, 3297.93,
3111.45, 2229.47, 1015.87, 1379.63, 898.331, 1382.19), sum = c(6.3743,
11.5051, 23.6036, 29.0784, 38.9014, 60.7287, 98.305, 149.5888,
209.118, 319.527, 477.866, 664.383, 937.93, 1296.309, 1750.84,
2333.95, 3091.864, 3951.804, 4921.989, 6064.189, 7399.889, 8826.889,
10379.02, 12030.42, 13710.24, 15316.969, 16728.49, 18034, 19300.84,
20610.61, 22088.34, 24026.7, 26360.59, 29070.8, 32150.89, 35814.04,
39972.78, 44352.35, 48357.53, 51783.56, 55578.83, 59722.13, 64189.04,
68079.13, 71377.06, 74488.51, 76717.98, 77733.85, 79113.48, 80011.81,
81394)), class = "data.frame", row.names = c(NA, -51L))
I'm having trouble figuring out how to expand this code to other columns. Currently, I have only been able to figure out to get the diameter column and the sum of particles as the diameter size decreases. So ultimately, I want to find the diameter for each column that it is 50% of the total particles. So for column V1 it will be 14.1 nm, V2 would be 14.1 nm and V3 will be 15.80 nm.
#dput(loopsmps6)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501,
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112,
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6,
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91,
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), V1 = c(1382.19,
898.331, 1379.63, 1015.87, 2229.47, 3111.45, 3297.93, 3890.09,
4466.91, 4143.3, 3795.27, 3426.03, 4005.18, 4379.57, 4158.74,
3663.15, 3080.09, 2710.21, 2333.89, 1938.36, 1477.73, 1309.77,
1266.84, 1305.51, 1411.52, 1606.73, 1679.82, 1651.4, 1552.13,
1427, 1335.7, 1142.2, 970.185, 859.94, 757.914, 583.11, 454.531,
358.379, 273.547, 186.517, 158.339, 110.409, 59.5292, 51.2838,
37.5763, 21.8273, 9.823, 5.4748, 12.0985, 5.1308, 6.3743), V2 = c(1380.47,
213.662, 1945.19, 2062.84, 2692.5, 3274, 3736.01, 4301.49, 4792.09,
4626.88, 5150.13, 6270.89, 6731.52, 6553.28, 5123.29, 4635.26,
4478.81, 3869.76, 3231.69, 2552.12, 1922.94, 1662.09, 1513.65,
1427.39, 1411.19, 1574.37, 1730.65, 1725.3, 1578.94, 1413.3,
1308.7, 1209.57, 1046.64, 838.961, 725.613, 535.499, 432.743,
355.892, 257.469, 206.747, 150.875, 115.105, 85.9249, 38.3227,
30.5767, 28.3947, 19.1909, 5.5532, 8.9367, 6.5134, 2.6135), V3 = c(1365.3,
1082.56, 1485.89, 1486.07, 1854.67, 2457.91, 3456.4, 3481.57,
3272.76, 3598.13, 4004.16, 4480.2, 5368.56, 5734.28, 4916.71,
4610.23, 4441.32, 3655.14, 2915.26, 2380.36, 2821.82, 2313.77,
1942.49, 1732.05, 1471.91, 1637.06, 1691.77, 1626.86, 1512.16,
1466.37, 1275.49, 1152.71, 1007.67, 821.176, 659.546, 584.179,
456.626, 340.22, 279.581, 198.135, 165.915, 130.029, 83.975,
35.9218, 17.6912, 18.6721, 14.289, 3.4856, 12.5274, 3.2755, 2.8518
), V4 = c(164.171, 135.322, 630.374, 1421.33, 2090.21, 2529.67,
2567.1, 4216.73, 6077.57, 5994.51, 6616.98, 7851.76, 7280.47,
6554.56, 5801.08, 5730.74, 5781.37, 4454.9, 3352.84, 2660.91,
2259.52, 2082.14, 1949.51, 1774.88, 1534.58, 1650.44, 1624.44,
1601.32, 1574.22, 1460.39, 1370.78, 1227.98, 1047.13, 840.175,
617.41, 564.059, 440.58, 329.58, 288.765, 196.878, 167.035, 114.322,
61.5434, 63.1678, 33.1914, 21.8862, 14.2781, 7.3284, 10.9794,
7.4324, 3.9122), V5 = c(935.992, 981.076, 1343.07, 2085.9, 2688.29,
3198.92, 3554.72, 4366.22, 5170.72, 5025.25, 5299.9, 5939.16,
6302.48, 6363.15, 5824.72, 4750.64, 3525.67, 2888.71, 2346.75,
1995.1, 1893.54, 1638.47, 1514.59, 1483.24, 1439.87, 1610.69,
1677.33, 1686.02, 1618.14, 1426.04, 1245.07, 1145.19, 1036.26,
890.568, 751.837, 567.525, 464.186, 375.798, 259.373, 217.886,
152.946, 105.769, 76.1406, 50.6423, 34.3214, 22.6856, 12.8905,
5.3805, 4.7017, 4.5505, 3.2585)), row.names = c(NA, 51L), class = "data.frame")
[![enter image description here][4]][4]
Without really understanding your analysis code, I understand:
You have a sequence of observations, each observation is one particle.
Each row in your dataset is an aggregation of observations. For humans, this is nice, for R this is not the most comfortable data format. R wants one row to equal one observations (=particle). By that I mean a long format, where the diameter is printed explicitly for each particle, even though this means many duplicate diameters.
After this, it should be easy to apply the quantile()
function. Because I think that's what you are looking for when you say: "I want the diameter
bin that 50% of the particles are [in or below]". So you want the 50%-quantile, aka the median. There even is the median()
function for that.
Using the data you provided:
# Using dput() of smps_1 from question
# Step 1: Expand dataset into longer format
# This is dirty: rounding Particle because it is not a whole number
# But the overall outcome of n = 81394 rows is equal to the cumsum end value, so it cannot be that bad.
long_smps <- tidyr::uncount(smps_1, weights = round(Particle))
# Step 2: Compute Median
median(long_smps$Diameter)
#> 14.1
# Same as manually computed by OP
We get the same 50%-quantile of 14.1 as in your question! Now we can easily visualize the distribution of the variable as well with histograms.
Note that the rounding I apply is a dirty fix to accomodate the fact, that your particle counts are not whole numbers. So somewhere, we are overlooking something which is reflected by this rounding. This means, maybe this is not the correct solution for your data.
As of my understanding, in smps_6
the variables V1
to V5
contain particle counts for each diameter at 5 different measurement dates.
head(smps_6)
# Diameter V1 V2 V3 V4 V5
#1 1000 1382.190 1380.470 1365.30 164.171 935.992
#2 891 898.331 213.662 1082.56 135.322 981.076
#3 794 1379.630 1945.190 1485.89 630.374 1343.070
#4 708 1015.870 2062.840 1486.07 1421.330 2085.900
#5 631 2229.470 2692.500 1854.67 2090.210 2688.290
#6 562 3111.450 3274.000 2457.91 2529.670 3198.920
We again need to expand the diameter column. The goal is to have the diameters of each variable in a long format, e.g. for V1 the "uncounted" diameter sequence would look like this:
1000 1000 1000 ...
(repeated 1382.19 times) followed by 891 891 891 ...
(repeated 898.331 times) - and so on for all the other rows.
This is what tidyr::uncount()
accomplishes.
You see here why I am rounding, because you cannot repeat the diameter 1000 for .19 times. Either there was a counted observation or not.
The following answer ignores this problem and assumes it is OK to round to whole numbers.
We cannot expand all 5 variables into a long format if they are in the same data frame sharing the same diameter column. Because e.g. in V1, a diameter of 1000 is occurring 1382 times, in V2 - V5 this number is different.
So we have to create individual data frames, where there is just a combination of the Diameter
column and the specific V...
column containing the count data. We have to do this 5 times to accommodate the particle counts V1
to V5
.
I achieve this by using a for
loop.
This is my solution code. What it does:
smps_6
, thenDiameter
column and the current V...
column# allocate storage for the results
results <- vector("double", length = 5)
# this is one of many ways to do it
# i chose iterating over 2:6 because your particle colums in smps_6 are at 2:6
for (i in 2:6) {
results[i - 1] <- smps_6 |>
# narrow down selection to only the important variables
# diameter column is always present
dplyr::select(Diameter, !!i) |>
# put the data in a longer format
# where each observation = 1 particle = 1 row
# the weights is always the count data of the current column
tidyr::uncount(weights = round(smps_6[[i]])) |>
# we don't need the count data any more, we only need the diameter
# we need it as a vector to pass it into median
# this equals median(dataframe$diameter) but is compatible with the pipe
dplyr::pull(Diameter) |>
median()
}
# recreate your variable names to clearly see which results belong where
names(results) <- paste0("V", 1:5)
results
# V1 V2 V3 V4 V5
# 224 224 200 224 224
Note that I am not getting the same results as you are: V1
is not 14.1 in my case. So clearly there is something wrong we have to investigate, but this is not the fault of the loop. The the median of the smps_6
data you dput()
(as loopsmps_6
, i shortened the name) is in fact 224 for V1