I am predicting latitude and longitude coordinates. When I predict, for example, the latitude, I want to compare the prediction with another variable that holds the centroids of the clusters I built from the latitude and longitude. I want to return the cluster (stored in a separate variable) whose centroid is closest to the predicted latitude. I based my setup on another Stack Overflow post, but I don't get the right cluster as the answer. Can someone help me see what I did wrong?
I want the 'pred_cluster_test' variable to contain the cluster (ClusterEnd) that belongs to the ClusterEnd_LatitudeCenter closest to the latitude prediction (predictions_test).
df <- dfTraining %>%
  group_by(TripID) %>%
  mutate(pred_cluster_test = case_when(
    ClusterEnd_LatitudeCenter == predictions_test ~ ClusterEnd[ClusterEnd_LatitudeCenter],
    TRUE ~ ClusterEnd[sapply(ClusterEnd_LatitudeCenter,
                             function(x) which.min(x - predictions_test))]
  ))
This is what the data looks like:
structure(list(EndLatitude = c(38.26, 38.218, 38.255, 38.258,
38.213, 38.215), EndLongitude = c(-85.75, -85.754, -85.746, -85.751,
-85.751, -85.757), ClusterEnd = c(1, 4, 1, 5, 4, 4), ClusterEnd_LatitudeCenter = c(38.25629,
38.21723, 38.25629, 38.25322, 38.21723, 38.21723), ClusterEnd_LongitudeCenter = c(-85.74133,
-85.75955, -85.74133, -85.75783, -85.75955, -85.75955), predictions_test = c(`1` = 38.2407296518939,
`2` = 38.2326115950784, `3` = 38.2428487622735, `4` = 38.2449069816005,
`5` = 38.234314694847, `6` = 38.2347388488934), pred_cluster_test = c(38.25629,
38.21723, 38.25629, 38.25322, 38.21723, 38.21723)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Provided that I understand correctly what is expected, the following may work:
library(dplyr)
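# Return the index of the cluster centre closest to a single prediction x
foo <- function(x, cluster_coords) {
  # pair the prediction with every cluster centre, one pair per row
  mat <- cbind(x, cluster_coords)
  # dist() on a two-element row gives the absolute difference of the pair
  distance <- apply(mat, MARGIN = 1, FUN = dist, method = "euclidean")
  which.min(distance)
}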
df %>%
  mutate(
    cluster_pred_test = ClusterEnd[
      sapply(
        predictions_test,
        function(x) foo(x, ClusterEnd_LatitudeCenter)
      )
    ]
  ) %>%
  pull(cluster_pred_test)
[1] 5 4 5 5 4 4
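As for what probably went wrong in the attempt from the question: `which.min(x - predictions_test)` minimises the signed difference, so it picks the most negative gap rather than the smallest one, and the `sapply` there loops over the centres instead of over the predictions. For a single coordinate, the same lookup can be sketched in base R with an absolute difference. This is only a minimal sketch; `nearest_cluster` is a hypothetical helper name, and the vectors are copied from the sample data in the question.

```r
# Cluster labels and latitude centres copied from the sample data in the question
cluster_end <- c(1, 4, 1, 5, 4, 4)
lat_center  <- c(38.25629, 38.21723, 38.25629, 38.25322, 38.21723, 38.21723)
pred_lat    <- c(38.2407296518939, 38.2326115950784, 38.2428487622735,
                 38.2449069816005, 38.234314694847, 38.2347388488934)

# nearest_cluster() is a hypothetical helper name, not a dplyr function
nearest_cluster <- function(pred, centers, labels) {
  # for each prediction, find the index of the centre with the smallest absolute gap
  idx <- vapply(pred, function(p) which.min(abs(p - centers)), integer(1))
  # translate centre indices into cluster labels
  labels[idx]
}

nearest_cluster(pred_lat, lat_center, cluster_end)
#> [1] 5 4 5 5 4 4
```

The result matches the output above, since `dist()` on a pair of numbers reduces to their absolute difference.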
You may want to edit this to include both of your coordinates, and look into the dplyr::group_map and dplyr::group_modify functions, which may help you achieve efficient grouped operations.
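Including both coordinates could look roughly like the sketch below. It assumes you also have a longitude prediction for each row (the sample data does not contain one, so the longitude used here is made up for illustration), and it measures distance in raw degrees, which is only a reasonable approximation over a small area. `nearest_cluster_2d` is a hypothetical helper name.

```r
# Distinct cluster centres taken from the sample data in the question
centers <- data.frame(
  cluster = c(1, 4, 5),
  lat     = c(38.25629, 38.21723, 38.25322),
  lon     = c(-85.74133, -85.75955, -85.75783)
)

# nearest_cluster_2d() is a hypothetical helper, not part of dplyr
nearest_cluster_2d <- function(pred_lat, pred_lon, centers) {
  idx <- mapply(function(la, lo) {
    # squared Euclidean distance in raw degrees (assumption: small area)
    which.min((centers$lat - la)^2 + (centers$lon - lo)^2)
  }, pred_lat, pred_lon)
  # translate centre indices into cluster labels
  centers$cluster[idx]
}

# 38.2407 is (rounded) the first latitude prediction from the sample data;
# -85.75 is a made-up longitude used purely for illustration
nearest_cluster_2d(38.2407, -85.75, centers)
```

Inside a pipeline this could be called from `mutate()` with the two prediction columns as the first two arguments.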