r cluster-analysis k-means unsupervised-learning

k-medoids: keep class labels consistent across areas


I have a problem controlling which of the two class labels (1 and 2) each cluster receives in a k-medoids classification task. I'd like to apply cluster::clara separately to two areas (ID), g2 and g3, and get the same class labelling in both. In my example:

# Packages
library(cluster)
library(ggplot2)

my_ds <-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/class_areas_ds.csv")
str(my_ds)
# 'data.frame': 194789 obs. of  5 variables:
#  $ x  : num  426060 426060 426060 426060 426060 ...
#  $ y  : num  8217410 8217410 8217410 8217410 8217410 ...
#  $ ID : chr  "g2" "g2" "g2" "g2" ...
#  $ R  : num  0.455 0.427 0.373 0.463 0.529 ...
#  $ HUE: num  -0.00397 -0.00384 -0.0028 -0.00369 -0.00352 ..

# Classification based on the `R` and `HUE` variables
res <- NULL
areas <- unique(my_ds$ID)
for (i in seq_along(areas)) {
  my_ds_split <- my_ds[my_ds$ID == areas[i], ]
  k.medoids.res <- cluster::clara(my_ds_split[, 4:ncol(my_ds_split)], 2, metric = "manhattan")
  my_ds_split.F <- cbind(my_ds_split, class = k.medoids.res$clustering)
  # Recode cluster 1 as class 0 and cluster 2 as class 1
  my_ds_split.F$class <- ifelse(my_ds_split.F$class == 1, 0, 1)
  res <- rbind(res, my_ds_split.F)
}
res <- as.data.frame(res)

# Plot the results
plots <- list()
for (g in seq_along(areas)) {
  my_ds_split_class <- res[res$ID == areas[g], ]
  plots[[g]] <- ggplot() +
    geom_point(data = my_ds_split_class,
               aes(x = x, y = y, color = class)) +
    theme_void()
}
plots[[1]]

[Figure p1: classification of area g2]

plots[[2]] 

[Figure p2: classification of area g3]

In the plots, the classification of area g2 is the opposite of g3. Running a single classification on the g2 and g3 data together is not an option, because my original data set has 90 thousand areas and I have only 64 GB of RAM.

Is there any way to get consistent class labels across several areas?


Solution

  • There is a trick to it! Always start with the highest or lowest value in the data set: prepend that row before the classification and remove it afterwards. It works very well. In this case, using the lowest value of the variable R:

    library(dplyr)

    # Inside the loop over areas:
    my_ds_split <- my_ds[my_ds$ID == areas[i], ]
    # Row with the lowest R, prepended as the first observation
    min.start.value <- my_ds_split %>%
      slice(which.min(R))
    my_ds_split <- rbind(min.start.value, my_ds_split)
    k.medoids.res <- cluster::clara(my_ds_split[, 4:ncol(my_ds_split)], 2, metric = "manhattan")
    my_ds_split.F <- cbind(my_ds_split, class = k.medoids.res$clustering)
    # Remove the prepended row again
    my_ds_split.F <- my_ds_split.F[-1, ]
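
A different route to the same agreement, as a minimal sketch rather than the method used above: instead of relying on row order, relabel the clusters after fitting by comparing the medoids that clara returns in its `medoids` component. The helper `relabel_by_medoid` is hypothetical, the data frame `toy` is synthetic data standing in for `my_ds`, and the rule "the cluster whose medoid has the lower R becomes class 0" is an assumption about what the classes should mean.

```r
library(cluster)

# Hypothetical helper: map clara() cluster ids to classes 0, 1, ... ordered by
# the medoid value of `var`, so class 0 is always the cluster with the lowest medoid.
relabel_by_medoid <- function(fit, var = "R") {
  ord <- order(fit$medoids[, var])        # cluster ids sorted by medoid value
  new_label <- integer(length(ord))
  new_label[ord] <- seq_along(ord) - 1L   # lowest medoid -> 0, next -> 1
  new_label[fit$clustering]
}

# Synthetic stand-in for my_ds: two areas whose low-R and high-R groups
# appear in opposite row order.
set.seed(1)
toy <- data.frame(
  ID  = rep(c("a", "b"), each = 200),
  R   = c(rnorm(100, 0.2, 0.02), rnorm(100, 0.6, 0.02),
          rnorm(100, 0.6, 0.02), rnorm(100, 0.2, 0.02)),
  HUE = rnorm(400, 0, 0.001)
)

# Cluster each area on its own, then relabel so classes agree across areas.
res <- do.call(rbind, lapply(split(toy, toy$ID), function(d) {
  fit <- cluster::clara(d[, c("R", "HUE")], 2, metric = "manhattan")
  d$class <- relabel_by_medoid(fit, "R")
  d
}))
```

Each area is still clustered separately, so memory use stays bounded by the largest single area, and building the result with `lapply` plus one `rbind` avoids growing `res` inside a loop.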