I have the following DF and I want to create a dummy variable with an automated scale that represents, categorically, whether a city has little, medium, or a lot of companies.
| cities | sum of companies |
|---|---|
| CITY A | 199 |
| CITY B | 358 |
| CITY C | 250 |
| CITY D | 1265 |
| CITY E | 610 |
I tried the following code:
#install.packages("scales")
library(scales)
COMP_SCALES <- breaks_extended() # from the scales package
# df[[2]] extracts the column as a numeric vector; df[2] would be a one-column data frame
COMP_A <- COMP_SCALES(df[[2]], n = 4)
COMP_A <- cut(df[[2]],
              breaks = c(-Inf, COMP_A[2], COMP_A[3], COMP_A[4], Inf),
              labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF"))
However, the automatically calculated scale is not very suitable, since all the cities end up in the "LITTLE" range. How can I better automate this code?
The final purpose is to create a table to better visualize the result, with something like this:
COMP_A_CLUSTER <- as.data.frame.matrix(table(COMP_A,kmeans.k$cluster))
Expected outcome: City A should be placed in "Little", City B and City C in "Medium", City E in "a lot of", and City D in "+ a lot of".
I have a list of more than 10,000 cities and more than 100 columns that need a similar process, which is why I want the scale of the dummies to be calculated automatically.
You can write your own function if you know the end (right) boundary of each category. Below is a simple example; it adds a new column 'CatCities' to DF that holds the category you are seeking.
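The code assumes that the data is read from ./SomeDF.csv, that the counts are in a column named sum.of.companies, and that no value exceeds the last end point (10000).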
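DF <- read.csv("./SomeDF.csv")
# Return the first category whose end (right) boundary is >= x (boundaries are inclusive)
ClassifyRange <- function(x, CategoryList = c("Little", "Medium", "a lot of", "+a lot of"), EndPoints = c(250, 500, 1000, 10000)) {
  Index <- which((EndPoints - x) >= 0)
  return(CategoryList[Index[1]]) # NA if x exceeds every end point
}
# sapply returns a character vector; lapply would store a list in the column
DF$CatCities <- sapply(DF$sum.of.companies, FUN = ClassifyRange)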
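Assuming DF holds the five cities from the question, this puts CITY A and CITY C in the first category (250 equals the first end point, and the boundaries are inclusive), CITY B in the second, CITY E in the third, and CITY D in the fourth.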
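If the end points are not known in advance, one possible way to derive them automatically is to cut each column at its own quantiles. The following is only a minimal sketch of that idea, not the method used above: the helper name `classify_by_quantile`, the column name `companies`, and the `_cat` suffix are all hypothetical, and the sample data is copied from the question.

```r
# Sample data from the question (the column names are assumptions)
df <- data.frame(
  cities = c("CITY A", "CITY B", "CITY C", "CITY D", "CITY E"),
  companies = c(199, 358, 250, 1265, 610)
)

# Hypothetical helper: bin a numeric vector into length(labels) groups,
# using the vector's own quantiles as the break points
classify_by_quantile <- function(x, labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF")) {
  probs <- seq(0, 1, length.out = length(labels) + 1)
  breaks <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # cut() needs strictly increasing breaks; heavily tied data can violate this
  if (anyDuplicated(breaks) > 0) stop("quantile breaks are not unique; use fewer categories")
  # Open both ends so the minimum and maximum are always covered
  breaks[c(1, length(breaks))] <- c(-Inf, Inf)
  cut(x, breaks = breaks, labels = labels)
}

# Apply the helper to every numeric column and store the result next to it
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
df[paste0(num_cols, "_cat")] <- lapply(df[num_cols], classify_by_quantile)
```

With quantile breaks each category receives roughly the same number of cities, so the result follows the shape of the data rather than round numbers. This is a design choice, not a guarantee of the expected outcome in the question: values that sit exactly on a break, such as 250 here, fall into the lower category because `cut()` uses right-closed intervals by default.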