I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
df <- data.frame(attributes)
df
           attributes
1 apple-water-orange
2 apple-water
3 apple-orange
4 coffee
5 coffee-croissant
6 green-red-yellow
7 green-red-blue
8 green-red
9 black-white
10 black-white-purple
What I want is another column that assigns a category to each row, based on observation similarity.
category <- c(1,1,1,2,2,3,3,3,4,4)
df$category <- category
df
attributes category
1 apple-water-orange 1
2 apple-water 1
3 apple-orange 1
4 coffee 2
5 coffee-croissant 2
6 green-red-yellow 3
7 green-red-blue 3
8 green-red 3
9 black-white 4
10 black-white-purple 4
It is clustering in the broader sense, but I think most clustering methods are for numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).
Does anyone have an idea how to do this task? Maybe some word-matching approach?
It would also be great if I could adjust the degree of similarity (coarse vs. fine clustering) via a parameter.
Thanks in advance for any ideas!
So I have whipped up two possibilities. Option 1 uses one-hot encoding, which is simple and straightforward so long as apple/apples should count as just as different as apple/orange, for example. I use the Jaccard index for the distance metric because it does reasonably well with overlapping sets. Option 2 uses a local sequence alignment algorithm (Smith-Waterman) and should be quite robust against things like apple/apples vs. apple/orange; it also has more tuning parameters, which could take time to optimize for your problem.
library(reshape2)  # dcast() for reshaping long to wide
library(proxy)     # dist() with a Jaccard method
library(gplots)    # heatmap.2()
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")
# one row per observation, with ids as row names for merging later
dat <- data.frame(attr=attributes, row.names=paste("id", seq_along(attributes), sep=""))
# split each string into words and stack into a long id/word table
attributesList <- strsplit(attributes, "-")
df <- data.frame(id=paste("id", rep(seq_along(attributesList), sapply(attributesList, length)), sep=""),
                 word=unlist(attributesList))
# cast to a word x id incidence matrix (the one-hot encoding)
df.wide <- dcast(data=df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1]
df.wide <- as.matrix(df.wide[, -1])
# Jaccard distance between ids (proxy masks stats::dist)
df.dist <- dist(t(df.wide), method="jaccard")
plot(hclust(df.dist))
abline(h=c(0.6, 0.8))  # two candidate cut heights
heatmap.2(df.wide, trace="none", col=rev(heat.colors(15)))
# cut the tree at each height and attach the resulting categories
res <- merge(dat, data.frame(cat1=cutree(hclust(df.dist), h=0.8)), by="row.names")
res <- merge(res, data.frame(cat2=cutree(hclust(df.dist), h=0.6)), by.y="row.names", by.x="Row.names")
res
You'll see you can control the granularity of the categorization by adjusting where you cut the dendrogram.
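If you would rather fix the number of categories than pick a cut height, cutree also accepts k; a minimal sketch reusing the tree from above:

# same tree as above; ask directly for a number of clusters
hc <- hclust(df.dist)
cutree(hc, k=4)  # coarser partition
cutree(hc, k=6)  # finer partition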
For Option 2: Biostrings is part of the Bioconductor project. The Smith-Waterman (SW) algorithm finds the optimal local (non-end-to-end) alignment of two sequences (strings). Here you can again use cutree to set your categories, and you can also tune the scoring function (e.g. the substitutionMatrix, gapOpening, and gapExtension arguments of pairwiseAlignment) to suit your needs.
library(Biostrings)  # from Bioconductor
strList <- lapply(attributes, BString)
# pairwise local (Smith-Waterman) alignment scores for all pairs of ids;
# note these are similarities (higher = more alike), not distances
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  score(pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type="local"))
}), nrow = 10)
heatmap.2(swDist, trace="none", col=rev(heat.colors(15)),
          labRow=paste("id", 1:10, sep=""), labCol=paste("id", 1:10, sep=""))
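To actually get categories out of Option 2 with cutree, you first need a distance matrix; here is a minimal sketch, assuming that normalizing each pairwise score by the self-alignment scores is acceptable for your data:

# normalize by the self-alignment scores so the diagonal becomes 1,
# then flip similarity into a distance and cluster as in Option 1
selfScores <- diag(swDist)
swSim <- swDist / sqrt(outer(selfScores, selfScores))
swD <- as.dist(1 - swSim)
plot(hclust(swD))         # inspect the dendrogram before choosing a cut
cutree(hclust(swD), k=4)  # or cut at a height, as before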