I'm trying to build a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price for each item a user clicked on:
c165c2ee-81cf-48cf-ba3f-83b70204c00c 161785 124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0 643339 42.0
5e5fd670-b104-445b-a36d-b3798cd43279 131332 38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1 1032763 39.0
I would like to apply cluster analysis to that data.
If I try to apply k-means clustering to my data:
> q <- kmeans(dat, centers=25)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dat, centers = 25) : NAs introduced by coercion
If I try to apply hierarchical clustering to the data:
> m <- as.matrix(dat)
> d <- dist(m) # find distance matrix
Warning message:
In dist(m) : NAs introduced by coercion
The "NAs introduced by coercion" warning seems to appear because the first column is not a number. So I tried running the code against dat[-1], but the result is the same.
What am I missing or doing wrong?
Output of str, before and after calling factor on the first column:
> str(dat)
'data.frame': 14634 obs. of 3 variables:
$ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
$ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
$ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...
> dat[,1] = factor(dat[,1])
> str(dat)
'data.frame': 14634 obs. of 3 variables:
$ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
$ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
$ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...
> dd <- dist(dat)
Warning message:
In dist(dat) : NAs introduced by coercion
> hc <- hclust(dd) # apply hierarchical clustering
Error in hclust(dd) : NA/NaN/Inf in foreign function call (arg 11)
I would prefer not to remove the first column, since there can be multiple clicks by the same user, which I consider important for the analysis.
It sounds like you want to keep the first column (even though 10062 levels for 14634 observations is a lot). One way to convert a factor into numeric values is the model.matrix function, which expands it into 0/1 indicator columns. Here is the built-in iris data before converting its factor column:
data(iris)
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
After model.matrix:
head(model.matrix(~.+0, data=iris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1 5.1 3.5 1.4 0.2 1 0 0
# 2 4.9 3.0 1.4 0.2 1 0 0
# 3 4.7 3.2 1.3 0.2 1 0 0
# 4 4.6 3.1 1.5 0.2 1 0 0
# 5 5.0 3.6 1.4 0.2 1 0 0
# 6 5.4 3.9 1.7 0.4 1 0 0
As you can see, it expands the Species factor into one 0/1 column per level. So you could then run k-means clustering on the expanded version of your data:
kmeans(model.matrix(~.+0, data=iris), centers=3)
# K-means clustering with 3 clusters of sizes 49, 50, 51
#
# Cluster means:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1 6.622449 2.983673 5.573469 2.032653 0 0.0000000 1.00000000
# 2 5.006000 3.428000 1.462000 0.246000 1 0.0000000 0.00000000
# 3 5.915686 2.764706 4.264706 1.333333 0 0.9803922 0.01960784
# ...
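To connect this back to data shaped like yours, here is a minimal sketch rather than a definitive recipe. It assumes the column names V3 and V12 from your str output, uses a toy data frame built from your four sample rows plus one made-up repeat click (purely for illustration), and leaves out the item ID column V8 because it is an identifier rather than a quantity. Your str output shows the price column is also stored as a factor, so the sketch converts it to numeric first; whether that alone removes the warning on your full data is something I can't verify from here.

```r
# Toy data shaped like the question's data frame, all columns as factors.
# The first four rows come from the question; the fifth row is a made-up
# repeat click added only so that one session appears twice.
dat <- data.frame(
  V3  = c("c165c2ee-81cf-48cf-ba3f-83b70204c00c",
          "a886fdd5-7cee-4152-b1b7-77a2702687b0",
          "5e5fd670-b104-445b-a36d-b3798cd43279",
          "888d736f-99bc-49ca-969d-057e7d4bb8d1",
          "c165c2ee-81cf-48cf-ba3f-83b70204c00c"),
  V8  = c("161785", "643339", "131332", "1032763", "131332"),
  V12 = c("124.0", "42.0", "38.0", "39.0", "38.0"),
  stringsAsFactors = TRUE
)

# as.numeric() on a factor returns the internal level codes, so go through
# as.character() to recover the values that are actually printed.
dat$V12 <- as.numeric(as.character(dat$V12))

# Keep the session id as a factor and let model.matrix() expand it into
# 0/1 indicator columns next to the numeric price.
x <- model.matrix(~ V3 + V12 + 0, data = dat)

# Every column is now numeric, so kmeans() has nothing to coerce.
set.seed(1)  # kmeans starts from random centers
fit <- kmeans(x, centers = 2)
fit$cluster
```

Note that the price column is on a much larger scale than the 0/1 indicators, so it will dominate the Euclidean distances unless you rescale the columns (for example with scale()). With 10062 distinct session IDs the expanded matrix will also be very wide.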