r, cluster-analysis, hierarchical-clustering

Basic clustering with R


I'm trying to create a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price for each item a user clicked on.

c165c2ee-81cf-48cf-ba3f-83b70204c00c    161785  124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0    643339  42.0
5e5fd670-b104-445b-a36d-b3798cd43279    131332  38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1    1032763 39.0

I would like to apply cluster analysis to that data.

If I try to apply k-means clustering to my data:

> q <- kmeans(dat, centers=25)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dat, centers = 25) : NAs introduced by coercion

If I try to apply hierarchical clustering to the data:

> m <- as.matrix(dat)
> d <- dist(m)   # find distance matrix
Warning message:
In dist(m) : NAs introduced by coercion

The "NAs introduced by coercion" warning seems to occur because the first column is not numeric. So I tried running the code against dat[-1], but the result is the same.

What am I missing or doing wrong?

Output of str, before and after applying factor to the first column:

> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dat[,1] = factor(dat[,1])
> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dd <- dist(dat)
Warning message:
In dist(dat) : NAs introduced by coercion
> hc <- hclust(dd)                # apply hierarchical clustering
Error in hclust(dd) : NA/NaN/Inf in foreign function call (arg 11)

I would rather not remove the first column, since the same user may have multiple clicks, which I consider important for the analysis.


Solution

  • It sounds like you want to retain the first column (even though 10062 levels for 14634 observations is quite high). One way to convert a factor to numeric values is the model.matrix function, which replaces a factor with 0/1 indicator columns. Here is the iris data before converting its factor:

    data(iris)
    head(iris)
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
    # 3          4.7         3.2          1.3         0.2  setosa
    # 4          4.6         3.1          1.5         0.2  setosa
    # 5          5.0         3.6          1.4         0.2  setosa
    # 6          5.4         3.9          1.7         0.4  setosa
    

    After model.matrix:

    head(model.matrix(~.+0, data=iris))
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
    # 1          5.1         3.5          1.4         0.2             1                 0                0
    # 2          4.9         3.0          1.4         0.2             1                 0                0
    # 3          4.7         3.2          1.3         0.2             1                 0                0
    # 4          4.6         3.1          1.5         0.2             1                 0                0
    # 5          5.0         3.6          1.4         0.2             1                 0                0
    # 6          5.4         3.9          1.7         0.4             1                 0                0
    

    As you can see, it expands the factor into one indicator column per level. You could then run k-means clustering on the expanded version of your data:

    kmeans(model.matrix(~.+0, data=iris), centers=3)
    # K-means clustering with 3 clusters of sizes 49, 50, 51
    # 
    # Cluster means:
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
    # 1     6.622449    2.983673     5.573469    2.032653             0         0.0000000       1.00000000
    # 2     5.006000    3.428000     1.462000    0.246000             1         0.0000000       0.00000000
    # 3     5.915686    2.764706     4.264706    1.333333             0         0.9803922       0.01960784
    # ...
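
    The same idea could be applied to data shaped like yours, including the hierarchical clustering you tried. This is only a rough sketch: the toy values below are made up, and it assumes the price column (V12) should be treated as a number while the user and item ids (V3, V8) stay categorical. Since str(dat) shows all three columns are factors, the price is converted via as.character first.

    ```r
    # Toy stand-in for the question's data: all three columns start as factors,
    # as in the str(dat) output. The values are invented for illustration.
    dat <- data.frame(
      V3  = factor(c("u1", "u2", "u1", "u3", "u2", "u3")),
      V8  = factor(c("161785", "643339", "131332", "1032763", "161785", "131332")),
      V12 = factor(c("124.0", "42.0", "38.0", "39.0", "124.0", "38.0"))
    )

    # as.numeric() on a factor returns the internal level codes, so go through
    # as.character() to recover the actual price.
    dat$V12 <- as.numeric(as.character(dat$V12))
    stopifnot(!anyNA(dat$V12))  # an NA here would point at a non-numeric entry

    # Expand the remaining factors (user and item ids) into 0/1 indicator columns.
    m <- model.matrix(~ . + 0, data = dat)

    # Every column is now numeric, so dist() and hclust() run without coercion.
    d      <- dist(m)
    hc     <- hclust(d)
    groups <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for the toy data
    ```

    On the full data this expansion would be very wide (roughly one column per distinct user and item), so memory may become a concern. The unscaled price is also likely to dominate the distances, so passing the matrix through scale() first may be worth trying.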