r, machine-learning, k-means

K-Means clustering in R error


I have a dataset that I have created in R. It is structured as follows:

> head(btc_data)
           Date btc_close eth_close vix_close gold_close DEXCHUS change
1647 2010-07-18      0.09        NA        NA         NA      NA      0
1648 2010-07-19      0.08        NA     25.97    115.730      NA     -1
1649 2010-07-20      0.07        NA     23.93    116.650      NA     -1
1650 2010-07-21      0.08        NA     25.64    115.850      NA      1
1651 2010-07-22      0.05        NA     24.63    116.863      NA     -1
1652 2010-07-23      0.06        NA     23.47    116.090      NA      1

I am trying to cluster the observations using k-means. However, I get the following error message:

> km <- kmeans(trainingDS, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion 

What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it? I can't drop the NAs because, out of 4500 initial observations, I am left with only 100 if I keep just the complete cases.

Essentially, I am hoping that 3 clusters will form based on the change column, which has values of -1, 0 and 1. I then wish to analyze the components of each cluster to find the strongest predictors of change. Which other algorithms would be most useful for doing this?

I also tried to remove all the NA values using the following code, but I still get the same error message:

> complete_cases <- btc_data[complete.cases(btc_data), ]
> km <- kmeans(complete_cases, 3, nstart = 20)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion

> sum(!sapply(btc_data, is.finite)) 
[1] 8008
> sum(sapply(btc_data, is.nan))
[1] 0
> 
> sum(!sapply(complete_cases, is.finite)) 
[1] 0
> sum(sapply(complete_cases, is.nan))
[1] 0

Here is the format of the data:

> sapply(btc_data, class)
      Date  btc_close  eth_close  vix_close gold_close    DEXCHUS     change 
    "Date"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"   "factor" 

Solution

  • There are several possible reasons for this error message, most commonly invalid values (NA, NaN, Inf) or a date column in the data. Let's go through them:

    But first, let's check that kmeans works on the built-in mtcars dataset, since I will be using it for the examples:

    kmeans(mtcars, 3)
    K-means clustering with 3 clusters of sizes 9, 7, 16
    --- lengthy output omitted
    

    Likely problem I: invalid values (NA/NaN/Inf)

    df <- mtcars
    df[1,1] <- NA
    kmeans(df, 3)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    
    df[1,1] <- Inf
    kmeans(df, 3)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    
    df[1,1] <- NaN
    kmeans(df, 3)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    

    You can check for these values using the following:

    df[1:3,1] <- c(NA, Inf, NaN) # one NA, one Inf, one NaN
    sum(sapply(df, is.na)) # is.na() is TRUE for both NA and NaN, hence 2
    [1] 2
    sum(sapply(df, is.infinite))
    [1] 1
    sum(sapply(df, is.nan))
    [1] 1
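
The three sums above say how many bad values there are, but not where they sit. A small sketch of a per-column breakdown, reusing the same planted values in mtcars (the name bad_per_column is just mine for illustration):

```r
# Copy of the built-in mtcars data with three bad values planted in column mpg
df <- mtcars
df[1:3, "mpg"] <- c(NA, Inf, NaN)

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so one pass covers all of them
bad_per_column <- colSums(!sapply(df, is.finite))

# Show only the columns that actually contain bad values
bad_per_column[bad_per_column > 0]
```

Run on your own data frame, this should show quickly whether the missing values are spread evenly or concentrated in one or two columns.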
    

    To get rid of these, we can remove the corresponding observations. But note that complete.cases does not remove Inf:

    complete_df <- df[complete.cases(df),]
    sum(sapply(complete_df, is.infinite))
    [1] 1
    

    Instead, keep only the rows in which every value is finite, e.g.

    df[apply(sapply(df, is.finite), 1, all),]
    

    You can also reassign these values or impute them, but this is a whole different procedure.
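
For completeness, here is a minimal sketch of one very simple form of imputation: replacing every non-finite entry with the median of the finite values in the same column. The helper impute_median is hypothetical (not part of base R), and median imputation is only one of many options. It can distort the clusters badly when a column is mostly missing, so treat it as an illustration rather than a recommendation.

```r
set.seed(1)  # kmeans() starts from random centers

df <- mtcars
df[1:3, "mpg"] <- c(NA, Inf, NaN)

# Hypothetical helper: replace non-finite entries with the median of the finite ones
impute_median <- function(x) {
  ok <- is.finite(x)
  x[!ok] <- median(x[ok])
  x
}

# Apply the helper column by column, keeping the data frame structure
imputed <- df
imputed[] <- lapply(imputed, impute_median)

# kmeans() now runs, because no NA/NaN/Inf is left
km <- kmeans(imputed, centers = 3)
table(km$cluster)
```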

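    Likely problem II: dates. This is most likely what still breaks your call after complete.cases, since btc_data has a Date column. See the following: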

    library(lubridate)
    df <- mtcars
    df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
    kmeans(df, 3)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    In addition: Warning message:
    In kmeans(df, 3) : NAs introduced by coercion
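
Why a date column triggers this: as far as I can tell, kmeans converts its input with as.matrix and then sets the storage mode to double (that is the storage.mode call named in the warning). A matrix can hold only one type, so a data frame that mixes numbers with a Date column becomes a character matrix, and strings such as "1990-01-01" cannot be parsed as numbers. A short sketch that reproduces this step by hand, using base R only:

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))

# Mixed column types collapse to a single type when converted to a matrix
m <- as.matrix(df)
is.character(m)   # TRUE: every cell is now a string

# Date strings are not valid numbers, so the coercion produces NA
date_as_number <- suppressWarnings(as.numeric(m[1:3, "date"]))
date_as_number    # NA NA NA
```

So the NA values that kmeans complains about do not have to be in your data at all; they can be created by the coercion.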
    

    You can get around this problem by excluding the dates or by converting the dates to something else, e.g.

    df$newdate <- seq_along(df$date)
    df$date <- NULL
    kmeans(df, 3)
    K-means clustering with 3 clusters of sizes 9, 7, 16
    ---- lengthy output omitted
    

    Or you can coerce the dates to numeric yourself before you pass them to kmeans:

    df <- mtcars
    df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
    df$date <- as.numeric(df$date)
    kmeans(df, 3)
    K-means clustering with 3 clusters of sizes 9, 16, 7
    --- lengthy output omitted
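
Putting both fixes together for data shaped like yours, a minimal sketch could look as follows. The data frame toy is a made-up stand-in that borrows some of the column names and classes of btc_data; its values are random. Standardising with scale is my own addition (k-means uses Euclidean distance, so columns with large values otherwise dominate), not something your error requires.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for btc_data: similar columns and classes, made-up values
toy <- data.frame(
  Date       = seq(as.Date("2016-01-01"), by = "day", length.out = n),
  btc_close  = runif(n, 400, 1000),
  vix_close  = runif(n, 10, 30),
  gold_close = runif(n, 110, 130),
  change     = factor(sample(c(-1, 0, 1), n, replace = TRUE))
)

# Keep only the numeric columns; this drops the Date and the factor column
num <- toy[sapply(toy, is.numeric)]

# Keep only the rows in which every remaining value is finite
keep <- apply(sapply(num, is.finite), 1, all)
num <- num[keep, , drop = FALSE]

# Standardise the columns, then cluster
km <- kmeans(scale(num), centers = 3, nstart = 20)

# Cross-tabulate the clusters against the known labels
table(cluster = km$cluster, change = toy$change[keep])
```

Keep in mind that k-means never sees the change column here, so there is no guarantee that the three clusters line up with -1, 0 and 1; the cross-table is simply a way to check.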