I have a dataset that I have created in R. It is structured as follows:
> head(btc_data)
Date btc_close eth_close vix_close gold_close DEXCHUS change
1647 2010-07-18 0.09 NA NA NA NA 0
1648 2010-07-19 0.08 NA 25.97 115.730 NA -1
1649 2010-07-20 0.07 NA 23.93 116.650 NA -1
1650 2010-07-21 0.08 NA 25.64 115.850 NA 1
1651 2010-07-22 0.05 NA 24.63 116.863 NA -1
1652 2010-07-23 0.06 NA 23.47 116.090 NA 1
I am trying to cluster the observations using k-means. However, I get the following error message:
> km <- kmeans(trainingDS, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it? I can't drop the NAs because, out of 4500 initial observations, running complete.cases leaves me with only 100 observations.
Essentially I am hoping that 3 clusters will form based on the change column, which has values of -1, 0, 1. I then wish to analyze the components of each cluster to find the strongest predictors for change. What other algorithms would be most useful for doing this?
I also tried to remove all the NA values using the following code, but I still get the same error message:
> complete_cases <- btc_data[complete.cases(btc_data), ]
> km <- kmeans(complete_cases, 3, nstart = 20)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
> sum(!sapply(btc_data, is.finite))
[1] 8008
> sum(sapply(btc_data, is.nan))
[1] 0
>
> sum(!sapply(complete_cases, is.finite))
[1] 0
> sum(sapply(complete_cases, is.nan))
[1] 0
Here is the format of the data:
> sapply(btc_data, class)
Date btc_close eth_close vix_close gold_close DEXCHUS change
"Date" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
There are several possible causes of this error message, most commonly missing or non-finite values (NA, NaN, Inf) or date columns. Let's go through them.
But first, let's check that kmeans works with the mtcars dataset, since I will be using it:
kmeans(mtcars, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted
Likely problem 1: invalid values (NA/NaN/Inf)
df <- mtcars
df[1,1] <- NA
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
df[1,1] <- Inf
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
df[1,1] <- NaN
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
You can check for these values using the following:
df[1:3,1] <- c(NA, Inf, NaN) # one NA, one Inf, one NaN
sum(sapply(df, is.na))
[1] 2
sum(sapply(df, is.infinite))
[1] 1
sum(sapply(df, is.nan))
[1] 1
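When most rows are lost to complete.cases, it can help to see which columns carry the bad values. The following is a minimal sketch on the same modified mtcars data; the name bad_counts is only illustrative:

```r
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)  # one NA, one Inf, one NaN in the mpg column

# Count problem values per column; note that is.na() is TRUE for both NA and NaN
bad_counts <- data.frame(
  na_or_nan = colSums(sapply(df, is.na)),
  infinite  = colSums(sapply(df, is.infinite)),
  nan       = colSums(sapply(df, is.nan)),
  row.names = names(df)
)

# Show only the columns that contain at least one problem value
bad_counts[rowSums(bad_counts) > 0, ]
```

If a single column accounts for most of the missing values, dropping that column may preserve far more observations than dropping rows.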
To get rid of these, we can remove the corresponding observations. But note that complete.cases does not remove Inf:
complete_df <- df[complete.cases(df),]
sum(sapply(complete_df, is.infinite))
[1] 1
Instead, keep only the rows in which every value is finite, e.g.
df[apply(sapply(df, is.finite), 1, all),]
You can also reassign these values or impute them, but this is a whole different procedure.
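As a rough illustration of the imputation route, here is a sketch that replaces every non-finite value with the median of the finite values in the same column. The helper impute_median is hypothetical, and median imputation is only one simple choice; it may well be a poor fit for time-ordered data such as prices.

```r
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)

# Hypothetical helper: replace NA/NaN/Inf with the median of the finite values
impute_median <- function(x) {
  ok <- is.finite(x)
  x[!ok] <- median(x[ok])
  x
}

df_imputed <- df
df_imputed[] <- lapply(df_imputed, impute_median)  # keeps row and column names

km <- kmeans(df_imputed, 3)
```

All 32 rows are retained here, at the cost of three made-up mpg values.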
Likely problem 2: dates. See the following:
library(lubridate)
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(df, 3) : NAs introduced by coercion
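The coercion warning hints at the mechanism. As far as I can tell, kmeans converts its input with as.matrix, and a data frame with a Date column becomes a character matrix, whose date strings then turn into NA when coerced to double. A small sketch that reproduces this outside of kmeans, using base R only:

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))

# Mixed column classes: as.matrix() falls back to a character matrix
m <- as.matrix(df)
typeof(m)

# A date string cannot be parsed as a number, so coercion yields NA
first_date_as_number <- suppressWarnings(as.numeric(m[1, "date"]))
is.na(first_date_as_number)
```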
You can get around this problem by excluding the dates or by converting the dates to something else, e.g.
df$newdate <- seq_along(df$date)
df$date <- NULL
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted
Or you can coerce the dates to numeric yourself before you pass them to kmeans:
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
df$date <- as.numeric(df$date)
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 16, 7
--- lengthy output omitted
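Putting both points together for a data frame shaped like the one in the question, a sketch might look as follows. The toy data frame and its values are invented for illustration (only the column names follow the question), and the choice to cluster on the numeric columns alone is an assumption, not the only option.

```r
set.seed(1)
n <- 60

# Toy stand-in for btc_data: a Date column, numeric columns, a factor column
toy <- data.frame(
  Date      = seq(as.Date("2010-07-18"), by = "day", length.out = n),
  btc_close = runif(n),
  vix_close = rnorm(n, mean = 25, sd = 2),
  change    = factor(sample(c(-1, 0, 1), n, replace = TRUE))
)
toy$vix_close[1] <- NA  # mimic a missing value

# Keep only numeric columns (this drops Date and the factor), then finite rows
num  <- toy[, sapply(toy, is.numeric), drop = FALSE]
keep <- apply(sapply(num, is.finite), 1, all)

km <- kmeans(num[keep, ], centers = 3, nstart = 20)

# Compare cluster membership with the change column for the retained rows
cluster_vs_change <- table(cluster = km$cluster, change = toy$change[keep])
cluster_vs_change
```

Leaving change out of the clustering input and cross-tabulating afterwards makes it possible to see whether the clusters line up with -1, 0 and 1 at all.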