r r-caret dummy-variable

dummyVars function returns double values instead of 0/1 dummy variables


The following code attempts to use the dummyVars function from the caret package.

This is R Markdown (.Rmd) code and uses the diamonds dataset from the ggplot2 package, so it is fully reproducible.

```{r}
#rm(list = ls())
```

```{r}
library(ggplot2)
```

```{r}
data("diamonds")
```

```{r}
data <- diamonds
summary(data)
str(data)
```

```{r}
library(caret)
```

```{r}
dmy <- dummyVars(formula = ~ cut + color + clarity, 
                 data = data, 
                 fullRank = FALSE)

b.vars <- data.frame(predict(dmy, newdata = data))

head(b.vars, n = 10)
```

b.vars should be a data frame of dummy variables (0s and 1s), but it contains double values such as 0.6324555.

Also, the column names in b.vars are not what I expect. For example, there is "cut.L" instead of "cut.Fair".

This is the same process I've used in the past, and I don't understand what I'm doing wrong.

Could someone please point out my error?

Thanks!


Solution

  • library(ggplot2)
    library(caret)
    data("diamonds")
    data <- diamonds
    data
    summary(data)
    str(data)
    
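    # Drop the "ordered" class: ordered factors are encoded with polynomial
    # contrasts (.L, .Q, .C, ...) rather than 0/1 indicator columns
    data$cut <- as.factor(as.character(data$cut))
    data$clarity <- as.factor(as.character(data$clarity))
    data$color <- as.factor(as.character(data$color))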
    
    
    sapply(data, class)
    
    
    dmy <- dummyVars(formula = ~ cut + color + clarity, 
                     data = data, 
                     fullRank = TRUE)
    b.vars <- data.frame(predict(dmy, newdata = data))
    head(b.vars, n = 10)
    
       cut.Good cut.Ideal cut.Premium cut.Very.Good color.E color.F color.G color.H color.I color.J clarity.IF clarity.SI1 clarity.SI2 clarity.VS1 clarity.VS2 clarity.VVS1
    1         0         1           0             0       1       0       0       0       0       0          0           0           1           0           0            0
    2         0         0           1             0       1       0       0       0       0       0          0           1           0           0           0            0
    3         1         0           0             0       1       0       0       0       0       0          0           0           0           1           0            0
    4         0         0           1             0       0       0       0       0       1       0          0           0           0           0           1            0
    5         1         0           0             0       0       0       0       0       0       1          0           0           1           0           0            0
    6         0         0           0             1       0       0       0       0       0       1          0           0           0           0           0            0
    7         0         0           0             1       0       0       0       0       1       0          0           0           0           0           0            1
    8         0         0           0             1       0       0       0       1       0       0          0           1           0           0           0            0
    9         0         0           0             0       1       0       0       0       0       0          0           0           0           0           1            0
    10        0         0           0             1       0       0       0       1       0       0          0           0           0           1           0            0
       clarity.VVS2
    1             0
    2             0
    3             0
    4             0
    5             0
    6             1
    7             0
    8             0
    9             0
    10            0
    

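    Get rid of the "ordered" class of your variables. In diamonds, cut, color and clarity are ordered factors, and ordered factors are encoded with polynomial contrasts, which is where column names like cut.L and values like 0.6324555 come from. You can drop the ordering by converting each variable to character and back to factor on the fly. Note that the code above uses fullRank = TRUE, which drops the first level of each factor (for example cut.Fair); keep fullRank = FALSE if you want a column for every level.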
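
For illustration, here is a minimal base-R sketch of the same effect. It does not use caret or diamonds: `toy` and `cut_levels` are hypothetical stand-ins for diamonds$cut, and `model.matrix` is used only to show how ordered and unordered factors are encoded. It also uses `factor(x, ordered = FALSE)` as an alternative way to drop the ordering; unlike the character round trip, this keeps the original level order instead of sorting the levels alphabetically.

```r
# Hypothetical toy data standing in for diamonds$cut (an ordered factor)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy <- data.frame(
  cut = factor(c("Ideal", "Premium", "Good", "Fair", "Very Good"),
               levels = cut_levels, ordered = TRUE)
)

class(toy$cut)  # "ordered" "factor"

# Ordered factors default to polynomial contrasts, which yields
# columns such as cut.L and cut.Q with non-integer values
poly_mm <- model.matrix(~ cut, data = toy)
colnames(poly_mm)
poly_mm[1, "cut.L"]  # linear term for "Ideal", equal to 2 / sqrt(10)

# Drop the ordering but keep the existing level order
toy$cut <- factor(toy$cut, ordered = FALSE)

# Without an intercept, each level gets its own 0/1 indicator column
dummy_mm <- model.matrix(~ cut - 1, data = toy)
dummy_mm
```

Assuming the same conversion is applied to cut, color and clarity in the real data, the original dummyVars call should then produce 0/1 columns named after the factor levels.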