r dplyr discretization

Discretizing a continuous variable while keeping zeros as a separate category


I want to discretize a data frame column that contains a continuous variable.

The data looks like this:

c(0,25,77,423,6,8,3,65,32,22,10,0,8,0,15,0,10,1,2,4,5,5,6)

I want to turn the numbers into categories by discretizing, but zeros should form a category of their own. Discretizing directly can put zero in the same bin as other numbers.

I thought that if I set the zeros aside and discretized the rest, I would get what I want. But I can't do that in a data frame column, because the row indexes no longer match once the zeros are removed.

Here is an example dput() output:

structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65, 
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA, 
-23L))

For example, if I use 2 breaks, the categories should be zero plus the 3 discretized bins, for 4 categories in total. Ideally I could write a function that discretizes a column and can be called directly inside dplyr::mutate().

Thanks in advance.


Solution

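  • If I understood correctly, your goal is to keep "0" as a separate category when discretizing. Here's a solution that wraps arules::discretize in a new function to accomplish this. Note that with the default method ("frequency"), the breaks argument of arules::discretize is the number of bins, so breaks = 3 gives three bins plus the zero category: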

    library(arules)
    #> Loading required package: Matrix
    #> 
    #> Attaching package: 'arules'
    #> The following objects are masked from 'package:base':
    #> 
    #>     abbreviate, write
    library(tidyverse)
    
    df <- structure(
        list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65, 32, 22, 10, 0,
                              8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)),
        class = "data.frame",
        row.names = c(NA, -23L)
    )
    
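    discretize_keep <- function(vec, keep, ...) {
        # Mask the value to keep so it is excluded when the bins are computed
        vec2 <- vec
        vec2[vec2 == keep] <- NA
        # Remaining arguments (e.g. breaks, method) go to arules::discretize
        dsc <- arules::discretize(vec2, ...)
        # Turn the masked NAs into their own level, e.g. "[0]". Values that were
        # already NA in vec end up in this level too. In forcats >= 1.0.0,
        # fct_explicit_na() is deprecated; see fct_na_value_to_level().
        fct_explicit_na(dsc, na_level = str_glue("[{keep}]"))
    }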
    
    df %>%
        mutate(discrete_column = discretize_keep(dummy_column, keep = 0, breaks = 3))
    #>    dummy_column discrete_column
    #> 1             0             [0]
    #> 2            25        [15,423]
    #> 3            77        [15,423]
    #> 4           423        [15,423]
    #> 5             6          [6,15)
    #> 6             8          [6,15)
    #> 7             3           [1,6)
    #> 8            65        [15,423]
    #> 9            32        [15,423]
    #> 10           22        [15,423]
    #> 11           10          [6,15)
    #> 12            0             [0]
    #> 13            8          [6,15)
    #> 14            0             [0]
    #> 15           15        [15,423]
    #> 16            0             [0]
    #> 17           10          [6,15)
    #> 18            1           [1,6)
    #> 19            2           [1,6)
    #> 20            4           [1,6)
    #> 21            5           [1,6)
    #> 22            5           [1,6)
    #> 23            6          [6,15)
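
    If you would rather not depend on arules, the same idea can be sketched with base R's quantile() and cut(). This is only a rough equivalent of the equal-frequency default above, not what arules does internally; the function name discretize_keep_base and its n_bins argument are made up for illustration.

```r
# Hypothetical helper (not part of any package): equal-frequency binning
# with base R only, keeping one value (default 0) as its own level.
discretize_keep_base <- function(vec, keep = 0, n_bins = 3) {
    # Positions holding the value that gets its own category
    is_keep <- !is.na(vec) & vec == keep
    rest <- vec[!is_keep]

    # Cut points are quantiles of the remaining values only.
    # unique() drops duplicated cut points, which cut() would reject.
    brks <- unique(quantile(rest,
                            probs = seq(0, 1, length.out = n_bins + 1),
                            na.rm = TRUE, names = FALSE))

    # Bin the full vector so the result keeps the original length;
    # values outside the cut points (such as the kept zeros) become NA here.
    binned <- cut(vec, breaks = brks, include.lowest = TRUE, right = FALSE)

    # Overwrite the kept positions with their own label
    keep_label <- paste0("[", keep, "]")
    out <- as.character(binned)
    out[is_keep] <- keep_label
    factor(out, levels = c(keep_label, levels(binned)))
}

x <- c(0, 25, 77, 423, 6, 8, 3, 65, 32, 22, 10, 0,
       8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)
res <- discretize_keep_base(x, keep = 0, n_bins = 3)
table(res)
```

    Because the result has the same length as the input, it can be used inside mutate() just like discretize_keep. Unlike the version above, values that are already NA stay NA instead of being folded into the kept category, and the sketch does not handle the edge case where all remaining values are identical.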