rfunctiondatetapply

R- tapply doesn't keep dates formated


I need to do a fast aggregation by id_client of dates: min, max, difference of dates in months and quantity of months.

Example table:

tbl<-data.frame(id_cliente=c(1,1,1,1,2,3,3,3), 
fecha=c('2013-01-01', '2013-06-01','2013-05-01', '2013-04-01', '2013-01-01', '2013-01-01','2013-05-01','2013-04-01'))

Format dates:

tbl$fecha<-as.Date(as.character(tbl$fecha))

My first approach was ddply:

tbl2<-ddply(tbl, .(id_cliente), summarize, cant=length(id_cliente), 
max=max(fecha), min=min(fecha),
dif=length(seq(from=min, to=max, by='month')))

I got the desired result, but with my real table takes too much time. So I tried tapply:

tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), secuencia),
        hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
        min=tapply(tbl$fecha, list(tbl$id_cliente), min),
        max=tapply(tbl$fecha, list(tbl$id_cliente), max)
        ))

The result was:

> tbl3
  dif hay   min   max
   6   4 15706 15857
   1   1 15706 15706
   5   3 15706 15826

In this case I got instead of dates, numbers. So since the following works, I tried using as.Date inside tapply:

as.Date(15706, origin='1970-01-01')

MIN<-function(x){as.Date(min(x), origin='1970-01-01')}

The function works but with tapply doesn't.

tbl3<-data.frame(cbind(min=tapply(tbl$fecha, list(tbl$id_cliente), MIN)))

And I still got the number instead of date. How can I solve this? Thanks.


Solution

  • With base R, the ?Date class is converted to the number of days from Jan. 1, 1970. Try using dplyr or data.table to preserve the date class:

    dplyr

    library(dplyr)
    tbl %>% group_by(id_cliente) %>%
            summarise(dif=length(seq(min(fecha), max(fecha), by='month')),
                      hay=length(fecha),
                      min=min(fecha),
                      max=max(fecha))
    # Source: local data frame [3 x 5]
    # 
    #   id_cliente dif hay        min        max
    # 1          1   6   4 2013-01-01 2013-06-01
    # 2          2   1   1 2013-01-01 2013-01-01
    # 3          3   5   3 2013-01-01 2013-05-01
    

    data.table

    library(data.table)
    setDT(tbl)[,.(dif=length(seq(min(fecha), max(fecha), by='month')),
                  hay= .N,
                  min=min(fecha),
                  max=max(fecha)), by=id_cliente]
    #    id_cliente dif hay        min        max
    # 1:          1   6   4 2013-01-01 2013-06-01
    # 2:          2   1   1 2013-01-01 2013-01-01
    # 3:          3   5   3 2013-01-01 2013-05-01