rdatabasesplitretentionchurn

Calculate retention rate split by year


Calculate retention rate / churn by year split

Dear Community, I am working on a data mining project where I would like to transform prior thinking from excel into R.

I have a customer database with contracts data and would like to calculate the retention rate. I was playing around with these library(lubridate); library(reshape2); library(plyr) but I couldn't figure it out how it works in R.

I have data like this:

ID    Customer        START          END
 1       Tesco   01-01-2000   31-12-2000
 2       Apple   05-11-2001   06-02-2002
 3         H&M   01-02-2002   08-05-2002
 4        Tesco  01-01-2001   31-12-2001
 5       Apple   01-01-2003   31-12-2004

I now was thinking of splitting the data into the Years (df2000, df2001) and then look it up again if the customer name exists in the main table (if yes return 1).

A result could look like this:

Customer     2000    2001    2002  2003   Retention Rate
Tesco         1        1      0     0          0.5
Apple         0        1      0     1
H&M           0        0      1     0

Solution

  • Using dplyr, you can try to get year value from each START date, count number of entries for each Customer and year, calculate the retention rate and spread the data to wide format.

    library(dplyr)
    df %>%
      mutate(year = format(as.Date(START, format = "%d-%m-%Y"), "%Y")) %>%
      dplyr::count(Customer, year) %>%
      group_by(Customer) %>%
      mutate(ret = n()/n_distinct(.$year))  %>%
      tidyr::spread(year, n, fill = 0) 
    
    #  Customer   ret  `2000` `2001` `2002` `2003`
    #  <fct>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
    #1 Apple     0.5       0      1      0      1
    #2 H&M       0.25      0      0      1      0
    #3 Tesco     0.5       1      1      0      0
    

    EDIT

    To consider data with Fiscal year instead from Oct-Sep we can do

    library(lubridate)
    
    df %>%
      mutate(START = dmy(START), 
             START = if_else(month(START) >= 10, START + years(1), START),
             year = year(START)) %>%
      dplyr::count(Customer, year) %>%
      group_by(Customer) %>%
      mutate(ret = n()/n_distinct(.$year))  %>%
      tidyr::spread(year, n, fill = 0) 
    

    data

    df <- structure(list(ID = 1:5, Customer = structure(c(3L, 1L, 2L, 3L, 
    1L), .Label = c("Apple", "H&M", "Tesco"), class = "factor"), 
    START = structure(c(1L, 5L, 4L, 2L, 3L), .Label = c("01-01-2000", 
    "01-01-2001", "01-01-2003", "01-02-2002", "05-11-2001"), class = "factor"), 
    END = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("06-02-2002", 
    "08-05-2002", "31-12-2000", "31-12-2001", "31-12-2004"), class = "factor")), 
    class = "data.frame", row.names = c(NA, -5L))