Calculate retention rate / churn by year split
Dear Community, I am working on a data mining project where I would like to transform prior thinking from excel into R.
I have a customer database with contracts data and would like to calculate the retention rate.
I was playing around with these library(lubridate)
; library(reshape2)
; library(plyr)
but I couldn't figure it out how it works in R.
I have data like this:
ID Customer START END
1 Tesco 01-01-2000 31-12-2000
2 Apple 05-11-2001 06-02-2002
3 H&M 01-02-2002 08-05-2002
4 Tesco 01-01-2001 31-12-2001
5 Apple 01-01-2003 31-12-2004
I now was thinking of splitting the data into the Years (df2000, df2001) and then look it up again if the customer name exists in the main table (if yes return 1).
A result could look like this:
Customer 2000 2001 2002 2003 Retention Rate
Tesco 1 1 0 0 0.5
Apple 0 1 0 1
H&M 0 0 1 0
Using dplyr
, you can try to get year
value from each START
date, count
number of entries for each Customer
and year
, calculate the retention rate and spread
the data to wide format.
library(dplyr)
df %>%
mutate(year = format(as.Date(START, format = "%d-%m-%Y"), "%Y")) %>%
dplyr::count(Customer, year) %>%
group_by(Customer) %>%
mutate(ret = n()/n_distinct(.$year)) %>%
tidyr::spread(year, n, fill = 0)
# Customer ret `2000` `2001` `2002` `2003`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Apple 0.5 0 1 0 1
#2 H&M 0.25 0 0 1 0
#3 Tesco 0.5 1 1 0 0
EDIT
To consider data with Fiscal year instead from Oct-Sep we can do
library(lubridate)
df %>%
mutate(START = dmy(START),
START = if_else(month(START) >= 10, START + years(1), START),
year = year(START)) %>%
dplyr::count(Customer, year) %>%
group_by(Customer) %>%
mutate(ret = n()/n_distinct(.$year)) %>%
tidyr::spread(year, n, fill = 0)
data
df <- structure(list(ID = 1:5, Customer = structure(c(3L, 1L, 2L, 3L,
1L), .Label = c("Apple", "H&M", "Tesco"), class = "factor"),
START = structure(c(1L, 5L, 4L, 2L, 3L), .Label = c("01-01-2000",
"01-01-2001", "01-01-2003", "01-02-2002", "05-11-2001"), class = "factor"),
END = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("06-02-2002",
"08-05-2002", "31-12-2000", "31-12-2001", "31-12-2004"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))