I have a dataset of people where I know their age, address, and past behaviour. I know the date on which they took a certain action, and I'm trying to create a logistic regression model to identify who was most likely to undertake the action on any given day.
What I can do is create a data frame for each day, as below. Note: sex remains the same over time; events is a running total of some past behaviour that I think makes the outcome more likely; address may change; outcome is what is being predicted ("did this person do the thing I want to predict on this day, or not?"). The example below shows what the data may look like on two different days.
I can create a logistic regression model for each day and add in the predicted probabilities, as below.
However, my data has thousands of people and events over several years. What I ultimately want to do is top-slice the top 1%, 2%, and 3% by probability each day and compare the predicted individuals to another predictive model.
The issue is that I can't create a data frame and a separate model for each day by hand; there are thousands of days. Is there some way of automating the production of a logistic regression model for each day?
day_one_data = data.frame(sex=c(1, 0, 1, 1, 1, 0, 1, 0),
events=c(0, 1, 0, 4, 2, 0, 1, 0),
age=c(21, 18, 40, 18, 19, 35, 22, 39),
address=c(4, 1, 2, 3, 3, 1, 3, 4),
outcome=c(0, 0, 0, 1, 1, 1, 0, 0)
)
model <- glm(outcome~sex+events+age+address, family = "binomial", data=day_one_data)
day_one_data$outcome_prob <- predict(model, day_one_data, type="response")
day_two_data = data.frame(sex=c(1, 0, 1, 1, 1, 0, 1, 0),
events=c(1, 2, 1, 6, 2, 0, 1, 0),
age=c(22, 19, 40, 18, 20, 36, 22, 40),
address=c(4, 1, 2, 3, 3, 1, 3, 3),
outcome=c(1, 1, 0, 1, 1, 1, 0, 0)
)
model_two <- glm(outcome~sex+events+age+address, family = "binomial", data=day_two_data)
day_two_data$outcome_prob <- predict(model_two, day_two_data, type="response")
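For a single day, the top-slicing step the question describes could be sketched like this (a minimal illustration; it repeats the `day_one_data` setup from above so it is self-contained, and uses the 99th percentile of predicted probability as the 1% cut-off):

```r
## same frame and model as in the question
day_one_data <- data.frame(sex = c(1, 0, 1, 1, 1, 0, 1, 0),
                           events = c(0, 1, 0, 4, 2, 0, 1, 0),
                           age = c(21, 18, 40, 18, 19, 35, 22, 39),
                           address = c(4, 1, 2, 3, 3, 1, 3, 4),
                           outcome = c(0, 0, 0, 1, 1, 1, 0, 0))
model <- glm(outcome ~ sex + events + age + address,
             family = "binomial", data = day_one_data)
day_one_data$outcome_prob <- predict(model, day_one_data, type = "response")

## keep everyone at or above the 99th percentile of predicted probability;
## with only 8 rows this keeps just the single highest probability
threshold <- quantile(day_one_data$outcome_prob, probs = 0.99)
top_slice <- day_one_data[day_one_data$outcome_prob >= threshold, ]
```

The same pattern with `probs = 0.98` or `0.97` gives the 2% and 3% slices.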
Following the accepted answer, I have added an example of the structure of my data to make what I am trying to achieve clearer.
#example
full_data <- data.frame(
  ID = c(1, 2, 3, 4, 2, 1, 5, 6, 6, 4), # unique ID number - in the real data there are 30K+ individuals
  sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Male"), # gender, which is static
  event_date = c("2019-09-27", "2019-10-01", "2019-11-24", "2019-12-09", "2020-01-01",
                 "2020-02-01", "2020-03-01", "2020-04-10", "2020-05-12", "2020-06-12"),
  # date of the event - this is the outcome; I am trying to predict those most likely to do this on any
  # given day. The event is rare: the majority of people do not do it at all on a given day. There are
  # 60K events and some individuals do the event multiple times
  age = c(20, 22, 24, 19, 22, 21, 35, 24, 24, 20), # age at time of event
  previous_events = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1), # count of previous events prior to the one happening on this event_date
  address = c("A123", "B123", "C123", "A123", "B123", "A123", "D123", "B123", "B123", "A1234")) # static postcode, a zipcode equivalent; it covers a large neighbourhood, so some people share a postcode
The aim is to have, for each day, a logistic regression model whose output shows the likelihood of each individual in the data undertaking an event on that day; as the accepted answer does, I would then top-slice by percentiles.
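One way to get from this event-level structure to the per-day binary `outcome` column that the model formula expects (the `date` and `outcome` names here are assumptions chosen to match the answer's code, not columns that exist in `full_data` as posted): for a given candidate day, a row's outcome is 1 if its event fell on that day and 0 otherwise.

```r
## assumes `full_data` as defined above; a minimal subset is used here so the
## sketch is self-contained (`outcome` is a derived column, not in the original)
full_data <- data.frame(
  ID = c(1, 2, 5),
  event_date = c("2019-09-27", "2020-01-01", "2020-03-01")
)
## convert the character dates once, up front
full_data$event_date <- as.Date(full_data$event_date)
## for a given day, the binary outcome is "did this row's event fall on that day?"
one_day <- as.Date("2020-01-01")
full_data$outcome <- as.integer(full_data$event_date == one_day)
```

In the real data you would also need a row per person per day (people with no event that day get outcome 0), e.g. by expanding over an ID-by-day grid before joining the events on.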
You can fit models in a loop, subsetting the data as needed and only storing the results that you want to keep; you mention the top 1%, 2%, and 3% for each day.
This code goes through a pre-set vector of dates and fits a model for each date using data from the previous days. It then runs a prediction for observations on that date, filters the results down to the top 3%, and puts them in a list.
## create a vector of days that you want to model
days = seq(from = as.Date("2021-01-01"), to = as.Date("2023-10-16"), by = "day")
## create a list to store the results
results = list()
## fit your models and store your predictions
for(i in seq_along(days)) {
  ## fit the model to earlier data
  day_model = glm(
    outcome ~ sex + events + age + address,
    family = "binomial",
    data = subset(full_data, date < days[i])
  )
  ## predict on the current day
  prediction = subset(full_data, date == days[i])
  prediction$pred = predict(
    day_model,
    newdata = prediction,
    type = "response"
  )
  ## calculate quantiles
  prediction$quantile = cut(
    prediction$pred,
    breaks = quantile(prediction$pred, probs = c(0, 0.97, 0.98, 0.99, 1)),
    labels = c("low", "97 %tile", "98 %tile", "99 %tile")
  )
  ## drop all below 97th %tile
  prediction = subset(prediction, quantile != "low")
  ## maybe drop other unneeded columns??
  ## keep the rest
  results[[i]] = prediction
}
names(results) = days
## access results with, e.g., `results[[40]]` for the 40th day
## or `results[["2022-01-05"]]` for a specific date by name
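If you later want all of the kept predictions in one frame, say to compare against your other predictive model, the per-day list can be collapsed with `do.call(rbind, ...)`. A sketch, using a toy stand-in for `results` so it runs on its own (each real element would be one day's filtered `prediction` frame):

```r
## toy stand-in for the named list built in the loop above
results <- list(
  "2021-01-01" = data.frame(ID = c(1, 4), pred = c(0.91, 0.88)),
  "2021-01-02" = data.frame(ID = 2, pred = 0.95)
)
## tag each element with its day, then stack everything into one data frame
for (d in names(results)) results[[d]]$day <- d
all_kept <- do.call(rbind, results)
```

Keeping the `day` column means you can still group or filter by date after stacking.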