vowpalwabbit

Vowpal Wabbit: question on training contextual bandit on historical data


I know from this page that there is an option to train a contextual bandit VW model on historical contextual bandit data collected using some exploration policy:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration, it assumes it can only use the currently available data logged using an exploration policy.

This is done by specifying --cb and passing data formatted as action:cost:probability | features:

1:2:0.4 | a c  
3:0.5:0.2 | b d  
4:1.2:0.5 | a b c  
2:1:0.3 | b c  
3:1.5:0.7 | a d 
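
For concreteness, here is a minimal sketch of training on data in this format through the Python bindings (the command-line equivalent would be vw --cb 4 -d data.dat). The constructor name depends on the installed version: pyvw.vw in older releases, vowpalwabbit.Workspace in newer ones.

# Minimal sketch: train a --cb policy on the logged examples above.
from vowpalwabbit import pyvw

logged = [
    "1:2:0.4 | a c",
    "3:0.5:0.2 | b d",
    "4:1.2:0.5 | a b c",
    "2:1:0.3 | b c",
    "3:1.5:0.7 | a d",
]

# --cb 4: optimize a policy over 4 actions from logged action:cost:probability data
vw = pyvw.vw("--cb 4 --quiet")
for example in logged:
    vw.learn(example)

# Predicting on a new, unlabeled context returns the action the learned policy would choose
print(vw.predict("| a c"))
vw.finish()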

My question is: is there a way to leverage historical data that was not collected by a contextual bandit policy, using --cb (or some other method) and some policy evaluation method? Say actions were chosen according to some deterministic, non-exploratory (Edit: biased) heuristic. In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).

I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assigning a reward of zero to unknown rewards), but the PMF seemed to collapse to zero over most actions.


Solution

  • My question is: is there a way to leverage historical data that was not collected by a contextual bandit policy, using --cb (or some other method) and some policy evaluation method? Say actions were chosen according to some deterministic, non-exploratory heuristic. In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).

    Yes, set the probability to 1. With a degenerate logging policy there are no theoretical guarantees, but in practice this can be helpful for initialization. Going forward you'll want some nondeterminism in your logging policy, or you will never improve.
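
    A sketch of what that might look like, assuming the Python bindings (pyvw.vw in older releases, vowpalwabbit.Workspace in newer ones) and epsilon-greedy exploration via --cb_explore; the logged tuples are hypothetical.

    # Initialize from deterministically logged data by setting the probability to 1.0,
    # then use --cb_explore so subsequent decisions are drawn from an exploratory PMF.
    import random
    from vowpalwabbit import pyvw

    # (action, cost, context) tuples logged by the deterministic heuristic -- hypothetical data
    historical = [(1, 2.0, "a c"), (3, 0.5, "b d"), (2, 1.0, "b c")]

    vw = pyvw.vw("--cb_explore 4 --epsilon 0.2 --quiet")

    # Warm up on the historical data, labelling every logged action with probability 1.0
    for action, cost, context in historical:
        vw.learn(f"{action}:{cost}:1.0 | {context}")

    # Online, --cb_explore predicts a PMF over the 4 actions; sampling from it keeps
    # the logging policy nondeterministic going forward.
    pmf = vw.predict("| a c")
    chosen = random.choices(range(1, len(pmf) + 1), weights=pmf)[0]
    print(pmf, chosen)
    vw.finish()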

    I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assigning a reward of zero to unknown rewards), but the PMF seemed to collapse to zero over most actions.

    If you actually have historical data that is fully labeled, you can use the warm-start functionality. If you are only pretending you have fully labeled data, I'm not sure it's better than just setting the probability to 1.
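
    A heavily hedged sketch of the warm-start path: it assumes VW's --warm_cb reduction, where the first --warm_start examples are consumed as supervised multiclass data and the next --interaction examples simulate the bandit phase. The exact flag names and required companion options vary by release, so verify them against vw --help for your version; the labelled examples below are hypothetical.

    # Assumption: the --warm_cb reduction with --warm_start / --interaction phases.
    from vowpalwabbit import pyvw

    # Fully labelled multiclass examples: "label | features" -- hypothetical data
    labelled = ["1 | a c", "3 | b d", "2 | b c", "1 | a d", "4 | a b"]

    # 4 actions; first 2 examples used for supervised warm start, next 3 for simulated interaction
    vw = pyvw.vw("--warm_cb 4 --warm_start 2 --interaction 3 --epsilon 0.05 --quiet")
    for example in labelled:
        vw.learn(example)
    vw.finish()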