My use case is to retrain and make predictions with VW contextual bandits (CB) in batch mode (retraining/inference happens nightly).
I'm reading this tutorial for offline policy evaluation in the batch scenario. I'm training on a logged dataset using:
--cb_adf --save_resume -f {MODEL_PATH} -d ./data/train.txt
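For reference, here is a minimal sketch of the cb_adf multiline format that train.txt contains (the feature names below are made up): each example block has an optional shared line for context features and one line per candidate action, with an action:cost:probability label on the line of the action that was actually taken, and blocks separated by a blank line.

shared | s_1 s_2
0:1.0:0.5 | a_1 b_1 c_1
| a_2 b_2 c_2
| a_3 b_3 c_3

Here the first action was the one taken, incurring a cost of 1.0, and it was logged with probability 0.5.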
To tune the epsilon hyperparameter on batch predictions, I run the following command three times (once per epsilon value) on a separate dataset:
-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
Whichever gives the lowest average loss is taken as the optimal epsilon.
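Concretely, the sweep looks something like this (a sketch assuming the vw binary is invoked directly; {MODEL_PATH} is a placeholder as above):

# evaluate the frozen model under three exploration levels and
# compare the "average loss" each run reports at the end
for eps in 0.1 0.2 0.3; do
  vw -i {MODEL_PATH} -t --cb_explore_adf --epsilon $eps -d ./data/eval.txt
done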
Am I using the right options? My confusion mostly comes from another option, --explore_eval. What is the difference between --explore_eval and --cb_explore_adf, and what is the right way to evaluate the model plus exploration offline? Should I just run:
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
and pick whichever epsilon gives the lowest average loss?
Regarding the first approach:

-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
I predict the result of this experiment: the optimal epsilon will be the smallest one. This is because, after the data has been collected, there is no value to exploration: on a fixed dataset, a larger epsilon only mixes more random actions into the greedy choice, which generally just raises the measured loss. To assess exploration, you have to change the data available at training time in a manner sensitive to the exploration algorithm. Which brings us to ...
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
'--explore_eval' is designed to assess exploration. It requires more data to work well (it discards an example whenever the logged action does not match what the exploration algorithm would have done), but it allows you to evaluate the exploration itself, since it simulates the fog of war.
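So a sweep along these lines (again a sketch assuming the vw binary), picking the epsilon with the lowest reported average loss, is the right way to tune exploration offline:

# simulate learning under each exploration level on the combined data
for eps in 0.1 0.2 0.3; do
  vw --explore_eval --epsilon $eps -d ./data/train+eval.txt
done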
If you are testing other model hyperparameters, such as the base learning algorithm or feature interactions, the extra data overhead of '--explore_eval' is unnecessary.