Tags: python, optimization, data-science, vowpalwabbit, convergence

Optimizing CTR (click-through rate) with VowpalWabbit: How to choose suitable parameters?


I am trying to optimize the click-through rate of an article or advertisement (the action) given a device type (the context) with VowpalWabbit, following this VW tutorial. However, I am not able to make it converge to the optimal actions reliably.

I created a minimal working example (sorry for the length):

import random
import numpy as np
from matplotlib import pyplot as plt
from vowpalwabbit import pyvw

plt.ion()

action_space = ["article-1", "article-2", "article-3"]

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)

def to_vw_example_format(context, cb_label=None):
    if cb_label is not None:
        chosen_action, cost, prob = cb_label
    example_string = ""
    example_string += "shared |User device={} \n".format(context)
    for action in action_space:
        if cb_label is not None and action == chosen_action:
            example_string += "1:{}:{} ".format(cost, prob)
        example_string += "|Action ad={} \n".format(action)
    # Strip the last newline
    return example_string[:-1]

# definition of problem to solve, playing out the article with highest ctr given a context
context_to_action_ctr = {
    "device-1": {"article-1": 0.05, "article-2": 0.06, "article-3": 0.04},
    "device-2": {"article-1": 0.08, "article-2": 0.07, "article-3": 0.05},
    "device-3": {"article-1": 0.01, "article-2": 0.04, "article-3": 0.09},
    "device-4": {"article-1": 0.04, "article-2": 0.04, "article-3": 0.045},
    "device-5": {"article-1": 0.09, "article-2": 0.01, "article-3": 0.07},
    "device-6": {"article-1": 0.03, "article-2": 0.09, "article-3": 0.04}
}

#vw = f"--cb_explore 3 -q UA -q UU --epsilon 0.1"
vw = f"--cb_explore_adf -q UA -q UU --bag 5 "
#vw = f"--cb_explore_adf -q UA --epsilon 0.2"

actor = pyvw.vw(vw)

random_rewards = []
actor_rewards = []
optimal_rewards = []

for step in range(200000):

    # pick a random context
    device = random.choice(list(context_to_action_ctr.keys()))

    # let vw generate probability distribution
    # action_probabilities = np.array(actor.predict(f"|x device:{device}"))
    action_probabilities = np.array(actor.predict(to_vw_example_format(device)))

    # sample action
    probabilities = action_probabilities / action_probabilities.sum()
    action_idx = np.random.choice(len(probabilities), 1, p=probabilities)[0]
    probability = action_probabilities[action_idx]

    # get reward/regret
    action_to_reward_regret = {
        action: (1, 0) if random.random() < context_to_action_ctr[device][action] else (0, 1) for action in action_space
    }

    actor_action = action_space[action_idx]
    random_action = random.choice(action_space)
    optimal_action = {
        "device-1": "article-2",
        "device-2": "article-1",
        "device-3": "article-3",
        "device-4": "article-3",
        "device-5": "article-1",
        "device-6": "article-2",
    }[device]

    # update statistics
    actor_rewards.append(action_to_reward_regret[actor_action][0])
    random_rewards.append(action_to_reward_regret[random_action][0])
    optimal_rewards.append(action_to_reward_regret[optimal_action][0])

    # learn online
    reward, regret = action_to_reward_regret[actor_action]
    cost = -1 if reward == 1 else 0
    # actor.learn(f"{action_idx+1}:{cost}:{probability} |x device:{device}")
    actor.learn(to_vw_example_format(device, (actor_action, cost, probability)))


    if step % 100 == 0 and step > 1000:
        plt.clf()
        axes = plt.gca()
        plt.title("Reward over time")
        plt.plot(running_mean(actor_rewards, 10000), label=str(vw))
        plt.plot(running_mean(random_rewards, 10000), label="Random actions")
        plt.plot(running_mean(optimal_rewards, 10000), label="Optimal actions")
        plt.legend()
        plt.pause(0.0001)

Essentially, there are three possible actions (articles 1-3) and six contexts (devices 1-6), each combination having a certain CTR, and each context having an optimal action (the article with the highest CTR for that device). In each iteration a random context is sampled and the reward/regret for every action is computed. The cost passed to VowpalWabbit is -1 if the reward is 1 (the user clicked) and 0 if the reward is 0 (the user did not click). Over time, the algorithm is supposed to find the best article for each device.
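For concreteness, here is the ADF example string that to_vw_example_format produces for one labelled interaction, assuming (purely for illustration) that device-1 was shown article-2 with probability 0.4 and the user clicked, i.e. cost -1:

print(to_vw_example_format("device-1", ("article-2", -1, 0.4)))
# shared |User device=device-1
# |Action ad=article-1
# 1:-1:0.4 |Action ad=article-2
# |Action ad=article-3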

Some results (average reward over time): [plot: not converging], [plot: better result]

The problems:

Since the CTRs are fairly small and many playouts are needed for convergence, I understand that this is a difficult problem. I would, however, expect the algorithm to find the optimum over time.

Did I miss something in the configuration of VowpalWabbit?


Solution

  • Just "--bag 5" without an epsilon parameter can start to produce zero probabilities once all policies in the bag agree with each other.

    I modified your code a bit to track the probabilities returned by actor.predict (a minimal sketch of this tracking is included at the end of this answer).

    First plot: a moving average of the predicted probabilities for the "optimal actions". We can see that a few of them are stuck around 0, and the tail of the decision table shows that those distributions all look like [1, 0, 0]. So there is no chance to recover from this point, since exploration is effectively turned off.

    [plot: moving average of the predicted probability of the optimal action per context]

    Adding a small amount of exploration (--bag 5 --epsilon 0.02) helps it eventually converge to the optimum and gives plots like this:

    [plot: reward over time with --bag 5 --epsilon 0.02]

    Learning is not fast, but the problematic contexts are the most ambiguous ones and do not contribute much to the regret.
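    For reference, here is a minimal sketch of the probability tracking mentioned above, reusing actor, action_space and to_vw_example_format from the question; the helper name and the history dict are hypothetical and this is not the exact code used to produce the plots:

    from collections import defaultdict
    import numpy as np

    # Hypothetical tracking structure: predicted probability of the known
    # optimal action, recorded per context over time.
    optimal_prob_history = defaultdict(list)

    def track_optimal_probability(device, optimal_action):
        probs = np.array(actor.predict(to_vw_example_format(device)))
        probs = probs / probs.sum()
        optimal_prob_history[device].append(probs[action_space.index(optimal_action)])

    # A small epsilon next to bagging keeps these probabilities bounded away
    # from zero, which is the fix described above:
    # vw = "--cb_explore_adf -q UA -q UU --bag 5 --epsilon 0.02"

    Calling track_optimal_probability(device, optimal_action) once per loop iteration and plotting a running mean of each optimal_prob_history[device] should give curves similar to the first plot.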