Exp3: or How to do well in an exam with 0 preparation under bandit feedback?

This is a continuation of my previous blog post on Hedge. Here we consider bandit feedback instead of full information.

The exam proceeds as follows: it has \(T\) questions, each of which has 4 options (A, B, C and D). The professor has assigned a different score to each option, and these scores also differ from question to question. After choosing an option for a question, you only get to know the score of the option you chose, not the scores of the other options. This is known as bandit feedback. Assume that the scores are always in \([0,1]\).

One of the four options (A, B, C and D) will have the highest cumulative score. You don’t know which one a priori. Your goal is to be competitive with the best option in hindsight. The scores of the options for each question are chosen adversarially by the professor: if you always choose option A, the professor can see this and always assign A a low score. Hence you have to randomize your choice of options.
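
To make the protocol concrete, here is a tiny Python sketch of one exam under bandit feedback. The score table and the uniformly random guessing strategy are made up purely for illustration; the professor’s actual scores are adversarial and hidden from you.

```python
import numpy as np

rng = np.random.default_rng(0)
T, options = 5, ["A", "B", "C", "D"]
scores = rng.uniform(size=(T, len(options)))  # professor's secret scores in [0, 1]

total = 0.0
for t in range(T):
    a_t = rng.integers(len(options))          # your choice (here: a uniformly random guess)
    total += scores[t, a_t]                   # bandit feedback: only scores[t, a_t] is revealed
print("your total score:", round(total, 2))
print("best single option in hindsight:", round(scores.sum(axis=0).max(), 2))
```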

Let the reward function for question \(t\) be \(r_t\) and the option you choose be \(a_t\). (Note that this setting is different from my previous post, where \(r_t(a) = 1\{a=a^\star_t\}\).)

Your goal is to maximize the cumulative reward:

\[\sum_{t=1}^T r_t(a_t)\]

This is equivalent to minimizing the regret, which is defined as follows:

\[R_T = \max_{a \in \{A,B,C,D\}} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T r_t(a_t)\]

Since you studied for the ML theory course, you recognize that the Exp3 algorithm applies in this setting, so you decide to use it. The algorithm is as follows:

  1. Initialize \(w^{(1)}_a = 1\) for \(a \in \{A,B,C,D\}\).
  2. For \(t=1\) to \(T\):
    1. Let \(Z_t = \sum_{a \in \{A,B,C,D\}} w^{(t)}_a\) and \(p^{(t)}_a = w^{(t)}_a/Z_t\). Choose \(a_t = a\) with probability \(p^{(t)}_a\).
    2. See the reward \(r_t(a_t)\).
    3. Let \(\hat{r}_t(a) = \frac{r_t(a)}{p^{(t)}_a}1_{a=a_t}\)
    4. Update \(w^{(t+1)}_a = w^{(t)}_a \exp(\eta \hat{r}_t(a))\).

Here \(\eta\) is a parameter that we need to choose. This algorithm is quite similar to Hedge. The major difference is that we use the estimate \(\hat{r}_t(a)\) to update \(w^{(t+1)}_a\), since we do not observe \(r_t(a)\) for the options we did not choose. Notice that \(\hat{r}_t(a)\) is an unbiased estimator of \(r_t(a)\):

\[\mathbb{E}_{a_t \sim p_a^{(t)}}[\hat{r}_t(a)] = \sum_{a_t \in \{A,B,C,D\}} p^{(t)}_{a_t}\frac{r_t(a)}{p^{(t)}_a}1_{a=a_t} = r_t(a)\]
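
Before diving into the analysis, here is a minimal Python sketch of Exp3. It is only an illustration: the uniformly random score table (standing in for the professor’s adversarial one) and the function name `exp3` are my own, not from any library.

```python
import numpy as np

def exp3(rewards, eta, rng):
    """Run Exp3 on a T x N table of rewards in [0, 1] under bandit feedback.

    `rewards` plays the role of the professor's hidden score table: the learner
    only ever sees the entry of the option it actually picks.
    """
    T, N = rewards.shape
    w = np.ones(N)                      # w^(1)_a = 1 for every option
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # p^(t)_a = w^(t)_a / Z_t
        a_t = rng.choice(N, p=p)        # choose a_t = a with probability p^(t)_a
        r = rewards[t, a_t]             # bandit feedback: only this score is revealed
        total += r
        r_hat = np.zeros(N)
        r_hat[a_t] = r / p[a_t]         # importance-weighted estimate of r_t
        w *= np.exp(eta * r_hat)        # w^(t+1)_a = w^(t)_a * exp(eta * r_hat_t(a))
        w /= w.max()                    # rescaling does not change p; avoids overflow
    return total

# Toy run: T questions, N = 4 options, uniformly random (not adversarial) scores.
rng = np.random.default_rng(1)
T, N = 10_000, 4
rewards = rng.uniform(size=(T, N))
eta = np.sqrt(2 * np.log(N) / (N * T))  # the step size derived at the end of the post
best = rewards.sum(axis=0).max()        # best single option in hindsight
print("regret:", round(best - exp3(rewards, eta, rng), 1))
```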

Like before, since this is a randomized algorithm, we analyse the expected regret.

Analysis:

Since Hedge works with any sequence of reward vectors, we can apply the Hedge regret bound from the previous post to the estimated rewards \(\hat{r}_t(a)\):

\[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T \hat{r}_t(a) - \sum_{t=1}^T\langle p^{(t)},\hat{r}_t(a) \rangle \leq \frac{\eta}{2} \sum_{t=1}^T \langle p^{(t)},\hat{r}_t(a)^2\rangle + \frac{\log 4}{\eta}\]

Taking expectation over \(a_t \sim p_a^{(t)}\) on both sides:

\[\mathbb{E}_{a_t \sim p_a^{(t)} }[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T \hat{r}_t(a)] - \mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T\langle p^{(t)},\hat{r}_t(a)\rangle] \leq \frac{\eta}{2} \mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T \langle p^{(t)},\hat{r}_t(a)^2\rangle] + \frac{\log 4}{\eta}\]

First consider the term \(\mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T\langle p^{(t)},\hat{r}_t(a)\rangle]\):

\[\mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T\langle p^{(t)},\hat{r}_t(a)\rangle] = \sum_{t=1}^T \mathbb{E}_{a_t \sim p_a^{(t)} }[\langle p^{(t)},\hat{r}_t(a)\rangle] = \sum_{t=1}^T \langle p^{(t)},r_t(a)\rangle\]

This is the expected reward of the learner.

Next consider the term \(\mathbb{E}_{a_t \sim p_a^{(t)} }[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T \hat{r}_t(a)]\). Since max is convex, we can apply Jensen’s inequality:

\[\mathbb{E}_{a_t \sim p_a^{(t)} }[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T \hat{r}_t(a)] \geq \max_{a \in \{A,B,C,D\}} \mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T \hat{r}_t(a)] = \max_{a \in \{A,B,C,D\}} \sum_{t=1}^T r_t(a)\]

Combining these two, we can upper bound the expected regret:

\[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T \langle p^{(t)},r_t(a)\rangle \leq \frac{\eta}{2} \mathbb{E}_{a_t \sim p_a^{(t)} }[\sum_{t=1}^T \langle p^{(t)},\hat{r}_t(a)^2\rangle] + \frac{\log 4}{\eta}\]

Finally, consider the term \(\mathbb{E}_{a_t \sim p_a^{(t)} }[ \langle p^{(t)},\hat{r}_t(a)^2\rangle]\):

\[\mathbb{E}_{a_t \sim p_a^{(t)} }[\langle p^{(t)},\hat{r}_t(a)^2\rangle] = \sum_{a_t} \sum_a p_{a}^{(t)}p_{a_t}^{(t)} \frac{r_t(a)^2}{(p^{(t)}_a)^2} 1_{a=a_t} = \sum_{a} r_t(a)^2 \leq \sum_{a} 1 = 4\]
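
As a quick sanity check, both the unbiasedness of \(\hat{r}_t(a)\) and this second-moment identity are easy to verify numerically. The probability vector and reward vector below are arbitrary choices made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])   # some p^(t)
r = np.array([0.7, 0.2, 0.9, 0.5])   # some r_t with values in [0, 1]

n = 200_000
draws = rng.choice(4, size=n, p=p)   # samples of a_t ~ p^(t)
r_hat = np.zeros((n, 4))
r_hat[np.arange(n), draws] = r[draws] / p[draws]   # importance-weighted estimates

print("E[r_hat]        ~", r_hat.mean(axis=0).round(3), "vs r =", r)
print("E[<p, r_hat^2>] ~", round(float((r_hat ** 2 @ p).mean()), 3),
      "vs sum_a r_t(a)^2 =", round(float((r ** 2).sum()), 3))
```

Both estimates match the calculations above, and the second quantity is indeed at most 4.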

Substituting this bound into the regret inequality, we get:

\[\max_{a \in \{A,B,C,D\}} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T \langle p^{(t)},r_t(a)\rangle \leq \frac{4\eta T}{2} + \frac{\log 4}{\eta}\]
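
The right-hand side is minimized by balancing the two terms: setting its derivative with respect to \(\eta\) to zero gives

\[\frac{d}{d\eta}\left(\frac{4\eta T}{2} + \frac{\log 4}{\eta}\right) = 2T - \frac{\log 4}{\eta^2} = 0 \quad\Longrightarrow\quad \eta = \sqrt{\frac{2\log 4}{4T}},\]

and at this value both terms equal \(\sqrt{2T\log 4}\).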

Choosing \(\eta = \sqrt{\frac{2\log 4}{4T}}\), we get \(\mathbb{E}[R_T] \leq \sqrt{8T\log 4}\).

In general, the regret inequality for \(N\) options in the bandit setting can be stated as:

\[\max_{a \in \{1,\dots,N\}} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T\langle p^{(t)},r_t(a) \rangle \leq \frac{\eta}{2} \sum_{t=1}^T \mathbb{E}[\langle p^{(t)},\hat{r}_t(a)^2\rangle] + \frac{\log N}{\eta} \leq \frac{\eta NT}{2} + \frac{\log N}{\eta}\]

Choosing \(\eta = \sqrt{\frac{2 \log N}{TN}}\), we get that \(\mathbb{E}[R_T] \leq \sqrt{2TN\log N}\). In my post on Hedge, I proved a regret bound of \(\sqrt{2T\log N}\) for the full-information setting. Exp3, which essentially runs Hedge on unbiased estimates of the reward function, has an expected regret of at most \(\sqrt{2TN \log N}\), which is a factor of \(\sqrt{N}\) worse than the full-information case.
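
For instance, with \(T = 10000\) questions and \(N = 4\) options, the full-information bound is \(\sqrt{2T\log N} \approx 166\), while the bandit bound is \(\sqrt{2TN\log N} \approx 333\), exactly a factor of \(\sqrt{4} = 2\) larger.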

Written on May 18, 2018