Revisiting Online Linear Optimization with Follow the Leader Algorithms

This post is a continuation of my previous posts on Online Learning. Here is a quick recap.

In the post on Online Combinatorial Optimization (OCO), I talked about the Hedge algorithm, which in the context of OCO is also called Exp2. For the problem of Online Linear Optimization (OLO) on the Hypercube, Exp2 is an exponential time algorithm, as it maintains a probability distribution over all \(2^n\) vertices of the hypercube.

The Exp2 Algorithm:

  1. Initialize \(w_{1}(X)=1 \quad\forall X \in \{0,1\}^n\)
  2. For \(t=1\) to \(T\):
    1. Let \(Z_t = \sum_{X \in \{0,1\}^n} w_{t}(X)\) and \(p_{t}(X) = w_{t}(X)/Z_t\). Play \(X_t\) according to the probability distribution \(p_{t}(X)\).
    2. See the loss vector \(l_{t} \in [-1,1]^n\) and incur loss \(l_t^\top X_t\)
    3. Update \(w_{t+1}(X) = w_{t}(X) \exp(-\eta l_t^\top X)\) or equivalently \(w_{t+1}(X) = \exp(-\eta \sum_{\tau = 1}^t l_\tau^\top X)\)
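To make the exponential cost concrete, here is a minimal Python sketch of Exp2. The function name and the loss-array interface are my own choices for illustration, not from any library. Note the \(2^n\)-sized weight vector:

```python
import itertools
import numpy as np

def exp2(losses, eta, seed=0):
    """Naive Exp2 for OLO on the hypercube: Theta(2^n) space and time per round.
    `losses` is a (T, n) array of loss vectors (a made-up interface)."""
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    vertices = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n vertices
    log_w = np.zeros(len(vertices))                 # log-weights, w_1(X) = 1
    total_loss = 0.0
    for l_t in losses:
        p = np.exp(log_w - log_w.max())             # stable softmax: p_t = w_t / Z_t
        p /= p.sum()
        X_t = vertices[rng.choice(len(vertices), p=p)]
        total_loss += float(l_t @ X_t)
        log_w -= eta * (vertices @ l_t)             # w_{t+1}(X) = w_t(X) exp(-eta l_t^T X)
    return total_loss
```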

In the same post, I came up with another algorithm, which I now call PolyExp, that runs in polynomial time. More importantly, I proved that PolyExp is equivalent to Exp2 for OLO on the Hypercube.

The PolyExp algorithm:

  1. Initialize \(x_{i,1}=1/2 \quad \forall i \in [n]\)
  2. For \(t=1\) to \(T\):
    1. Sample \(X_{i,t} \sim \text{Bernoulli}(x_{i,t}) \quad \forall i \in [n]\) and play \(X_t = [X_{1,t},X_{2,t},\dots,X_{n,t}]\)
    2. See the loss vector \(l_{t} \in [-1,1]^n\) and incur loss \(l_t^\top X_t\)
    3. Update
\[x_{i,t+1} = \frac{1}{1+\exp(\eta \sum_{\tau=1}^t l_{i,\tau})} \quad \text{or} \quad x_{i,t+1} = \frac{x_{i,t}}{x_{i,t} + (1-x_{i,t})\exp(\eta l_{i,t})}\]
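For contrast, here is a matching sketch of PolyExp under the same made-up interface. It maintains only \(n\) numbers, and each round costs \(O(n)\):

```python
import numpy as np

def polyexp(losses, eta, seed=0):
    """PolyExp for OLO on the hypercube: O(n) space and time per round."""
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    cum = np.zeros(n)                               # sum_{tau < t} l_{i,tau}
    total_loss = 0.0
    for l_t in losses:
        x_t = 1.0 / (1.0 + np.exp(eta * cum))       # starts at 1/2 when cum = 0
        X_t = (rng.random(n) < x_t).astype(float)   # X_{i,t} ~ Bernoulli(x_{i,t})
        total_loss += float(l_t @ X_t)
        cum += l_t                                  # fold in the new loss vector
    return total_loss
```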

Analyzing Exp2 directly gives a regret bound of \(O(n^{3/2}\sqrt{T})\) in the full information setting.

In the post on Online Mirror Descent (OMD), I introduced the OMD algorithm.

Online Mirror Descent:

  1. The player picks \(y_1 = \nabla R^\star (0)\) and \(x_1 = \arg \min \limits_{x\in [0,1]^n} D_R(x\|y_1)\)
  2. For \(t=1...T\)
    1. Let \(p_t\) be a distribution on \(\{0,1\}^n\) such that \(\mathbb{E}_{X \sim p_t}[X]=x_t\). Player plays \(X_t \sim p_t\).
    2. See the loss vector \(l_{t} \in [-1,1]^n\) and incur loss \(l_t^\top X_t\).
    3. Let \(y_{t+1} = \nabla R^\star( \nabla R(x_{t}) - \eta l_t)\)
    4. Update \(x_{t+1} = \arg \min \limits_{x\in[0,1]^n} D_R(x\|y_{t+1})\)

Another way to write this update is:

\[x_{t+1} = \arg \min \limits_{x\in[0,1]^n} \left[ \eta l_t^\top x + D_R(x\|x_{t})\right]\]

I showed that PolyExp is equivalent to OMD with regularizer \(R(x) = \sum_{i=1}^n x_i\log x_i +(1-x_i) \log (1-x_i)\) and Bernoulli Sampling scheme \(X_{i,t} \sim \text{Bernoulli}(x_{i,t}) \quad \forall i \in [n]\) in step 2.
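For this regularizer the gradient maps have closed forms, \(\nabla R(x)_i = \log\frac{x_i}{1-x_i}\) and \(\nabla R^\star(\theta)_i = (1+\exp(-\theta_i))^{-1}\), so one OMD step is just a sigmoid applied to a shifted logit. A minimal sketch (function name mine):

```python
import numpy as np

def omd_entropic_step(x_t, l_t, eta):
    """One OMD step with R(x) = sum_i [x_i log x_i + (1 - x_i) log(1 - x_i)]."""
    theta = np.log(x_t / (1.0 - x_t)) - eta * l_t   # grad R(x_t) - eta * l_t
    y = 1.0 / (1.0 + np.exp(-theta))                # y_{t+1} = grad R*(theta)
    return y    # already in [0,1]^n, so the Bregman projection is a no-op
```

A line of algebra confirms this matches PolyExp's update: \(\sigma\!\left(\log\frac{x_{i,t}}{1-x_{i,t}} - \eta l_{i,t}\right) = \frac{x_{i,t}}{x_{i,t} + (1-x_{i,t})\exp(\eta l_{i,t})}\).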

Using OMD’s analysis, we were able to show that PolyExp has a regret bound of \(O(n\sqrt{T})\) in the full information setting.

Because of the regularization function we chose, we don't actually have to perform the Bregman projection step in OMD, as \(y_{t+1}\) already lies inside \([0,1]^n\). So, it is easy to come up with a Follow the Regularized Leader (FTRL) algorithm that is equivalent to the above OMD.

Follow the Regularized Leader algorithm:

  1. The player picks \(x_1 = \arg \min_{x \in [0,1]^n} R(x)\)
  2. For \(t=1...T\)
    1. Let \(p_t\) be a distribution on \(\{0,1\}^n\) such that \(\mathbb{E}_{X \sim p_t}[X]=x_t\). Player plays \(X_t \sim p_t\).
    2. See the loss vector \(l_{t} \in [-1,1]^n\) and incur loss \(l_t^\top X_t\).
    3. Update
\[x_{t+1} = \arg \min_{x \in [0,1]^n} \left[\eta \sum_{\tau=1}^t l_\tau^\top x + R(x)\right]\]

Using \(R(x) = \sum_{i=1}^n x_i\log x_i +(1-x_i) \log (1-x_i)\), we can solve the argmin problem in the above step in closed form by differentiating and setting the derivative to 0.

\[\begin{align} \frac{\partial}{\partial x_i} \left(\eta\sum_{\tau=1}^t l_\tau^\top x + R(x)\right) &= \eta\sum_{\tau=1}^t l_{i,\tau} + \log x_i - \log(1-x_i) = 0\\ \implies x_i&= \frac{1}{1+\exp(\eta \sum_{\tau=1}^t l_{i,\tau})} \end{align}\]

This is the same update equation as PolyExp's, and the sampling scheme is, as before, Bernoulli.
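As a sanity check, this closed form can be compared against a direct numerical minimization of the FTRL objective. A sketch with made-up values for \(\eta\) and the cumulative losses:

```python
import numpy as np
from scipy.optimize import minimize_scalar

eta = 0.1
L = np.array([0.3, -0.7, 1.2])        # cumulative losses sum_{tau<=t} l_{i,tau} (made up)

closed_form = 1.0 / (1.0 + np.exp(eta * L))

# The FTRL objective separates over coordinates, so minimize each one directly.
def obj(x, L_i):
    return eta * L_i * x + x * np.log(x) + (1 - x) * np.log(1 - x)

numeric = np.array([
    minimize_scalar(obj, bounds=(1e-9, 1 - 1e-9), args=(L_i,), method="bounded").x
    for L_i in L
])
print(np.allclose(closed_form, numeric, atol=1e-4))   # True
```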

Now, I will describe a Follow the Perturbed Leader(FTPL) algorithm and prove that it is equivalent to PolyExp.

Follow the Perturbed Leader algorithm:

  1. The player picks \(X_1 = \arg \min_{x \in \{0,1\}^n} v^\top x\), where \(v=[v_1, v_2, \dots, v_n]\) has entries drawn iid from the Logistic distribution with CDF \((1+\exp(-\theta))^{-1}\).
  2. For \(t=1...T\)
    1. Player plays \(X_t\).
    2. See the loss vector \(l_{t} \in [-1,1]^n\) and incur loss \(l_t^\top X_t\).
    3. Draw a fresh \(v=[v_1, v_2, \dots, v_n]\) with entries iid from the Logistic distribution. Update
\[X_{t+1} = \arg \min_{x \in \{0,1\}^n} \left[\eta \sum_{\tau=1}^t l_\tau^\top x + v^\top x\right]\]

Another way to write this update is

\[X_{t+1} = \arg \min_{x \in \{0,1\}^n} \left[\eta l_t^\top x + v_t^\top x\right]\]

Here \(v_{t,i}\) is drawn from the Logistic distribution with mean \(\mu_i = \eta \sum_{\tau=1}^{t-1} l_{i,\tau}\), i.e., with CDF \((1+\exp(-\theta + \eta \sum_{\tau=1}^{t-1}l_{i,\tau}))^{-1}\). The two forms are equivalent because shifting the noise by the past cumulative loss absorbs the \(\eta \sum_{\tau=1}^{t-1} l_\tau^\top x\) term of the original objective.

First, note that \(\arg \min_{x \in \{0,1\}^n} \left[\eta \sum_{\tau=1}^t l_\tau^\top x + v^\top x\right]\) can be solved easily: since the objective is linear, it decomposes over coordinates, so \(X_i=1\) if \(\eta \sum_{\tau=1}^t l_{i,\tau} + v_i \leq 0\) and \(X_i=0\) otherwise. We do not need to do Bernoulli sampling, as we get a vertex of the hypercube directly from the minimization.
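Putting these observations together, here is a minimal sketch of the full FTPL loop (interface mine, matching the earlier sketches):

```python
import numpy as np

def ftpl(losses, eta, seed=0):
    """FTPL on the hypercube with fresh iid Logistic perturbations each round."""
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    cum = np.zeros(n)                                # sum_{tau < t} l_{i,tau}
    total_loss = 0.0
    for l_t in losses:
        v = rng.logistic(loc=0.0, scale=1.0, size=n) # standard Logistic noise
        X_t = (eta * cum + v <= 0).astype(float)     # per-coordinate argmin
        total_loss += float(l_t @ X_t)
        cum += l_t
    return total_loss
```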

Next, as \(v\) is random, the output of the minimization step \(X_{t+1}\) is also random. Moreover, using the above observation and the fact that \(v\) is a vector of iid random variables, we can conclude that the coordinates \(X_{i,t+1}\) are independent. We have:

\[\begin{align} \Pr(X_{i,t+1}=1) &= \Pr(\eta \sum_{\tau=1}^t l_{i,\tau} + v_i \leq 0)\\ &=\Pr( v_i \leq -\eta \sum_{\tau=1}^t l_{i,\tau} )\\ &=\frac{1}{1+\exp(\eta \sum_{\tau=1}^t l_{i,\tau})} \end{align}\]

To obtain the last equality, we use the fact that \(v_i\) is drawn from the standard Logistic distribution. This is exactly the probability of picking \(X_{i,t+1}=1\) in the PolyExp algorithm, where \(X_{i,t+1} \sim \text{Bernoulli}(x_{i,t+1})\) with \(x_{i,t+1} = (1+\exp(\eta \sum_{\tau=1}^t l_{i,\tau}))^{-1}\).
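A quick Monte Carlo check of this identity, with made-up values for \(\eta\) and the cumulative loss of one coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, L_i = 0.5, 1.3                         # eta and one cumulative loss sum (made up)
v = rng.logistic(size=1_000_000)            # standard Logistic samples
empirical = np.mean(eta * L_i + v <= 0)     # Pr(X_{i,t+1} = 1) under FTPL
polyexp_prob = 1.0 / (1.0 + np.exp(eta * L_i))
print(empirical, polyexp_prob)              # agree up to Monte Carlo error
```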

In conclusion, we have the following equivalences for OLO on the Hypercube.

  • Exp2
  • PolyExp
  • OMD with Entropic Regularization and Bernoulli Sampling
  • FTRL with Entropic Regularization and Bernoulli Sampling
  • FTPL with perturbations drawn iid from the Logistic distribution.

This is not very surprising. For instance, it is a well-known fact that for the Experts problem on the probability simplex, we have the following equivalences:

  • Hedge
  • OMD with Entropic Regularization
  • FTRL with Entropic Regularization
  • FTPL with Gumbel noise.
Written on January 19, 2019