Yet Another Derivation of Backpropagation in Matrix Form (this time using Adjoints)

Over 6 years ago, I wrote a blog post on the Derivation of Backpropagation in Matrix Form. My toolkit then consisted of a basic understanding of the chain rule, some linear algebra, and dimension checking. And so I set about writing that post. With some questionable usage of the chain rule, I derived the backpropagation equations, telling myself at every step: the dimensions check out, so it must be correct. That post became the most popular one I have ever written, and it still brings in a few hundred visitors each week.

A few years ago, I came across the Method of Adjoints on Prof. Ben Recht’s excellent blog, where he also details how to derive backpropagation using this method. So, I decided to work through the exercise myself and see whether my derivation from 6 years ago was correct.

Consider a simple feedforward network with \(L\) layers. At layer \(i\), we have the computation \(z_i = f_i(W_i z_{i-1})\), where \(f_i\) is the activation function, \(W_i\) is the weight matrix, \(z_{i-1}\) is the input, and \(z_i\) is the output.

On a training sample \((x,y)\), the forward pass proceeds as follows:

\[\begin{align*} z_1 &= f_1(W_1 x)\\ z_i &= f_i(W_i z_{i-1}) \quad \text{for}\quad i=2,3,\dots,L\\ \end{align*}\]
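
Concretely, here is a minimal NumPy sketch of this forward pass. The layer sizes, the shared tanh activation, and the random weights are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not part of the derivation): layer sizes and a shared tanh activation.
sizes = [4, 8, 8, 2]                       # dimensions of x, z_1, z_2, z_3
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
f = np.tanh                                # same f_i at every layer, for simplicity

def forward(x, Ws):
    """Return [z_0, z_1, ..., z_L] with z_0 = x and z_i = f(W_i z_{i-1})."""
    zs = [x]
    for W in Ws:
        zs.append(f(W @ zs[-1]))
    return zs

x = rng.standard_normal(sizes[0])
zs = forward(x, Ws)                        # zs[-1] is the network output z_L
```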

Finally, we use the training loss \(l\) to see how the output of the network \(z_L\) compares to the target \(y\). For simplicity, assume \(z_0=x\). Consider the optimization problem:

\[\begin{align*} \min_{W_1,\dots, W_L}\quad & l(z_L,y)\\ \text{such that} \quad & z_i = f_i(W_i z_{i-1}) \quad \text{for}\quad i=1,2,\dots,L\\ \end{align*}\]

The Lagrangian of this optimization problem, with Lagrange multipliers \(\lambda_1,\dots, \lambda_L\), is:

\[\mathcal{L} = l(z_L,y) + \sum_{i=1}^L \lambda_i^\top(f_i(W_iz_{i-1})-z_i)\]
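
For concreteness, with \(L=2\) (and \(z_0 = x\)) this reads:

\[\mathcal{L} = l(z_2,y) + \lambda_1^\top(f_1(W_1 x)-z_1) + \lambda_2^\top(f_2(W_2 z_1)-z_2)\]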

All we need to do now is compute the derivatives wrt \(\lambda_i\), \(z_i\), and \(W_i\) for \(i=1,\dots,L\) and set them to \(0\).

First, the \(\lambda_i\)s:

\[\nabla_{\lambda_i} \mathcal{L} = f_i(W_iz_{i-1})-z_i=0\implies z_i = f_i(W_iz_{i-1})\]

This recovers the forward propagation.

Now, consider the following useful result:

Lemma: Let \(g(z,W) = \lambda^\top f(Wz)\). Then \(\nabla_z g = W^\top (\lambda \circ f'(Wz))\) and \(\nabla_W g= (\lambda \circ f'(Wz)) z^\top\)

Proof: Expanding \(g\), we have: \(g(z,W) = \sum_{i} \lambda_i f(w_i ^\top z)\). Here \(w_i\) is the \(i\)-th row of \(W\).

\[\begin{align*} \nabla_zg = \sum_i\lambda_i f'(w_i^\top z)w_i = W^\top(\lambda \circ f'(Wz)) \end{align*}\]

We can further expand \(g\) as: \(g(z,W) = \sum_{i} \lambda_i f\left(\sum_{k} w_{i,k} z_k\right)\).

\[\frac{\partial g}{\partial w_{i,j}} = \lambda_i f'\left(\sum_{k} w_{i,k} z_k\right) z_j = \lambda_if'( w_{i}^\top z)z_j\]

So, in matrix form:

\[\nabla_W g = \left[\frac{\partial g}{\partial w_{i,j}}\right] = (\lambda \circ f'(Wz)) z^\top\]
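
As a quick sanity check, the lemma can be verified numerically against finite differences. The sigmoid activation and the dimensions below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
dsigmoid = lambda u: sigmoid(u) * (1.0 - sigmoid(u))

n_out, n_in = 3, 5                               # arbitrary dimensions
W = rng.standard_normal((n_out, n_in))
z = rng.standard_normal(n_in)
lam = rng.standard_normal(n_out)

g = lambda z, W: lam @ sigmoid(W @ z)            # g(z, W) = lambda^T f(Wz)

# Closed-form gradients from the lemma.
grad_z = W.T @ (lam * dsigmoid(W @ z))
grad_W = np.outer(lam * dsigmoid(W @ z), z)

# Central finite differences for grad_z (grad_W can be checked the same way).
eps = 1e-6
grad_z_fd = np.array([
    (g(z + eps * np.eye(n_in)[k], W) - g(z - eps * np.eye(n_in)[k], W)) / (2 * eps)
    for k in range(n_in)
])
print(np.allclose(grad_z, grad_z_fd))            # True
```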

Using this lemma, we can compute the derivatives wrt \(z_i\)s. For \(i=1,\dots, L-1\):

\[\begin{align*} &\nabla_{z_i} \mathcal{L} = -\lambda_i + W_{i+1}^\top (\lambda_{i+1} \circ f_{i+1}'(W_{i+1}z_i)) =0\\ \implies& \lambda_i = W_{i+1}^\top (\lambda_{i+1} \circ f_{i+1}'(W_{i+1}z_i)) \end{align*}\]

And:

\[\begin{align*} &\nabla_{z_L} \mathcal{L} = \nabla_{z_L} l(z_L,y) - \lambda_L = 0 \implies \lambda_L = \nabla_{z_L} l(z_L,y) \end{align*}\]

Finally, once the forward and adjoint equations hold, the gradient of the Lagrangian wrt \(W_i\) coincides with the gradient of the training loss, and the lemma gives:

\[\nabla_{W_i} \mathcal{L} = (\lambda_i \circ f_{i}'(W_iz_{i-1})) z_{i-1}^\top\]

Assuming this is a regression problem, we can set \(l(a,b) = \frac{1}{2}\|a-b\|_2^2\). So, the backward pass is:

\[\begin{align*} \lambda_L &= (z_L-y)\\ \lambda_i &= W_{i+1}^\top (\lambda_{i+1} \circ f_{i+1}'(W_{i+1}z_i)) \quad \text{for}\quad i=L-1,\dots,1 \end{align*}\]

And the gradient descent weight update is:

\[\begin{align*} \nabla_{W_i} \mathcal{L} &= (\lambda_i \circ f_{i}'(W_iz_{i-1})) z_{i-1}^\top \quad \text{for}\quad i=1,\dots,L\\ W_i &= W_i - \eta\nabla_{W_i} \mathcal{L} \end{align*}\]
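
Putting the forward pass, the adjoint recursion, and the weight update together, here is a hedged end-to-end sketch for the regression loss above. It reuses the illustrative tanh network from the forward-pass sketch; the helper names and the learning rate are my own choices, not anything from the original posts.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                            # same illustrative network as above
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
f, df = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

def forward(x, Ws):
    zs = [x]
    for W in Ws:
        zs.append(f(W @ zs[-1]))
    return zs                                   # zs[i] is z_i, zs[0] = x

def backward(zs, y, Ws):
    """Adjoint recursion and gradients; Ws[i-1] plays the role of W_i."""
    L = len(Ws)
    lam = [None] * (L + 1)                      # lam[i] is lambda_i (index 0 unused)
    lam[L] = zs[L] - y                          # lambda_L = z_L - y
    for i in range(L - 1, 0, -1):               # lambda_i = W_{i+1}^T (lambda_{i+1} o f'(W_{i+1} z_i))
        lam[i] = Ws[i].T @ (lam[i + 1] * df(Ws[i] @ zs[i]))
    # nabla_{W_i} = (lambda_i o f'(W_i z_{i-1})) z_{i-1}^T
    return [np.outer(lam[i] * df(Ws[i - 1] @ zs[i - 1]), zs[i - 1]) for i in range(1, L + 1)]

x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])
zs = forward(x, Ws)
grads = backward(zs, y, Ws)
eta = 0.1                                       # illustrative learning rate
Ws = [W - eta * G for W, G in zip(Ws, grads)]   # gradient descent step
```

The gradients computed this way can be checked against finite differences of \(l(z_L,y)\) in the same manner as the lemma above.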

Comparison with the last time I derived Backpropagation

The backward pass from that post looks like this:

Backward Pass:

\[\begin{align*} \delta_L&= (z_L-y)\circ f_L'(W_Lz_{L-1})\\ \delta_i &= W_{i+1}^T\delta_{i+1}\circ f_i'(W_iz_{i-1}) \quad \text{for}\quad i=L-1,\dots,1\\ \end{align*}\]

Weight Update:

\[\begin{align*} \frac{\partial E}{\partial W_i}&=\delta_iz_{i-1}^T\\ W_i&=W_i - \eta\circ \frac{\partial E}{\partial W_i} \end{align*}\]

Comparing the two, we can see that:

\[\delta_i = \lambda_i \circ f_{i}'(W_iz_{i-1})\]
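
As a final check, one can run both recursions side by side on a small network and confirm this relation numerically. The sketch below again uses the illustrative tanh setup; nothing about the check is specific to that choice.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
f, df = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

# Forward pass: zs[i] is z_i, zs[0] = x.
x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])
zs = [x]
for W in Ws:
    zs.append(f(W @ zs[-1]))
L = len(Ws)

# Adjoint recursion (this post): lambda_i.
lam = [None] * (L + 1)
lam[L] = zs[L] - y
for i in range(L - 1, 0, -1):
    lam[i] = Ws[i].T @ (lam[i + 1] * df(Ws[i] @ zs[i]))

# Delta recursion (the old post): delta_i.
delta = [None] * (L + 1)
delta[L] = (zs[L] - y) * df(Ws[L - 1] @ zs[L - 1])
for i in range(L - 1, 0, -1):
    delta[i] = (Ws[i].T @ delta[i + 1]) * df(Ws[i - 1] @ zs[i - 1])

# Check delta_i == lambda_i o f'(W_i z_{i-1}) for all i.
print(all(np.allclose(delta[i], lam[i] * df(Ws[i - 1] @ zs[i - 1])) for i in range(1, L + 1)))   # True
```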

So, both derivations of backpropagation are equivalent. More importantly, my derivation from 6 years ago is correct. This shows, once again, that dimension checking is probably the most important tool of them all.

Written on October 5, 2022