Nesterov


I gave a presentation in a group meeting briefly reviewing the lower complexity bound of first-order convex optimization, and how Nesterov proceeded to match that bound using the estimation sequence; the slides are here.

Consider an unconstrained optimization problem: $$ \min_{x \in \mathbb{R}^n} f(x) . $$ Here, \(f \in \mathcal{C}^1\) is a convex, $L$-Lipschitz smooth function (i.e., its gradient is $L$-Lipschitz). We can naturally solve this problem with first-order methods, whose iterates satisfy:

\[ x_{k} \in x_{0} + \operatorname{Span} \left \{f^{\prime} \left (x_{0} \right ), \ldots, f^{\prime} \left (x_{k-1} \right )\right \}. \]
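
As a toy illustration of this class, plain gradient descent is its simplest member: every iterate moves along the most recent gradient and therefore stays in the span above. A minimal sketch in NumPy with the standard $1/L$ step size (the quadratic test problem is my own made-up example):

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_iters=100):
    """Plain gradient descent with step size 1/L.

    Each iterate x_k lies in x_0 + span{f'(x_0), ..., f'(x_{k-1})},
    so this is a first-order method in the sense above; its error
    decays as O(1/k) for convex L-smooth f.
    """
    x = x0.copy()
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_f(x)
    return x

# Convex quadratic f(x) = 0.5 x^T A x - b^T x, with gradient A x - b
# and smoothness constant L = ||A||_2.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x_min = gradient_descent(lambda x: A @ x - b, np.zeros(2), np.linalg.norm(A, 2))
```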

The question is what their fundamental limit is, and how to achieve it.

For the fundamental limit, Nesterov constructed a worst-case quadratic function whose gradient at any query point reveals at most one new coordinate, so that once the dimensionality of the problem becomes large, any first-order method makes slow progress: after $k$ iterations (with $k$ less than the dimension), the error is lower bounded by $\Omega(1/k^2)$.
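
For reference, the worst-case function (up to notation, from Nesterov's *Introductory Lectures*) is the chain-like quadratic below; because its Hessian is tridiagonal, each gradient query can extend the support of the iterates by at most one coordinate, which is exactly what makes high-dimensional instances slow: $$ f_k(x) = \frac{L}{4} \left( \frac{1}{2} \Big[ x_1^2 + \sum_{i=1}^{k-1} \left( x_i - x_{i+1} \right)^2 + x_k^2 \Big] - x_1 \right) , $$ and any method of the above form satisfies $f(x_k) - f^{\star} \geq \frac{3 L \left \| x_0 - x^{\star} \right \|^2}{32 (k+1)^2}$ whenever $k \leq (n-1)/2$.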

He then came up with an algorithm to match the lower bound. Despite the complexity of the algorithm itself, the idea is not that complicated:

We use a sequence of weighted quadratic functions $\{\lambda_k, \phi_k \}$ (the estimation sequence) to approximate the function at each point $x_k$, with two properties:

  • Each one is dominated by a convex combination of the true function and the initial estimate. $$ \phi_{k}(x) \leq (1-\lambda_{k} ) f(x)+\lambda_{k} \phi_{0}(x) $$
  • The weights vanish as the algorithm progresses, $$ \lambda_k \to 0 , $$ so the $\phi_k$ become (asymptotic) lower bounds on the true function.

So instead of minimizing the function itself, we minimize $\{\phi_k \}$ at each step, which is easier because each $\phi_k$ is quadratic, yet still matches the lower bound!
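
To see why, suppose the method additionally maintains $f(x_k) \leq \phi_k^{\star} \equiv \min_x \phi_k(x)$ (which Nesterov's updates enforce by construction). Evaluating the first property at a minimizer $x^{\star}$ gives $$ f(x_k) \leq \phi_k^{\star} \leq \phi_k (x^{\star}) \leq (1-\lambda_k) f^{\star} + \lambda_k \phi_0 (x^{\star}) , $$ hence $f(x_k) - f^{\star} \leq \lambda_k \left( \phi_0(x^{\star}) - f^{\star} \right)$, and choosing the updates so that $\lambda_k = \mathcal{O}(1/k^2)$ yields the accelerated rate.

For concreteness, here is a minimal sketch of a common constant-step accelerated variant (the FISTA-style momentum schedule rather than the original estimation-sequence bookkeeping; `grad_f`, `x0`, `L` as in the earlier sketch):

```python
import numpy as np

def accelerated_gd(grad_f, x0, L, num_iters=100):
    """Constant-step accelerated gradient descent for convex L-smooth f.

    A gradient step is taken from an extrapolated point y_k, and the
    momentum weight t_k follows the classical recurrence; the error
    decays as O(1/k^2), matching the lower bound.
    """
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        x = y - (1.0 / L) * grad_f(y)                 # gradient step at y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```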

There are, of course, a number of alternative interpretations of this method. One I found quite intriguing is the dual perspective1, which views every accelerated gradient step as an optimal coupling of a primal (gradient) step and a dual (mirror) step.
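
For flavor, here is a minimal Euclidean rendering of that coupling (my own sketch: the paper works with a general mirror map and chooses the coupling weight $\tau$ and mirror step size $\alpha$ per iteration to recover the $\mathcal{O}(1/k^2)$ rate, whereas I fix them for brevity):

```python
import numpy as np

def linear_coupling(grad_f, x0, L, alpha=0.1, tau=0.5, num_iters=100):
    """Euclidean sketch of linear coupling (Allen-Zhu & Orecchia).

    y tracks the gradient-descent (primal) sequence, z tracks the
    mirror-descent (dual) sequence; each gradient is queried at a
    convex combination of the two.
    """
    y = x0.copy()  # primal sequence
    z = x0.copy()  # dual sequence (Euclidean mirror map)
    for _ in range(num_iters):
        x = tau * z + (1.0 - tau) * y  # coupled query point
        g = grad_f(x)
        y = x - (1.0 / L) * g          # primal: gradient step
        z = z - alpha * g              # dual: mirror step
    return y
```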


  1. Allen-Zhu, Z. and Orecchia, L., 2014. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. ↩︎