The Multiple Polyak Stepsize Method
If you are a student of convex optimization, you may be aware of the Polyak stepsize, which is in a certain sense optimal for first-order methods. In this note I show that it can be extended somewhat, and present initial results that look promising.
The Polyak stepsize (not to be confused with Polyak's acceleration) is a principled method for setting the step size in convex problems. In its basic form it does not require a line search, though it can serve as an initial guess in one. The method requires that the user know a priori the optimal value $f^* = f(x^*)$ of the convex minimization problem, but not $x^*$ itself. Using this information, a stepsize $\eta_t$ is chosen at each step $t$ from the objective value $f_t = f(x_t)$ and its gradient $\nabla f_t$:
\begin{equation} \eta_t = \frac{f_t-f^*}{\lVert \nabla f_t \rVert^2} \end{equation}
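As a concrete illustration, here is a minimal sketch of one gradient step with the Polyak stepsize, assuming callables `f` and `grad_f` are available (the function names and the small `eps` guard are mine, purely for illustration):

```python
import numpy as np

def polyak_step(x, f, grad_f, f_star, eps=1e-12):
    """One gradient step with the Polyak stepsize.

    x      : current iterate (1-D numpy array)
    f      : callable returning the objective value at x
    grad_f : callable returning the gradient at x
    f_star : known optimal value f(x*)
    """
    g = grad_f(x)
    # Polyak stepsize: (f(x_t) - f*) / ||grad f(x_t)||^2
    eta = (f(x) - f_star) / (np.dot(g, g) + eps)
    return x - eta * g
```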
This is a very old method [1], but it has gained some recent attention in machine learning, particularly because it is often known a priori that the loss will attain the minimum possible value, $0$. This is called the interpolation regime, and why and when modern machine learning models, particularly neural networks, are able to perform well in it is an active subject of research. Given this knowledge, techniques like ALI-G [2] take this very simple stepsize and modify SGD to use a stepsize of $\min \left( \alpha\,\eta_t, \gamma \right)$ for some constants $\alpha < 1$ and $\gamma$ that control how large the stepsize can get. It has been observed that this usually causes the stepsize to rise quickly and then fall gradually as the model is fit. The constants $\alpha, \gamma$ are needed not only because we are sampling subsets of the data to compute gradients, but also because these problems are inherently nonconvex, which technically means the convergence guarantees no longer apply.
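A rough sketch of one minibatch step with this kind of capped stepsize, as described above (the callables, the zero target loss of the interpolation regime, and the `eps` guard are my assumptions, not taken from [2]):

```python
import numpy as np

def capped_polyak_sgd_step(x, batch_loss, batch_grad, alpha=0.1, gamma=1.0, eps=1e-12):
    """One SGD step with a capped Polyak stepsize on a minibatch.

    batch_loss, batch_grad : callables giving the minibatch loss and gradient at x
    alpha, gamma           : damping factor (< 1) and maximum allowed stepsize
    Assumes the interpolation regime, i.e. the target loss f* is 0.
    """
    g = batch_grad(x)
    eta = (batch_loss(x) - 0.0) / (np.dot(g, g) + eps)  # Polyak step with f* = 0
    eta = min(alpha * eta, gamma)                       # cap the stepsize
    return x - eta * g
```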
In this note, I will propose that in these interpolation-regime cases there is more information available, which allows an even better first-order stepsize method to be constructed. Specifically, I will show that there is a method that can use the optimal value attained at the shared optimum for each datum separately (assuming those values are available). The method follows simply from adding this information to the proof that the Polyak stepsize minimizes an inequality that comes directly from convexity. We assume the function being optimized is a sum of convex functions $\sum_{i=1}^{n} f_i$, and we write $f^*_i$ to denote $f_i(x^*)$. Note that there is only one $x^*$ here, the one that minimizes the overall sum, and the technique works whenever we know these values a priori, so it may apply in settings outside of the interpolation regime. The easiest source to work from is probably [3], which has nice proofs of the stepsize as well as rates, and a method to adaptively find $f^*$ when it isn't known a priori. Reproducing (3) from that paper, but with a new subscript $i$ where relevant, and writing $\nabla_{t,i} = \nabla f_i(x_t)$ and $h_{t,i} = f_i(x_t) - f^*_i$, we have:
\begin{align}
d_{t+1}^2 &= \lVert x_{t+1} - x^* \rVert^2 \nonumber \\
&= \lVert x_t - \sum_i \eta_{t,i} \nabla_{t,i} - x^* \rVert^2 \nonumber \\
&= d_t^2 - 2 \sum_i \eta_{t,i} \nabla_{t,i}^\top (x_t - x^*) + \lVert \sum_i \eta_{t,i} \nabla_{t,i} \rVert^2 \nonumber \\
&\leq d_t^2 - 2 \sum_i \eta_{t,i} h_{t,i} + \lVert \sum_i \eta_{t,i} \nabla_{t,i} \rVert^2 \label{basic}
\end{align}
where the inequality uses convexity of each $f_i$, i.e. $\nabla_{t,i}^\top (x_t - x^*) \geq f_i(x_t) - f^*_i = h_{t,i}$, together with $\eta_{t,i} \geq 0$.
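To make the resulting minimization explicit, it helps to write the bound in matrix form. Reading $\nabla_{t,*}$ as the $n \times d$ matrix whose $i$-th row is $\nabla_{t,i}^\top$, and $h_{t,*}$, $\eta_{t,*}$ as the vectors with entries $h_{t,i}$, $\eta_{t,i}$ (my interpretation of the notation used in the solution below), the right-hand side becomes a quadratic in the stepsize vector:
\begin{equation}
d_{t+1}^2 \leq d_t^2 - 2\, h_{t,*}^\top \eta_{t,*} + \eta_{t,*}^\top \nabla_{t,*} \nabla_{t,*}^\top \eta_{t,*} \nonumber
\end{equation}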
This is a quadratic minimization problem in $\eta_{t,*}$, which reduces to a system of linear equations once we set the gradient with respect to $\eta_{t,*}$ to zero. Solving that system, the solution has a familiar form:
\begin{equation} \eta_{t,*} = (\nabla_{t,*}\nabla_{t,*}^\top)^{-1}h_{t,*} \end{equation}
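Here is a minimal sketch of one step of this multiple-Polyak update, again with hypothetical `f_list`/`grad_list` callables and the known per-function optima passed in; note that for $n = 1$ it reduces exactly to the standard Polyak step above. The small ridge term is my own addition, a guard against a singular Gram matrix.

```python
import numpy as np

def multiple_polyak_step(x, f_list, grad_list, f_star_list, ridge=1e-10):
    """One step with per-function Polyak stepsizes.

    f_list, grad_list : lists of callables, one (f_i, grad f_i) pair per summand
    f_star_list       : known optimal values f_i(x*) at the shared minimizer x*
    ridge             : small regularizer in case the Gram matrix is ill-conditioned
    """
    G = np.stack([g(x) for g in grad_list])                          # n x d matrix of gradients
    h = np.array([f(x) - fs for f, fs in zip(f_list, f_star_list)])  # gaps h_{t,i}
    gram = G @ G.T                                                   # n x n Gram matrix
    # Solve (G G^T) eta = h for the vector of stepsizes
    eta = np.linalg.solve(gram + ridge * np.eye(len(h)), h)
    return x - G.T @ eta                                             # x_{t+1} = x_t - sum_i eta_i grad_i
```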
So, how well does it perform compared to the standard Polyak stepsize? I'll lead with the bad news: for problems that naturally compose into larger problems of the same class when summed (convex quadratic minimization, for example), it doesn't seem to work better than the Polyak stepsize. When trying it out on a small problem in this class with $5$ functions and $40$ variables, both methods take 27 steps to reach an accuracy one might call converged ($10^{-6}$).
Where this method shines is when we choose a funkier problem class (which is good, because neural networks are quite funky). I minimized the sum of $1$ quadratic and $4$ exponentiated quadratics (composing with $\exp$ preserves convexity, since $\exp$ is convex and nondecreasing) and found that neither method converged. Within the first $10$ steps, however, this stepsize reaches an objective value roughly $75\%$ above the target (8.8k vs 5k), which it reduces slightly over the next $50$ steps (7.7k); unfortunately, after that a numerical problem appears and the method wildly diverges, reaching a value above 100k by iteration 100. That is a problem, but compare it to how the Polyak stepsize does: after $10$ steps its loss is 44.8k, after $60$ it is 42.0k, and after $400$ it is 17.6k. Not too shabby for our method.
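The exact problem instances are not reproduced here, but for concreteness, a sketch of the kind of exponentiated-quadratic summand described above might look like the following (my own construction, purely illustrative):

```python
import numpy as np

def make_exp_quadratic(A, b, c=0.0):
    """Build f(x) = exp(0.5 x^T A x + b^T x + c) and its gradient.

    With A symmetric positive semidefinite, the inner quadratic is convex, and
    composing with exp (convex and nondecreasing) keeps the result convex.
    """
    def f(x):
        return np.exp(0.5 * x @ A @ x + b @ x + c)

    def grad(x):
        # chain rule: grad f(x) = f(x) * (A x + b)
        return f(x) * (A @ x + b)

    return f, grad
```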
Interestingly, the Polyak stepsize ends up roughly $1/3$ closer to $x^*$ in $x$-space than Multiple Polyak (ours) does. That could be because Multiple Polyak is better able to exploit curvature in the loss, which lets it reduce the loss faster at the expense of slower progress in parameter space; if the latter is what matters, this method is probably not worth using. It also means that either the bound, which is on the distance to $x^*$, is very loose in this case, so optimizing it is not actually producing gains there, or numerics are causing problems even earlier than the point at which the algorithm fails outright.
In conclusion, I've introduced an interesting new method for automatically determining the learning rate when the aim is to find a solution in the interpolation regime. In the next update I will show how this can be plugged into an adaptive method (a variant of Adam) and used to train neural networks.
[1] B. Polyak, Introduction to Optimization. Translations Series in Mathematics and Engineering.
[2] Berrada et al., Training Neural Networks for and by Interpolation.
[3] E. Hazan and S. Kakade, Revisiting the Polyak Step Size.