Levenberg-Marquardt algorithms
vs
Trust Region algorithms

Frank Vanden Berghen

IRIDIA, Université Libre de Bruxelles
50, av. Franklin Roosevelt
1050 Brussels, Belgium
Tel: +32 2 650 27 29, Fax: +32 2 650 27 15

A PDF version of this page is available here.

For an in-depth explanation and more references about this subject, see my thesis, section 2.1.

Let us assume that we want to find $x^*$, the minimum of the objective function $\mathcal{F}(x): \Re^n \rightarrow \Re$.
Let us write the second-order Taylor expansion of $\mathcal{F}$ around $x_k$:

$$ \mathcal{F}(x_k+\delta) \approx \mathcal{Q}_k(\delta) = \mathcal{F}(x_k) + g_k^t \delta + \frac{1}{2}\,\delta^t B_k \delta $$

with $g_k = \nabla \mathcal{F}(x_k)$ the gradient of $\mathcal{F}$ at $x_k$ and $B_k$ an approximation of $H_k = \nabla^2 \mathcal{F}(x_k)$, the Hessian of $\mathcal{F}$ at $x_k$.
For the moment, we will assume that $B_k := H_k$.
The unconstrained minimum $\delta_k$ of $\mathcal{Q}_k(\delta)$ satisfies:

$$ \nabla \mathcal{Q}_k(\delta_k) = g_k + B_k \delta_k = 0 \quad\Longleftrightarrow\quad B_k \delta_k = -g_k \qquad (1) $$

Equation (1) is called the equation of the Newton Step $\delta_k$.
Newton's method to find the minimum $x^*$ of $\mathcal{F}(x)$ is thus (a short Python sketch is given after the list):
  1. Set $k=0$ and $x_0 = x_{start}$.
  2. Solve $B_k \delta_k = -g_k$ (go to the minimum of the current quadratic approximation of $\mathcal{F}$).
  3. Set $x_{k+1} = x_k + \delta_k$.
  4. Increment $k$. Stop if $g_k \approx 0$; otherwise, go to step 2.
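
In Python, the loop above might look like this (the callables grad and hess, the tolerance, and the small convex test function are illustrative assumptions, not part of the original text):

    import numpy as np

    def newton_minimize(grad, hess, x_start, tol=1e-8, max_iter=100):
        # Steps 1-4 above: solve B_k delta_k = -g_k, move, repeat until g_k ~ 0.
        x = np.asarray(x_start, dtype=float)
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            delta = np.linalg.solve(hess(x), -g)   # Newton Step, equation (1)
            x = x + delta                          # x_{k+1} = x_k + delta_k
        return x

    # Illustrative convex test function F(x) = (x1 - 1)^2 + 10 (x2 + 2)^2:
    grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
    hess = lambda x: np.diag([2.0, 20.0])
    print(newton_minimize(grad, hess, [5.0, 5.0]))   # one step to [1, -2]

Since this test function is exactly quadratic, a single Newton step reaches the minimum.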
Newton's method is VERY fast: when $x_k$ is close to $x^*$ (when we are near the optimum), this method has quadratic convergence speed:

$$ \Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert^2 $$

with $\epsilon < 1$. Unfortunately, it does NOT always converge to the minimum $x^*$ of $\mathcal{F}(x)$. To have convergence, we need to be sure that $B_k$ is always positive definite, i.e. that the curvature of $\mathcal{F}(x)$ is always positive.


PROOF: $ B_k$ must be positive definite to have convergence.
We want the search direction $\delta_k$ to be a descent direction:

$$ \delta^T g < 0 \qquad (2) $$

Substituting the value of $g$ from (1) into (2), we obtain:

$$ -\delta^T B \delta < 0 \quad\Leftrightarrow\quad \delta^T B \delta > 0 \qquad (3) $$

Equation (3) must hold for every step, i.e. $B_k$ must always be positive definite.
END OF PROOF.
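
This can be checked numerically. The small Python illustration below (the 2x2 matrices and the gradient vector are made-up examples) shows that with an indefinite $B$ the step of equation (1) can violate the descent condition (2):

    import numpy as np

    def newton_step_is_descent(B, g):
        # Solve equation (1), then test the descent condition (2): delta^T g < 0.
        delta = np.linalg.solve(B, -g)
        return float(delta @ g) < 0.0

    g = np.array([1.0, 2.0])
    B_posdef = np.array([[2.0, 0.0], [0.0, 3.0]])    # positive definite
    B_indef  = np.array([[2.0, 0.0], [0.0, -3.0]])   # indefinite (negative curvature)

    print(newton_step_is_descent(B_posdef, g))   # True: a descent direction
    print(newton_step_is_descent(B_indef, g))    # False: the step points uphill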


So, we must always construct the matrix $B_k$ so that it is a positive definite approximation of $H_k$, the real Hessian matrix of $\mathcal{F}(x)$. Newton's algorithm cannot use negative curvature (when $H_k$ is negative definite) inside $\mathcal{F}(x)$. See figure 1 for an illustration of positive/negative curvature.
Figure 1: positive/negative curvature of a function $f(x): \Re \rightarrow \Re$ (figure: curvature.eps).


One possibility to solve this problem is to take $B_k = I$ ($I$ = identity matrix), which is a very bad approximation of the Hessian $H_k$ but which is always positive definite. We then simply have $\delta_k = -g_k$: we just follow the slope. This algorithm is called the ``steepest descent algorithm''. It is very slow: it only has a linear speed of convergence, $\Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert$ with $\epsilon < 1$ (the problem is that $\epsilon$ can be very close to 1, e.g. $\epsilon = 0.99$).
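
For comparison with the Newton sketch above, here is a minimal steepest descent loop in Python (the fixed step length alpha is an illustrative simplification; real implementations choose the step length with a line search):

    import numpy as np

    def steepest_descent(grad, x_start, alpha=0.02, tol=1e-8, max_iter=100000):
        # B_k = I, so delta_k = -g_k: just follow the (scaled) negative gradient.
        x = np.asarray(x_start, dtype=float)
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            x = x - alpha * g
        return x, k

    # Same illustrative test function as before: hundreds of iterations instead of one.
    grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
    x_min, iterations = steepest_descent(grad, [5.0, 5.0])
    print(x_min, iterations)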


Another possibility, if $H_k$ is not positive definite, is to use instead $B_{new,k} = H_k + \lambda I$, with $\lambda$ a sufficiently large number so that $B_{new,k}$ is positive definite. We then solve, as usual, the Newton Step equation (see equation (1)):

$$ B_{new,k}\,\delta_k = -g_k \quad\Leftrightarrow\quad (H_k + \lambda I)\,\delta_k = -g_k \qquad (4) $$

Choosing a high value for $\lambda$ has two effects:
  1. $H_k$ (inside the equation $B_{new,k} = H_k + \lambda I$) becomes negligible and the search direction approaches ``the steepest descent step''.
  2. The step size $\Vert \delta_k \Vert$ is reduced.
In reality, only the second point above is important. It can be proven that, if we impose a proper limitation on the step size, $\Vert \delta_k \Vert < \Delta_k$, we maintain global convergence even if $B_k$ is an indefinite matrix. Trust region algorithms are based on this principle ($\Delta_k$ is called the trust region radius). In a trust region algorithm, the step $\delta_k$ is:

$$ \delta_k \ \text{is the solution of}\ \ \mathcal{Q}_k(\delta_k) = \min_{\delta} \mathcal{Q}_k(\delta) \ \ \text{subject to}\ \ \Vert \delta \Vert < \Delta_k \qquad (5) $$

The old Levenberg-Marquardt algorithm uses a technique which adapts the value of $\lambda$ during the optimization. If the iteration was successful ($\mathcal{F}(x_k+\delta_k) < \mathcal{F}(x_k)$), we decrease $\lambda$ to exploit more of the curvature information contained inside $H_k$. If the previous iteration was unsuccessful ($\mathcal{F}(x_k+\delta_k) > \mathcal{F}(x_k)$), the quadratic model does not fit the real function properly; we must then rely only on the ``basic'' gradient information, so we increase $\lambda$ in order to follow the gradient more closely (``steepest descent algorithm''). This old algorithm is the basis for the explanation of the update of the trust region radius $\Delta_k$ in Trust Region Algorithms.
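
A sketch of this classical lambda-update strategy in Python (the multiplicative factor 10 and the starting value of lambda are conventional illustrative choices, not values prescribed by the original algorithm):

    import numpy as np

    def levenberg_marquardt(f, grad, hess, x_start, lam=1e-3, tol=1e-8, max_iter=200):
        x = np.asarray(x_start, dtype=float)
        for k in range(max_iter):
            g, H = grad(x), hess(x)
            if np.linalg.norm(g) < tol:
                break
            delta = np.linalg.solve(H + lam * np.eye(len(x)), -g)   # equation (4)
            if f(x + delta) < f(x):
                x, lam = x + delta, lam / 10.0   # success: trust the curvature more
            else:
                lam = lam * 10.0                 # failure: move towards steepest descent
        return x

    # Reusing the illustrative quadratic from above:
    f    = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2
    grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
    hess = lambda x: np.diag([2.0, 20.0])
    print(levenberg_marquardt(f, grad, hess, [5.0, 5.0]))   # close to [1, -2]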


For intermediate values of $\lambda$, we will thus follow a direction which is a mixture of ``the steepest descent step'' and ``the Newton Step''. This direction is based on a perturbed Hessian matrix $B_{new,k}$ and can sometimes be disastrous (the perturbation $\lambda I$ of $H_k$ has no geometrical meaning).


When negative curvature is encountered ($H_k$ negative definite), a Trust Region algorithm can still exploit it; Trust Region algorithms therefore exhibit better performance than all the Levenberg-Marquardt algorithms each time negative curvature is encountered. Unfortunately, the computation of $\delta_k$ for a Trust Region algorithm involves the constrained minimization of a quadratic subject to one non-linear constraint (see equation (5)). This is not a trivial problem to solve at all, and the algorithmic complexity of Trust Region algorithms is much higher. This explains why they are not encountered very often despite their better performance. The solution of equation (5) can nevertheless be computed very efficiently using the fast algorithm from Moré and Sorensen.
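
The following Python sketch solves equation (5) in the spirit of the Moré and Sorensen approach, but in a deliberately simplified way: it uses a full eigendecomposition of $B_k$ and a bisection on $\lambda$, and it ignores the so-called ``hard case''. A production implementation would instead use Cholesky factorizations and a safeguarded Newton iteration on $\lambda$.

    import numpy as np

    def trust_region_step(B, g, radius, tol=1e-10):
        # Approximately solve: min_d  g.d + 0.5 d.B.d  subject to  ||d|| <= radius.
        eigval, eigvec = np.linalg.eigh(B)
        gq = eigvec.T @ g                       # gradient expressed in the eigenbasis

        def step_norm(lam):
            return np.linalg.norm(gq / (eigval + lam))

        # If B is positive definite and the Newton step fits inside the region, take it.
        if eigval[0] > 0 and step_norm(0.0) <= radius:
            return np.linalg.solve(B, -g)

        # Otherwise find lam > max(0, -lambda_min) such that ||d(lam)|| = radius.
        lo = max(0.0, -eigval[0]) + 1e-12
        hi = lo + 1.0
        while step_norm(hi) > radius:           # grow the upper bracket
            hi = 2.0 * hi
        while hi - lo > tol * (1.0 + hi):       # bisection on lam
            mid = 0.5 * (lo + hi)
            if step_norm(mid) > radius:
                lo = mid
            else:
                hi = mid
        lam = 0.5 * (lo + hi)
        return eigvec @ (-gq / (eigval + lam))

    # Indefinite example: the step uses the negative curvature direction.
    B = np.array([[1.0, 0.0], [0.0, -2.0]])
    g = np.array([1.0, 0.1])
    d = trust_region_step(B, g, radius=1.0)
    print(d, np.linalg.norm(d))                 # a step of length close to 1.0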


There is one last point which must still be taken into account: how can we obtain $H_k$? Usually, we don't have an analytical expression of $H_k$, so $H_k$ must be approximated numerically. $H_k$ is usually constructed iteratively, based on information gathered from old evaluations $\mathcal{F}(x_j)$ for $j = 1,\ldots,k$. This iterative construction is typically based on quasi-Newton update formulas such as BFGS or SR1 (Symmetric Rank One), or, when no gradients are available, on interpolation of previously computed function values.
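
As an illustration, here is the BFGS update written in Python (a hedged sketch: BFGS is only one standard choice among several, and the step s and gradient difference y below are made-up numbers):

    import numpy as np

    def bfgs_update(B, s, y):
        # One BFGS update of the Hessian approximation B, using
        # s = x_{k+1} - x_k (the step) and y = g_{k+1} - g_k (the gradient change).
        # When y.s > 0, the update keeps B symmetric and positive definite.
        Bs = B @ s
        return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

    # Start from B_0 = I and refine it as new information arrives:
    B = np.eye(2)
    s = np.array([0.5, -0.2])    # illustrative step
    y = np.array([1.0, -0.8])    # illustrative gradient difference
    print(bfgs_update(B, s, y))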

 



Old Levenberg-Marquardt algorithms updated $H_k$ only on the iterations $k$ for which a good value of $\lambda$ had been found (on old computers the update of $H_k$ was very time-consuming, so it was avoided whenever possible). Modern Levenberg-Marquardt algorithms update $H_k$ at every iteration $k$, but they are still unable to follow a negative curvature direction inside the function $\mathcal{F}(x)$. The steps $\delta_k$ thus remain of poor quality compared to those of trust region algorithms.

Levenberg-Marquardt algorithms are very often used to fit a non-linear function through a set of points. Usually, to solve the fit, we try to minimize the sum of the squared errors (the residuals). In this case, the objective function is convex and has positive curvature at every point of the space. Levenberg-Marquardt algorithms therefore have no difficulty here and are usually well suited for these kinds of problems.
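
In practice, such fits are rarely coded by hand. For example, SciPy's least_squares routine offers both a Levenberg-Marquardt backend (method='lm', wrapping MINPACK) and a trust region backend (method='trf'); the small fit below uses made-up data and an illustrative exponential model:

    import numpy as np
    from scipy.optimize import least_squares

    # Fit y = a * exp(b * t) to noisy data by minimizing the sum of squared residuals.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 50)
    y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)

    def residuals(p):
        a, b = p
        return a * np.exp(b * t) - y

    fit_lm  = least_squares(residuals, x0=[1.0, -1.0], method='lm')    # Levenberg-Marquardt
    fit_trf = least_squares(residuals, x0=[1.0, -1.0], method='trf')   # trust region variant
    print(fit_lm.x, fit_trf.x)    # both close to (2.0, -1.5)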


To summarize: Trust Region Methods are an evolution of the Levenberg-Marquardt algorithms. Trust Region Methods are able to follow the negative curvature of the objective function. Levenberg-Marquardt algorithms are NOT able to do so and are thus slower.