
Trust Region and Line-search Methods.

In continuous optimization, there is basically a choice between two kinds of algorithms: line-search methods and trust region methods. In this section we will motivate the choice of a trust region algorithm for unconstrained optimization. We will see that trust region algorithms are a natural evolution of the line-search algorithms.

Conventions

${\cal F}(x): \Re^n \rightarrow \Re$ : the objective function. We search for its minimum.
$x^*$ : the optimum, the point we search for.
$\displaystyle g_i=\frac{\partial {\cal F}}{\partial x_i}$ : $g$ is the gradient of ${\cal F}$.
$\displaystyle H_{i,j}=\frac{\partial^2 {\cal F}}{\partial x_i \partial x_j}$ : $H$ is the Hessian matrix of ${\cal F}$.
$H_k=H(x_k)$ : the Hessian matrix of ${\cal F}$ at the point $x_k$.
$B_k=B(x_k)$ : the current approximation of the Hessian matrix of ${\cal F}$ at the point $x_k$. If not stated explicitly, we will always assume $B=H$.
$B^*=B(x^*)$ : the Hessian matrix at the optimum point.
${\cal F}(x+\delta ) \approx {\cal Q}(\delta) = {\cal F}(x)+g^t \delta+\frac{1}{2}\delta^t B \delta$ : ${\cal Q}$ is the quadratic approximation of ${\cal F}$ around $x$.

All vectors are column vectors.
In the following sections we will also use the following conventions.
$k$ : the iteration index of the algorithm.
$s_k$ : the search direction. Conceptually, it is only a direction, not a length.
$\delta_k=\alpha_k s_k = x_{k+1}-x_k$ : the step performed at iteration $k$.
$\alpha_k$ : the length of the step performed at iteration $k$.
${\cal F}_k={\cal F}(x_k)$
$g_k=g(x_k)$
$h_k$ : $\Vert h_k\Vert=\Vert x_k-x^*\Vert$, the distance from the current point to the optimum.

General principle

The outline of a simple optimization algorithm is:
  1. Search for a descent direction $ s_k$ around $ x_k$ ($ x_k$ is the current position).
  2. In the direction $ s_k$, search the length $ \alpha_k$ of the step.
  3. Make a step in this direction: $ x_{k+1}=x_k+\delta_k$ (with $ \delta_k=\alpha_k s_k$)
  4. Increment $k$. Stop if $g_k\approx 0$; otherwise, go to step 1.
A simple algorithm based on this template is the ``steepest descent algorithm'':

$\displaystyle s_k=-g_k$ (2.1)

We choose $\alpha=1 \Longrightarrow \delta_k=s_k=-g_k$ and therefore:

$\displaystyle x_{k+1}=x_k-g_k$ (2.2)

This is a very slow algorithm: it converges only linearly (see the next section about convergence speed).
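
As an illustration, here is a minimal sketch of the steepest descent iteration in Python. The quadratic test function, the step length and the tolerance are our own illustrative choices, not part of the text; note that with $\alpha=1$ the iteration would diverge on this particular function.

\begin{verbatim}
import numpy as np

def steepest_descent(grad, x0, alpha, tol=1e-8, max_iter=1000):
    """Steepest descent: s_k = -g_k (Eq. 2.1), x_{k+1} = x_k + alpha*s_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop if g_k ~ 0
            return x, k
        x = x - alpha * g             # delta_k = -alpha * g_k
    return x, max_iter

# Example: F(x) = 0.5 x^T A x with A = diag(1, 10), minimum at the origin.
# Convergence requires alpha < 2/10 here; alpha = 1 would diverge.
A = np.diag([1.0, 10.0])
x_min, iters = steepest_descent(lambda x: A @ x, [1.0, 1.0], alpha=0.15)
print(x_min, iters)   # slow, linear convergence: over 100 iterations
\end{verbatim}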

Notion of Speed of convergence.

linear convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert$
superlinear convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon_k \; \Vert x_k - x^* \Vert$ with $\epsilon_k \rightarrow 0$
quadratic convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert^2$

with $0 \leq \epsilon < 1$.
These convergence rates are sorted by speed, the slowest first.
Reaching quadratic convergence speed is very difficult.
Superlinear convergence can also be written in the following way:

$\displaystyle \lim_{k \rightarrow \infty } \frac{ \Vert x_{k+1} - x^* \Vert }{ \Vert x_k - x^* \Vert } = 0$ (2.3)
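
These definitions can be checked numerically: the ratios $\Vert x_{k+1} - x^* \Vert / \Vert x_k - x^* \Vert$ stay constant for a linearly convergent sequence and tend to zero for a superlinearly convergent one (Equation 2.3). A small sketch, using two artificial error sequences of our own choosing:

\begin{verbatim}
def convergence_ratios(errors):
    """Ratios e_{k+1}/e_k of the distances to the optimum."""
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]

linear    = [0.5 ** k        for k in range(8)]  # e_{k+1} = 0.5 * e_k
quadratic = [0.5 ** (2 ** k) for k in range(5)]  # e_{k+1} = e_k ** 2

print(convergence_ratios(linear))     # constant 0.5            -> linear
print(convergence_ratios(quadratic))  # 0.5, 0.25, 0.0625, ... -> 0: superlinear
\end{verbatim}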

A simple line-search method: Newton's method.

We will use $\alpha=1 \Longrightarrow \delta_k=s_k$.
We will use the curvature information contained in the Hessian matrix $B$ of ${\cal F}$ to find the descent direction. Let us write the second-order Taylor expansion of ${\cal F}$ around $x$:

${\cal F}(x+\delta ) \approx {\cal Q}(\delta) = {\cal F}(x)+g^t \delta+\frac{1}{2}\delta^t B \delta$

The unconstrained minimum of ${\cal Q}(\delta)$ satisfies:

$\displaystyle \nabla {\cal Q}(\delta_k) = g_k + B_k \delta_k = 0 \quad \Longleftrightarrow \quad B_k \delta_k = -g_k$ (2.4)

Equation 2.4 is called the Newton step equation; its solution $\delta_k$ is the Newton step.
Newton's method is thus:
  1. Solve $B_k \delta_k=-g_k$ (go to the minimum of the current quadratic approximation of ${\cal F}$).
  2. Set $x_{k+1}=x_k+\delta_k$.
  3. Increment $k$. Stop if $g_k\approx 0$; otherwise, go to step 1.
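
A minimal sketch of this method in Python. The Rosenbrock test function, with its hand-written gradient and Hessian, is our own choice of example; a pure Newton iteration like this one is only guaranteed to behave well close to the optimum.

\begin{verbatim}
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton's method: solve B_k delta_k = -g_k (Eq. 2.4), take a full step."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stop if g_k ~ 0
            return x, k
        delta = np.linalg.solve(hess(x), -g)  # Newton step
        x = x + delta                         # full step: alpha = 1
    return x, max_iter

# Example: Rosenbrock function F(x) = 100(x2 - x1^2)^2 + (1 - x1)^2.
grad = lambda x: np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                            200*(x[1] - x[0]**2)])
hess = lambda x: np.array([[1200*x[0]**2 - 400*x[1] + 2, -400*x[0]],
                           [-400*x[0],                    200.0]])
print(newton(grad, hess, [-1.2, 1.0]))  # converges to (1, 1)
\end{verbatim}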
In more complex line-search methods, we run a one-dimensional search in the direction $s_k$ to find a value $\alpha_k>0$ which ``sufficiently reduces'' the objective function (see Section 13.1.2 for the conditions on $\alpha$: the Wolfe conditions; a sufficient reduction is obtained if these conditions are respected).

Near the optimum, we must always take $\alpha=1$ to allow a ``full step of one'' to occur: see Section 2.1.6 for more information.

Newton's method has quadratic convergence: see Section 13.1.1 for proof.


$B_k$ must be positive definite for line-search methods.

In line-search methods, we want the search direction $\delta_k$ to be a descent direction:

$\displaystyle \delta^T g < 0$ (2.5)

Taking the value of $g$ from Equation 2.4 and substituting it into Equation 2.5, we obtain:

$\displaystyle - \delta^T B \delta < 0 \quad \Longleftrightarrow \quad \delta^T B \delta > 0$ (2.6)

Equation 2.6 says that $B_k$ must always be positive definite. In line-search methods, we must therefore always construct the $B$ matrix so that it is positive definite.
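
A standard way to test whether a given matrix $B$ is positive definite is to attempt a Cholesky factorization, which exists if and only if the matrix is positive definite. A minimal sketch (the example matrices are our own):

\begin{verbatim}
import numpy as np

def is_positive_definite(B):
    """B = L L^T (Cholesky) exists iff the symmetric matrix B is positive definite."""
    try:
        np.linalg.cholesky(B)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False
\end{verbatim}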

One possibility is to take $B=I$ ($I$ = identity matrix), which is a very bad approximation of the Hessian $H$ but which is always positive definite. We then recover the ``steepest descent algorithm''.

Another possibility, if $B$ is not positive definite, is to use instead $B_{new}=B+\lambda I$ with $\lambda$ a very large number, chosen such that $B_{new}$ is positive definite. Then we solve, as usual, the Newton step equation (see Equation 2.4): $B_{new} \delta_k=-g_k$. Choosing a high value for $\lambda$ has two effects:
  1. $B$ becomes negligible and we find, as search direction, ``the steepest descent step''.
  2. The step size $\Vert \delta_k \Vert$ is reduced.
In fact, only the second point is important. It can be proved that, if we impose a proper limitation on the step size, $\Vert \delta_k \Vert < \Delta_k$, we maintain global convergence even if $B$ is an indefinite matrix ($\Delta_k$ is called the trust region radius). Trust region algorithms are based on this principle.

The old Levenberg-Marquardt algorithm uses a technique which adapts the value of $\lambda$ during the optimization. If the iteration was successful, we decrease $\lambda$ to exploit more fully the curvature information contained in $B$. If the previous iteration was unsuccessful, the quadratic model does not fit the real function properly, and we must rely only on the ``basic'' gradient information: we increase $\lambda$ in order to follow the gradient more closely (``steepest descent algorithm'').
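
The following sketch illustrates this adaptation scheme. The factor 10 used to update $\lambda$, the initial $\lambda$ and the quadratic test function are conventional choices of ours; the text does not prescribe them.

\begin{verbatim}
import numpy as np

def lm_iteration(f, grad, hess, x, lam):
    """One Levenberg-Marquardt-style iteration: solve (B + lambda*I) delta = -g,
    then decrease lambda on success and increase it on failure."""
    g, B = grad(x), hess(x)
    delta = np.linalg.solve(B + lam * np.eye(len(x)), -g)
    if f(x + delta) < f(x):       # success: trust the quadratic model more
        return x + delta, lam / 10.0
    return x, lam * 10.0          # failure: move toward steepest descent

# Example on F(x) = 0.5 x^T A x with A = diag(1, 10):
A = np.diag([1.0, 10.0])
f, grad, hess = lambda x: 0.5 * x @ A @ x, lambda x: A @ x, lambda x: A
x, lam = np.array([1.0, 1.0]), 1.0
for k in range(30):
    x, lam = lm_iteration(f, grad, hess, x, lam)
print(x, lam)   # x is near the origin; lambda has shrunk toward pure Newton
\end{verbatim}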

For intermediate values of $\lambda$, we thus follow a direction which is a mixture of the ``steepest descent step'' and the ``Newton step''. This direction is based on a perturbed Hessian matrix and can sometimes be disastrous (the perturbation $\lambda I$ on $B$ has no geometrical meaning).

This old algorithm is the basis for the explanation of the update of the trust region radius in trust region algorithms. However, in trust region algorithms, the direction $\delta_k$ of the next point to evaluate is perfectly controlled.

To summarize: the Levenberg-Marquardt perturbation controls the step only indirectly, through $\lambda$, whereas trust region algorithms control the step $\delta_k$ directly, through the bound $\Vert \delta_k \Vert < \Delta_k$.


Why Newton's method is crucial: the Dennis-Moré theorem

The Dennis-Moré theorem is a very important theorem. It states that a non-linear optimization algorithm converges superlinearly on the condition that, asymptotically, the steps taken are equal to the Newton steps $A_k \delta_k=-g_k$, where $A_k$ is not the Hessian matrix of ${\cal F}$ but must satisfy $A_k \rightarrow B_k$ (and $B_k \rightarrow H_k$). This is a very general result, applicable also to constrained optimization, non-smooth optimization, etc.

Dennis-Moré theorem: let the steps of the algorithm satisfy

$\displaystyle A_k \delta_k = -g_k$ (2.7)

and let $x_k \rightarrow x^*$ with $g(x^*)=0$, $H$ Lipschitz continuous and $H^*=H(x^*)$ invertible. If

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert (A_k-H^*)\delta_k\Vert}{\Vert\delta_k\Vert}=0$ (2.8)

then the convergence of $x_k$ towards $x^*$ is superlinear.


Definition:
A function $g(x)$ is said to be Lipschitz continuous if there exists a constant $\gamma$ such that:

$\displaystyle \Vert g(x)-g(y)\Vert \leq \gamma \Vert x-y \Vert$ (2.9)


To make the proof, we first establish two lemmas.

Lemma 1.

Lemma 1: if $H$ is Lipschitz continuous with constant $\gamma$ (see Equation 2.9), then:

$\displaystyle \Vert g(v)-g(u)-H(x)(v-u)\Vert \leq \frac{\gamma}{2} \Vert v-u\Vert ( \Vert v-x\Vert + \Vert u-x\Vert )$ (2.10)

The well-known Riemann integral gives:

$\displaystyle {\cal F}(x+\delta)-{\cal F}(x)= \int_x^{x+\delta} g(z)\, dz$

The extension to the multivariate case is straightforward:

$\displaystyle g(x+\delta)-g(x)= \int_x^{x+\delta} H(z) dz $

After a change of the integration variable, $z=x+\delta t \Rightarrow dz= \delta\, dt$, we obtain:

$\displaystyle g(x+\delta)-g(x)= \int_0^{1} H(x+t \delta) \delta dt $

We subtract $H(x)\delta$ from both sides:

$\displaystyle g(x+\delta)-g(x)-H(x)\delta = \int_0^{1} ( H(x+t \delta) - H(x)) \delta\, dt$ (2.11)

Using the fact that $\displaystyle \Vert \int_0^1 H(t) dt \Vert \leq \int_0^1 \Vert H(t) \Vert dt$ and the Cauchy-Schwarz inequality $\Vert a \; b\Vert \leq \Vert a\Vert \Vert b\Vert$, we can write:

$\displaystyle \Vert g(x+\delta)-g(x)-H(x)\delta \Vert \leq \int_0^{1} \Vert H(x+t\delta) - H(x) \Vert \, \Vert\delta\Vert \, dt$


Using the fact that the Hessian $H$ is Lipschitz continuous (see Equation 2.9):

$\displaystyle \Vert g(x+\delta)-g(x)-H(x)\delta \Vert \leq \int_0^{1} \gamma \Vert t \delta \Vert \, \Vert \delta \Vert \, dt \leq \gamma \Vert \delta \Vert^2 \int_0^{1} t \; dt \leq \frac{\gamma}{2} \Vert\delta\Vert^2$ (2.12)

Lemma 1 can be directly deduced from Equation 2.12.

Lemma 2.

Lemma 2: if $H$ is Lipschitz continuous with constant $\gamma$ and $H(x)^{-1}$ exists, then there exist $\epsilon > 0$ and $\alpha > 0$ such that:

$\displaystyle \Vert g(v)-g(u)\Vert \geq \alpha \Vert v-u\Vert$ (2.13)

for all $v$, $u$ which respect $\max( \Vert v-x\Vert, \Vert u-x\Vert) \leq \epsilon$.

If we write the triangle inequality $\Vert a+b\Vert \leq \Vert a\Vert+\Vert b\Vert$ with $a=g(v)-g(u)$ and $b=H(x)(v-u)+g(u)-g(v)$, we obtain:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Vert H(x)(v-u) \Vert - \Vert g(v)-g(u)-H(x)(v-u)\Vert$


Using the Cauchy-Schwarz inequality $\displaystyle \Vert a \; b\Vert \leq \Vert a\Vert \Vert b\Vert \Leftrightarrow \Vert b\Vert \geq \frac{\Vert a \; b\Vert}{\Vert a\Vert}$ with $a=H(x)^{-1}$ and $b=H(x)(v-u)$, and using Equation 2.10:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Bigg[ \frac{1}{\Vert H(x)^{-1} \Vert} - \frac{\gamma}{2} (\Vert v-x\Vert+\Vert u-x\Vert) \Bigg] \Vert v-u\Vert$


Using the hypothesis that $\max( \Vert v-x\Vert, \Vert u-x\Vert) \leq \epsilon$:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Bigg[ \frac{1}{\Vert H(x)^{-1} \Vert} - \gamma \epsilon \Bigg] \Vert v-u\Vert$


Thus, if $\displaystyle \epsilon < \frac{1}{\gamma \Vert H(x)^{-1} \Vert}$, the lemma is proven with $\displaystyle \alpha=\frac{1}{\Vert H(x)^{-1} \Vert} - \gamma \epsilon$.

Proof of the Dennis-Moré theorem.

We first write the ``step'' equation (Equation 2.7):

$\displaystyle A_k \delta_k = -g_k$
$\displaystyle 0 = A_k \delta_k + g_k$
$\displaystyle 0 = (A_k-H^*) \delta_k + g_k + H^* \delta_k$
$\displaystyle -g_{k+1} = (A_k-H^*) \delta_k + [\, -g_{k+1} + g_k + H^* \delta_k \,]$
$\displaystyle \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \leq \frac{\Vert(A_k-H^*)\delta_k\Vert}{\Vert \delta_k \Vert}+ \frac{\Vert -g_{k+1} + g_k + H^*\delta_k \Vert}{\Vert \delta_k \Vert}$

Using Lemma 1 (Equation 2.10) with $x=x^*$, $v=x_{k+1}$, $u=x_k$, and defining $e_k=x_k-x^*$, we obtain:

$\displaystyle \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \leq \frac{\Vert(A_k-H^*)\delta_k\Vert}{\Vert \delta_k \Vert}+ \frac{\gamma}{2} (\Vert e_k\Vert+\Vert e_{k+1}\Vert)$

Using $ \lim_{k \rightarrow \infty} \Vert e_k\Vert=0$ and Equation 2.8:

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} = 0$ (2.14)

Using Lemma 2 (Equation 2.13), there exist $\alpha>0$ and $k_0 \geq 0$ such that, for all $k>k_0$, we have (using $g(x^*)=0$):

$\displaystyle \Vert g_{k+1}\Vert= \Vert g_{k+1} - g(x^*) \Vert \geq \alpha \Vert e_{k+1}\Vert$ (2.15)

Combining Equations 2.14 and 2.15, and using $\Vert\delta_k\Vert \leq \Vert e_k\Vert + \Vert e_{k+1}\Vert$:

$\displaystyle 0 = \lim_{k \rightarrow \infty} \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \geq \lim_{k \rightarrow \infty} \alpha \frac{\Vert e_{k+1}\Vert}{\Vert\delta_k\Vert} \geq \lim_{k \rightarrow \infty} \alpha \frac{\Vert e_{k+1}\Vert}{\Vert e_k\Vert+\Vert e_{k+1}\Vert}= \lim_{k \rightarrow \infty} \frac{\alpha r_k}{1+r_k}$ (2.16)

where we have defined $\displaystyle r_k=\frac{\Vert e_{k+1}\Vert}{\Vert e_k\Vert}$. This implies that:

$\displaystyle \lim_{k \rightarrow \infty} r_k = 0$ (2.17)

which completes the proof of superlinear convergence.
Since $g(x)$ is Lipschitz continuous, it is easy to show that the Dennis-Moré theorem remains true if Equation 2.8 is replaced by:

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert (A_k-H_k)\delta_k\Vert}{\Vert\delta_k\Vert} = \lim_{k \rightarrow \infty} \frac{\Vert (A_k-B_k)\delta_k\Vert}{\Vert\delta_k\Vert}=0$ (2.18)


This means that, if $A_k \rightarrow B_k$, then we must have $\alpha_k$ (the length of the steps) $\rightarrow 1$ to have superlinear convergence. In other words, to have superlinear convergence, the ``steps'' of a secant method must converge in magnitude and direction to the Newton steps (see Equation 2.4) at the same points. A step with $\alpha_k=1$ is called a ``full step of one''. To obtain superlinear convergence, it is necessary to allow a ``full step of one'' to take place when we are near the optimum. The Wolfe conditions (see Equations 13.4 and 13.5 in Section 13.1.2) always allow a ``full step of one''.
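
As a concrete illustration of the theorem, consider the one-dimensional secant method, a sketch of our own: the approximations $A_k = (g(x_k)-g(x_{k-1}))/(x_k-x_{k-1})$ converge to the true second derivative, the steps converge to the Newton steps, and the convergence is therefore superlinear.

\begin{verbatim}
def secant(g, x0, x1, tol=1e-12, max_iter=50):
    """1-D secant method: A_k approximates H(x_k) from two gradient values,
    and each step solves A_k * delta_k = -g(x_k) with a full step of one."""
    for k in range(max_iter):
        g0, g1 = g(x0), g(x1)
        if abs(g1) < tol:
            return x1, k
        A = (g1 - g0) / (x1 - x0)   # A_k -> H(x*): Dennis-More condition holds
        x0, x1 = x1, x1 - g1 / A    # full step: x_{k+1} = x_k - g_k / A_k
    return x1, max_iter

# Minimize F(x) = x^4/4 + x^2, so g(x) = x^3 + 2x and x* = 0.
print(secant(lambda x: x**3 + 2*x, 1.0, 0.9))  # superlinear (order ~1.618)
\end{verbatim}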

When we deal with constraints, it is sometimes not possible to take a ``full step of one'' because we ``bump'' into the frontier of the feasible space. In such cases, algorithms like FSQP try to ``bend'' or modify slightly the search direction to allow a ``full step of one'' to occur.

This is also why the ``trust region radius'' $ \Delta_k $ must be large near the optimum to allow a ``full step of one'' to occur.