
Trust Region and Line-search Methods.

In continuous optimization, there is basically a choice between two kinds of algorithms: line-search methods and trust region methods. In this section we will motivate the choice of a trust region algorithm for unconstrained optimization. We will see that trust region algorithms are a natural evolution of the line-search algorithms.

Conventions

${\cal F}(x): \Re^n \rightarrow \Re$ : the objective function. We search for its minimum.
$x^*$ : the optimum, the point we search for.
$\displaystyle g_i=\frac{\partial {\cal F}}{\partial x_i}$ : $g$ is the gradient of ${\cal F}$.
$\displaystyle H_{i,j}=\frac{\partial^2 {\cal F}}{\partial x_i \partial x_j}$ : $H$ is the Hessian matrix of ${\cal F}$.
$H_k=H(x_k)$ : the Hessian matrix of ${\cal F}$ at the point $x_k$.
$B_k=B(x_k)$ : the current approximation of the Hessian matrix of ${\cal F}$ at the point $x_k$. If not stated explicitly, we will always assume $B=H$.
$B^*=B(x^*)$ : the Hessian matrix at the optimum point.
${\cal F}(x+\delta ) \approx {\cal Q}(\delta) = {\cal F}(x)+g^t \delta+\frac{1}{2}\delta^t B \delta$ : ${\cal Q}$ is the quadratic approximation of ${\cal F}$ around $x$.

All vectors are column vectors.
In the following sections we will also use the following conventions.
$k$ : the iteration index of the algorithm.
$s_k$ : the search direction. Conceptually, it is only a direction, not a length.
$\delta_k=\alpha_k s_k = x_{k+1}-x_k$ : the step performed at iteration $k$.
$\alpha_k$ : the length of the step performed at iteration $k$.
${\cal F}_k={\cal F}(x_k)$
$g_k=g(x_k)$
$h_k$ : $\Vert h_k\Vert=\Vert x_k-x^*\Vert$, the distance from the current point to the optimum.

General principle

The outline of a simple optimization algorithm is:
  1. Search for a descent direction $ s_k$ around $ x_k$ ($ x_k$ is the current position).
  2. In the direction $ s_k$, search the length $ \alpha_k$ of the step.
  3. Make a step in this direction: $ x_{k+1}=x_k+\delta_k$ (with $ \delta_k=\alpha_k s_k$)
  4. Increment $k$. Stop if $g_k\approx 0$; otherwise, go to step 1.
A simple algorithm based on this template is the ``steepest descent algorithm'':

$\displaystyle s_k=-g_k$ (2.1)

We choose $\alpha=1 \Longrightarrow \delta_k=s_k=-g_k$ and therefore:

$\displaystyle x_{k+1}=x_k-g_k$ (2.2)

This is a very slow algorithm: it converges only linearly (see the next section about convergence speed).
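
As an illustration, here is a minimal sketch of the steepest descent iteration in Python. The quadratic test function, the step length and the tolerance are our own illustrative choices, not part of the text; note that with $\alpha=1$ the iteration would diverge on this particular function.

\begin{verbatim}
import numpy as np

def steepest_descent(grad, x0, alpha, tol=1e-8, max_iter=1000):
    """Steepest descent: s_k = -g_k (Eq. 2.1), x_{k+1} = x_k + alpha*s_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop if g_k ~ 0
            return x, k
        x = x - alpha * g             # delta_k = -alpha * g_k
    return x, max_iter

# Example: F(x) = 0.5 x^T A x with A = diag(1, 10), minimum at the origin.
# Convergence requires alpha < 2/10 here; alpha = 1 would diverge.
A = np.diag([1.0, 10.0])
x_min, iters = steepest_descent(lambda x: A @ x, [1.0, 1.0], alpha=0.15)
print(x_min, iters)   # slow, linear convergence: over 100 iterations
\end{verbatim}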

Notion of Speed of convergence.

linear convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert$
superlinear convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon_k \; \Vert x_k - x^* \Vert$ with $\epsilon_k \rightarrow 0$
quadratic convergence : $\Vert x_{k+1} - x^* \Vert < \epsilon \; \Vert x_k - x^* \Vert^2$

with $0 \leq \epsilon < 1$.
These convergence rates are sorted by speed, the slowest first.
Reaching quadratic convergence speed is very difficult.
Superlinear convergence can also be written in the following way:

$\displaystyle \lim_{k \rightarrow \infty } \frac{ \Vert x_{k+1} - x^* \Vert }{ \Vert x_k - x^* \Vert } = 0$ (2.3)
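
These definitions can be checked numerically: the ratios $\Vert x_{k+1} - x^* \Vert / \Vert x_k - x^* \Vert$ stay constant for a linearly convergent sequence and tend to zero for a superlinearly convergent one (Equation 2.3). A small sketch, using two artificial error sequences of our own choosing:

\begin{verbatim}
def convergence_ratios(errors):
    """Ratios e_{k+1}/e_k of the distances to the optimum."""
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]

linear    = [0.5 ** k        for k in range(8)]  # e_{k+1} = 0.5 * e_k
quadratic = [0.5 ** (2 ** k) for k in range(5)]  # e_{k+1} = e_k ** 2

print(convergence_ratios(linear))     # constant 0.5            -> linear
print(convergence_ratios(quadratic))  # 0.5, 0.25, 0.0625, ... -> 0: superlinear
\end{verbatim}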

A simple line-search method: Newton's method.

We will use $\alpha=1 \Longrightarrow \delta_k=s_k$.
We will use the curvature information contained in the Hessian matrix $B$ of ${\cal F}$ to find the descent direction. Let us write the second-order Taylor expansion of ${\cal F}$ around $x$:

${\cal F}(x+\delta ) \approx {\cal Q}(\delta) = {\cal F}(x)+g^t \delta+\frac{1}{2}\delta^t B \delta$

The unconstrained minimum of ${\cal Q}(\delta)$ satisfies:

$\displaystyle \nabla {\cal Q}(\delta_k) = g_k + B_k \delta_k = 0 \quad \Longleftrightarrow \quad B_k \delta_k = -g_k$ (2.4)

Equation 2.4 is called the Newton step equation; its solution $\delta_k$ is the Newton step.
Newton's method is thus:
  1. Solve $B_k \delta_k=-g_k$ (go to the minimum of the current quadratic approximation of ${\cal F}$).
  2. Set $x_{k+1}=x_k+\delta_k$.
  3. Increment $k$. Stop if $g_k\approx 0$; otherwise, go to step 1.
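
A minimal sketch of this method in Python. The Rosenbrock test function, with its hand-written gradient and Hessian, is our own choice of example; a pure Newton iteration like this one is only guaranteed to behave well close to the optimum.

\begin{verbatim}
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton's method: solve B_k delta_k = -g_k (Eq. 2.4), take a full step."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stop if g_k ~ 0
            return x, k
        delta = np.linalg.solve(hess(x), -g)  # Newton step
        x = x + delta                         # full step: alpha = 1
    return x, max_iter

# Example: Rosenbrock function F(x) = 100(x2 - x1^2)^2 + (1 - x1)^2.
grad = lambda x: np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                            200*(x[1] - x[0]**2)])
hess = lambda x: np.array([[1200*x[0]**2 - 400*x[1] + 2, -400*x[0]],
                           [-400*x[0],                    200.0]])
print(newton(grad, hess, [-1.2, 1.0]))  # converges to (1, 1)
\end{verbatim}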
In more complex line-search methods, we run a one-dimensional search in the direction $s_k$ to find a value $\alpha_k>0$ which ``sufficiently reduces'' the objective function (see Section 13.1.2 for the conditions on $\alpha$: the Wolfe conditions; a sufficient reduction is obtained if these conditions are respected).

Near the optimum, we must always take $\alpha=1$ to allow a ``full step of one'' to occur: see Section 2.1.6 for more information.

Newton's method has quadratic convergence: see Section 13.1.1 for proof.


$B_k$ must be positive definite for line-search methods.

In line-search methods, we want the search direction $\delta_k$ to be a descent direction:

$\displaystyle \delta^T g < 0$ (2.5)

Taking the value of $g$ from Equation 2.4 and substituting it into Equation 2.5, we obtain:

$\displaystyle - \delta^T B \delta < 0 \quad \Longleftrightarrow \quad \delta^T B \delta > 0$ (2.6)

Equation 2.6 says that $B_k$ must always be positive definite. In line-search methods, we must therefore always construct the $B$ matrix so that it is positive definite.
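
A standard way to test whether a given matrix $B$ is positive definite is to attempt a Cholesky factorization, which exists if and only if the matrix is positive definite. A minimal sketch (the example matrices are our own):

\begin{verbatim}
import numpy as np

def is_positive_definite(B):
    """B = L L^T (Cholesky) exists iff the symmetric matrix B is positive definite."""
    try:
        np.linalg.cholesky(B)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False
\end{verbatim}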

One possibility is to take $B=I$ ($I$ = identity matrix), which is a very bad approximation of the Hessian $H$ but which is always positive definite. We then recover the ``steepest descent algorithm''.

Another possibility, if $B$ is not positive definite, is to use instead $B_{new}=B+\lambda I$ with $\lambda$ a very large number, chosen such that $B_{new}$ is positive definite. Then we solve, as usual, the Newton step equation (see Equation 2.4): $B_{new} \delta_k=-g_k$. Choosing a high value for $\lambda$ has two effects:
  1. $B$ becomes negligible and we find, as search direction, ``the steepest descent step''.
  2. The step size $\Vert \delta_k \Vert$ is reduced.
In fact, only the second point is important. It can be proved that, if we impose a proper limitation on the step size, $\Vert \delta_k \Vert < \Delta_k$, we maintain global convergence even if $B$ is an indefinite matrix ($\Delta_k$ is called the trust region radius). Trust region algorithms are based on this principle.

The old Levenberg-Marquardt algorithm uses a technique which adapts the value of $\lambda$ during the optimization. If the iteration was successful, we decrease $\lambda$ to exploit more fully the curvature information contained in $B$. If the previous iteration was unsuccessful, the quadratic model does not fit the real function properly, and we must rely only on the ``basic'' gradient information: we increase $\lambda$ in order to follow the gradient more closely (``steepest descent algorithm'').
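
The following sketch illustrates this adaptation scheme. The factor 10 used to update $\lambda$, the initial $\lambda$ and the quadratic test function are conventional choices of ours; the text does not prescribe them.

\begin{verbatim}
import numpy as np

def lm_iteration(f, grad, hess, x, lam):
    """One Levenberg-Marquardt-style iteration: solve (B + lambda*I) delta = -g,
    then decrease lambda on success and increase it on failure."""
    g, B = grad(x), hess(x)
    delta = np.linalg.solve(B + lam * np.eye(len(x)), -g)
    if f(x + delta) < f(x):       # success: trust the quadratic model more
        return x + delta, lam / 10.0
    return x, lam * 10.0          # failure: move toward steepest descent

# Example on F(x) = 0.5 x^T A x with A = diag(1, 10):
A = np.diag([1.0, 10.0])
f, grad, hess = lambda x: 0.5 * x @ A @ x, lambda x: A @ x, lambda x: A
x, lam = np.array([1.0, 1.0]), 1.0
for k in range(30):
    x, lam = lm_iteration(f, grad, hess, x, lam)
print(x, lam)   # x is near the origin; lambda has shrunk toward pure Newton
\end{verbatim}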

For intermediate values of $\lambda$, we thus follow a direction which is a mixture of the ``steepest descent step'' and the ``Newton step''. This direction is based on a perturbed Hessian matrix and can sometimes be disastrous (the perturbation $\lambda I$ on $B$ has no geometrical meaning).

This old algorithm is the basis for the explanation of the update of the trust region radius in trust region algorithms. However, in trust region algorithms, the direction $\delta_k$ of the next point to evaluate is perfectly controlled.

To summarize: the Levenberg-Marquardt perturbation controls the step only indirectly, through $\lambda$, whereas trust region algorithms control the step $\delta_k$ directly, through the bound $\Vert \delta_k \Vert < \Delta_k$.


Why Newton's method is crucial: the Dennis-Moré theorem

The Dennis-Moré theorem is a very important theorem. It states that a non-linear optimization algorithm converges superlinearly on the condition that, asymptotically, the steps taken are equal to the Newton steps $A_k \delta_k=-g_k$, where $A_k$ is not the Hessian matrix of ${\cal F}$ but must satisfy $A_k \rightarrow B_k$ (and $B_k \rightarrow H_k$). This is a very general result, applicable also to constrained optimization, non-smooth optimization, etc.

Dennis-Moré theorem: let the steps of the algorithm satisfy

$\displaystyle A_k \delta_k = -g_k$ (2.7)

and let $x_k \rightarrow x^*$ with $g(x^*)=0$, $H$ Lipschitz continuous and $H^*=H(x^*)$ invertible. If

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert (A_k-H^*)\delta_k\Vert}{\Vert\delta_k\Vert}=0$ (2.8)

then the convergence of $x_k$ towards $x^*$ is superlinear.


Definition:
A function $g(x)$ is said to be Lipschitz continuous if there exists a constant $\gamma$ such that:

$\displaystyle \Vert g(x)-g(y)\Vert \leq \gamma \Vert x-y \Vert$ (2.9)


To make the proof, we first establish two lemmas.

Lemma 1.

Lemma 1: if $H$ is Lipschitz continuous with constant $\gamma$ (see Equation 2.9), then:

$\displaystyle \Vert g(v)-g(u)-H(x)(v-u)\Vert \leq \frac{\gamma}{2} \Vert v-u\Vert ( \Vert v-x\Vert + \Vert u-x\Vert )$ (2.10)

The well-known Riemann integral gives:

$\displaystyle {\cal F}(x+\delta)-{\cal F}(x)= \int_x^{x+\delta} g(z)\, dz$

The extension to the multivariate case is straightforward:

$\displaystyle g(x+\delta)-g(x)= \int_x^{x+\delta} H(z) dz $

After a change of the integration variable, $z=x+\delta t \Rightarrow dz= \delta\, dt$, we obtain:

$\displaystyle g(x+\delta)-g(x)= \int_0^{1} H(x+t \delta) \delta dt $

We subtract $H(x)\delta$ from both sides:

$\displaystyle g(x+\delta)-g(x)-H(x)\delta = \int_0^{1} ( H(x+t \delta) - H(x)) \delta\, dt$ (2.11)

Using the fact that $\displaystyle \Vert \int_0^1 H(t) dt \Vert \leq \int_0^1 \Vert H(t) \Vert dt$ and the Cauchy-Schwarz inequality $\Vert a \; b\Vert \leq \Vert a\Vert \Vert b\Vert$, we can write:

$\displaystyle \Vert g(x+\delta)-g(x)-H(x)\delta \Vert \leq \int_0^{1} \Vert H(x+t\delta) - H(x) \Vert \, \Vert\delta\Vert \, dt$


Using the fact that the Hessian $H$ is Lipschitz continuous (see Equation 2.9):

$\displaystyle \Vert g(x+\delta)-g(x)-H(x)\delta \Vert \leq \int_0^{1} \gamma \Vert t \delta \Vert \, \Vert \delta \Vert \, dt \leq \gamma \Vert \delta \Vert^2 \int_0^{1} t \; dt \leq \frac{\gamma}{2} \Vert\delta\Vert^2$ (2.12)

Lemma 1 can be directly deduced from Equation 2.12.

Lemma 2.

Lemma 2: if $H$ is Lipschitz continuous with constant $\gamma$ and $H(x)^{-1}$ exists, then there exist $\epsilon > 0$ and $\alpha > 0$ such that:

$\displaystyle \Vert g(v)-g(u)\Vert \geq \alpha \Vert v-u\Vert$ (2.13)

for all $v$, $u$ which respect $\max( \Vert v-x\Vert, \Vert u-x\Vert) \leq \epsilon$.

If we write the triangle inequality $\Vert a+b\Vert \leq \Vert a\Vert+\Vert b\Vert$ with $a=g(v)-g(u)$ and $b=H(x)(v-u)+g(u)-g(v)$, we obtain:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Vert H(x)(v-u) \Vert - \Vert g(v)-g(u)-H(x)(v-u)\Vert$


Using the Cauchy-Schwarz inequality $\displaystyle \Vert a \; b\Vert \leq \Vert a\Vert \Vert b\Vert \Leftrightarrow \Vert b\Vert \geq \frac{\Vert a \; b\Vert}{\Vert a\Vert}$ with $a=H(x)^{-1}$ and $b=H(x)(v-u)$, and using Equation 2.10:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Bigg[ \frac{1}{\Vert H(x)^{-1} \Vert} - \frac{\gamma}{2} (\Vert v-x\Vert+\Vert u-x\Vert) \Bigg] \Vert v-u\Vert$


Using the hypothesis that $\max( \Vert v-x\Vert, \Vert u-x\Vert) \leq \epsilon$:

$\displaystyle \Vert g(v)-g(u) \Vert \geq \Bigg[ \frac{1}{\Vert H(x)^{-1} \Vert} - \gamma \epsilon \Bigg] \Vert v-u\Vert$


Thus, if $\displaystyle \epsilon < \frac{1}{\gamma \Vert H(x)^{-1} \Vert}$, the lemma is proven with $\displaystyle \alpha=\frac{1}{\Vert H(x)^{-1} \Vert} - \gamma \epsilon$.

Proof of the Dennis-Moré theorem.

We first write the ``step'' equation (Equation 2.7):

$\displaystyle A_k \delta_k = -g_k$
$\displaystyle 0 = A_k \delta_k + g_k$
$\displaystyle 0 = (A_k-H^*) \delta_k + g_k + H^* \delta_k$
$\displaystyle -g_{k+1} = (A_k-H^*) \delta_k + [\, -g_{k+1} + g_k + H^* \delta_k \,]$
$\displaystyle \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \leq \frac{\Vert(A_k-H^*)\delta_k\Vert}{\Vert \delta_k \Vert}+ \frac{\Vert -g_{k+1} + g_k + H^*\delta_k \Vert}{\Vert \delta_k \Vert}$

Using Lemma 1 (Equation 2.10) with $x=x^*$, $v=x_{k+1}$, $u=x_k$, and defining $e_k=x_k-x^*$, we obtain:

$\displaystyle \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \leq \frac{\Vert(A_k-H^*)\delta_k\Vert}{\Vert \delta_k \Vert}+ \frac{\gamma}{2} (\Vert e_k\Vert+\Vert e_{k+1}\Vert)$

Using $ \lim_{k \rightarrow \infty} \Vert e_k\Vert=0$ and Equation 2.8:

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} = 0$ (2.14)

Using Lemma 2 (Equation 2.13), there exist $\alpha>0$ and $k_0 \geq 0$ such that, for all $k>k_0$, we have (using $g(x^*)=0$):

$\displaystyle \Vert g_{k+1}\Vert= \Vert g_{k+1} - g(x^*) \Vert \geq \alpha \Vert e_{k+1}\Vert$ (2.15)

Combining Equations 2.14 and 2.15, and using $\Vert\delta_k\Vert \leq \Vert e_k\Vert + \Vert e_{k+1}\Vert$:

$\displaystyle 0 = \lim_{k \rightarrow \infty} \frac{\Vert g_{k+1}\Vert}{\Vert\delta_k\Vert} \geq \lim_{k \rightarrow \infty} \alpha \frac{\Vert e_{k+1}\Vert}{\Vert\delta_k\Vert} \geq \lim_{k \rightarrow \infty} \alpha \frac{\Vert e_{k+1}\Vert}{\Vert e_k\Vert+\Vert e_{k+1}\Vert}= \lim_{k \rightarrow \infty} \frac{\alpha r_k}{1+r_k}$ (2.16)

where we have defined $\displaystyle r_k=\frac{\Vert e_{k+1}\Vert}{\Vert e_k\Vert}$. This implies that:

$\displaystyle \lim_{k \rightarrow \infty} r_k = 0$ (2.17)

which completes the proof of superlinear convergence.
Since $g(x)$ is Lipschitz continuous, it is easy to show that the Dennis-Moré theorem remains true if Equation 2.8 is replaced by:

$\displaystyle \lim_{k \rightarrow \infty} \frac{\Vert (A_k-H_k)\delta_k\Vert}{\Vert\delta_k\Vert} = \lim_{k \rightarrow \infty} \frac{\Vert (A_k-B_k)\delta_k\Vert}{\Vert\delta_k\Vert}=0$ (2.18)


This means that, if $A_k \rightarrow B_k$, then we must have $\alpha_k$ (the length of the steps) $\rightarrow 1$ to have superlinear convergence. In other words, to have superlinear convergence, the ``steps'' of a secant method must converge in magnitude and direction to the Newton steps (see Equation 2.4) at the same points. A step with $\alpha_k=1$ is called a ``full step of one''. To obtain superlinear convergence, it is necessary to allow a ``full step of one'' to take place when we are near the optimum. The Wolfe conditions (see Equations 13.4 and 13.5 in Section 13.1.2) always allow a ``full step of one''.
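
As a concrete illustration of the theorem, consider the one-dimensional secant method, a sketch of our own: the approximations $A_k = (g(x_k)-g(x_{k-1}))/(x_k-x_{k-1})$ converge to the true second derivative, the steps converge to the Newton steps, and the convergence is therefore superlinear.

\begin{verbatim}
def secant(g, x0, x1, tol=1e-12, max_iter=50):
    """1-D secant method: A_k approximates H(x_k) from two gradient values,
    and each step solves A_k * delta_k = -g(x_k) with a full step of one."""
    for k in range(max_iter):
        g0, g1 = g(x0), g(x1)
        if abs(g1) < tol:
            return x1, k
        A = (g1 - g0) / (x1 - x0)   # A_k -> H(x*): Dennis-More condition holds
        x0, x1 = x1, x1 - g1 / A    # full step: x_{k+1} = x_k - g_k / A_k
    return x1, max_iter

# Minimize F(x) = x^4/4 + x^2, so g(x) = x^3 + 2x and x* = 0.
print(secant(lambda x: x**3 + 2*x, 1.0, 0.9))  # superlinear (order ~1.618)
\end{verbatim}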

When we deal with constraints, it is sometimes not possible to take a ``full step of one'' because we ``bump'' into the frontier of the feasible space. In such cases, algorithms like FSQP try to ``bend'' or modify slightly the search direction to allow a ``full step of one'' to occur.

This is also why the ``trust region radius'' $ \Delta_k $ must be large near the optimum to allow a ``full step of one'' to occur.