Demonstration: Gradient-based optimization
In mathematics, optimization (minimization or maximization of a function) is an important foundation applied in many fields such as information science, economics, bioinformatics, and physics. It is often difficult to solve an optimization problem analytically because the target function is very complex in practical cases. In such cases we rely on numerical or approximate methods to obtain a good solution. In this article, we introduce practical gradient-based approaches (Steepest descent, Newton method, and Quasi-Newton method) to optimization problems and show how to implement them with the Sized Linear Algebra Package (SLAP). You can implement the methods as easily as with other linear algebra libraries.
In this article, we consider the minimization problem defined as
$$ \min_{\bm{x} \in \R^n} f(\bm{x}) $$
where $f : \R^n\to\R$ is a target function. A minimization problem can be converted into a maximization problem by replacing $f(\bm{x})$ with $-f(\bm{x})$. The target function can be non-convex, but it must be differentiable. We compute a minimal point (a local minimum) instead of an exact minimum point of the target function, since finding the latter is hard. Gradient-based approaches use the gradient of the target function for minimization:
$$ \bm{\nabla} f(\bm{x}) = \left( \frac{\partial f}{\partial x_1}(\bm{x}), \dots, \frac{\partial f}{\partial x_n}(\bm{x}) \right)^\top, $$
which is $\bm{0}$ at a minimal point.
Thus we need to compute $\hat{\bm{x}}$ such that $\bm{\nabla} f(\hat{\bm{x}}) = \bm{0}$.
Preliminary
First, load SLAP in the OCaml REPL or utop (the # at the head of each line is the prompt):
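The exact packaging may differ on your system; assuming SLAP is installed through OPAM and findlib is available, a typical session looks like this (Slap.D provides the double-precision BLAS/LAPACK functions used below):

# #use "topfind";;
# #require "slap";;
# open Slap.D;;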
Second, define the target function $f$ you want to minimize. In this article, we choose a Gaussian function as a target:
$$ f(\bm{x}) = -\exp\left( -\frac{1}{2} (\bm{x} - \bm{b})^\top \bm{A} (\bm{x} - \bm{b}) \right) $$
where
$$ \bm{A} = \begin{pmatrix} 0.1 & -0.1 \\ -0.1 & 0.2 \end{pmatrix}, \quad \bm{b} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}. $$
Note that $\bm{A}$ is symmetric, and the minimum is $-1$ at $(1, 3)$.
The Gaussian function is implemented as follows:
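A minimal sketch of such a definition (the exact code lives in the sample programs; Vec.sub, dot, and symv are SLAP functions, and the formula is the one given above):

let gauss a b x =
  let xb = Vec.sub x b in                 (* x - b *)
  ~-. (exp (-0.5 *. dot xb (symv a xb)))  (* -exp(-(x-b)^T A (x-b) / 2) *)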
where dot is a Level-1 BLAS function and symv is a Level-2 BLAS function. The type of gauss means that
- the first argument is an 'a-by-'a matrix,
- the second argument is an 'a-dimensional vector,
- the third argument is an 'a-dimensional vector, and
- the return value has type float.
You can execute gauss by passing $\bm{A}$, $\bm{b}$ and $\bm{x}$:
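For example (Mat.init, Vec.init, and Vec.make are assumed here as one convenient way to build the sized matrix and vectors; the values are those of $\bm{A}$ and $\bm{b}$ above):

# let a = Mat.init Slap.Size.two Slap.Size.two
           (fun i j -> if i = j then 0.1 *. float_of_int i else -0.1);;
# let b = Vec.init Slap.Size.two (fun i -> if i = 1 then 1.0 else 3.0);;
# gauss a b b;;                             (* = -1.0, the minimum *)
# gauss a b (Vec.make Slap.Size.two 0.0);;  (* roughly -0.52 at the origin *)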
However, you cannot pass vectors or matrices whose sizes are inconsistent with the type of gauss, as follows:
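For instance, a three-dimensional vector is rejected by the type checker before the program ever runs (this call is illustrative):

# gauss a b (Vec.make Slap.Size.three 1.0);;  (* type error: a three-dimensional vector where a two-dimensional one is expected *)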
The static size checking of SLAP protects you against dimensional inconsistency. If you get an error message like the above output, your program possibly has a bug.
Steepest descent method
Sample program: examples/optimization/steepest_descent.ml
Steepest descent (a.k.a. gradient descent) is a kind of iterative method: we choose an initial value $\bm{x}^{(0)}$, and generate points $\bm{x}^{(1)},\bm{x}^{(2)},\bm{x}^{(3)},\dots$ by
$$ \bm{x}^{(t+1)} = \bm{x}^{(t)} - \eta \bm{\nabla} f(\bm{x}^{(t)}) $$
where $\eta$ is a learning rate (a parameter for controlling convergence). The above update formula is easily implemented as follows:
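A sketch of such an implementation (the ~loops parameter and the printing are illustrative; the sample program is the reference version):

let steepest_descent ~loops ~eta df x =
  for i = 1 to loops do
    axpy ~alpha:(~-. eta) (df x) x;   (* x <- x - eta * grad f(x), in place *)
    Format.printf "%d: x = @[%a@]@." i Slap.Io.pp_fvec x
  done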
The above code uses the Level-1 BLAS function axpy.
The gradient of the target function is given by
$$ \bm{\nabla} f(\bm{x}) = -f(\bm{x})\, \bm{A} (\bm{x} - \bm{b}). $$
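It can be implemented along the same lines as gauss (again a sketch consistent with the formula above):

let dgauss a b x =
  symv ~alpha:(~-. (gauss a b x)) a (Vec.sub x b)  (* -f(x) * A (x - b) *)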
dgauss has a type like that of gauss, but dgauss returns an 'a-dimensional vector.
Run the steepest descent method as follows:
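For example, continuing the session above (the initial point, learning rate, and iteration count are illustrative choices):

# steepest_descent ~loops:100 ~eta:2.0 (dgauss a b) (Vec.make Slap.Size.two 0.0);;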
Our program successfully found the minimum point (the exact solution is (1, 3)). You can control the speed of convergence by changing the learning rate.
Bisection search of learning rate by Wolfe conditions
Sample program: examples/optimization/steepest_descent_wolfe.ml
Too large a learning rate may pass through a minimal point (zigzag convergence), while too small a learning rate reaches a solution slowly. In this section, we introduce an approach to find a learning rate achieving fast convergence: the Wolfe conditions are used for finding a suitable learning rate. The Wolfe conditions are
- $f(\bm{x}+\alpha\bm{p})\le f(\bm{x})+c_1\alpha\bm{p}^\top\bm{\nabla}f(\bm{x})$, and
- $\bm{p}^\top\bm{\nabla}f(\bm{x}+\alpha\bm{p})\ge c_2\bm{p}^\top\bm{\nabla}f(\bm{x})$
where $\bm{p}$ is a search direction, and $0<c_1<c_2<1$. These conditions give an upper bound and a lower bound on the learning rate $\alpha$. We can find a learning rate satisfying the conditions by bisection search:
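A sketch of such a bisection search (the default values of c1, c2, and init are illustrative; copy and axpy are used to build $\bm{x}+\alpha\bm{p}$):

let wolfe_search ?(c1 = 1e-4) ?(c2 = 0.9) ?(init = 1.0) df f p x =
  let xap alpha = let y = copy x in axpy ~alpha p y; y in  (* x + alpha * p *)
  let fx = f x in
  let dfx = dot p (df x) in                                (* p^T grad f(x) *)
  let rec search lo hi alpha =
    if f (xap alpha) > fx +. c1 *. alpha *. dfx
    then search lo (Some alpha) ((lo +. alpha) /. 2.)      (* condition 1 violated: shrink *)
    else if dot p (df (xap alpha)) < c2 *. dfx
    then (match hi with                                    (* condition 2 violated: grow *)
          | None -> search alpha hi (2. *. alpha)
          | Some h -> search alpha hi ((alpha +. h) /. 2.))
    else alpha
  in
  search 0. None init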
wolfe_search ?c1 ?c2 ?init df f p x returns a learning rate satisfying the Wolfe conditions. The optional argument init indicates the initial value of the bisection search. If the computation cost of the target function and its derivative is quite large, wolfe_search is inefficient.
The following code implements steepest descent method with automatic search of learning rates by Wolfe conditions:
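A sketch of such an implementation, built on the wolfe_search above (the printed output and the ~loops parameter are illustrative):

let steep_descent_wolfe ~loops df f x =
  for i = 1 to loops do
    let p = copy (df x) in
    scal (-1.0) p;                      (* search direction p = -grad f(x) *)
    let eta = wolfe_search df f p x in  (* learning rate from the Wolfe conditions *)
    axpy ~alpha:eta p x;                (* x <- x + eta * p *)
    Format.printf "%d: f(x) = %g@." i (f x)
  done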
steep_descent_wolfe achieves faster convergence as follows:
The convergence in Fig 3 is similar to that in Fig 2 (the naive steepest descent described in the previous section), but the former is faster than the latter.
Newton method
Sample program: examples/optimization/newton.ml
Newton method (a.k.a. Newton-Raphson method) is also a kind of iterative approach, using the second derivative of a target function in addition to the first:
$$ \bm{x}^{(t+1)} = \bm{x}^{(t)} - \left( \bm{\nabla}^2 f(\bm{x}^{(t)}) \right)^{-1} \bm{\nabla} f(\bm{x}^{(t)}). $$
The second derivative (often called the Hessian matrix) is defined by
$$ \bm{\nabla}^2 f(\bm{x}) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j}(\bm{x}) \right]_{i,j=1,\dots,n}. $$
Newton method is implemented by using sytri and symv as follows:
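A sketch of one possible implementation; it assumes sytri computes the sytrf factorization itself when ?ipiv is not given (as in Lacaml) and overwrites the matrix with its inverse, of which only the upper triangle (the part symv reads by default) is meaningful:

let newton ~loops df ddf x =
  for i = 1 to loops do
    let h = ddf x in
    sytri h;                                               (* h <- (hessian)^-1, in place *)
    ignore (symv ~alpha:(-1.0) ~beta:1.0 ~y:x h (df x));   (* x <- x - h * grad f(x) *)
    Format.printf "%d: x = @[%a@]@." i Slap.Io.pp_fvec x
  done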
The Hessian matrix of the Gaussian function is given as
$$ \bm{\nabla}^2 f(\bm{x}) = f(\bm{x}) \left( \bm{A} (\bm{x}-\bm{b}) (\bm{x}-\bm{b})^\top \bm{A} - \bm{A} \right). $$
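A sketch consistent with this formula (lacpy copies $\bm{A}$; syr performs the rank-1 update on the upper triangle only, which is the triangle sytri and symv use by default):

let ddgauss a b x =
  let axb = symv a (Vec.sub x b) in    (* A (x - b) *)
  let h = lacpy a in
  Mat.scal (-1.0) h;                   (* h = -A *)
  ignore (syr axb h);                  (* upper triangle of h = A(x-b)(x-b)^T A - A *)
  Mat.scal (gauss a b x) h;            (* h = f(x) * (A(x-b)(x-b)^T A - A) *)
  h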
where syr and Mat.scal are BLAS functions.
Try newton:
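For example, starting from a point near the minimum (the initial point and the number of iterations are illustrative):

# let x0 = Vec.init Slap.Size.two (fun i -> if i = 1 then 1.5 else 3.0);;
# newton ~loops:10 (dgauss a b) (ddgauss a b) x0;;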
Newton method finds a minimal point quickly, but the approach has two problems:
- Convergence of the iteration is not guaranteed if the Hessian matrix is not symmetric positive-definite. You need to give an initial point close to a minimal point, because the Hessian matrix is positive-definite at points near a minimal point; Newton method starting from an initial point far from a minimal point fails. For example, newton outputs a wrong result if you pass (0, -2) as an initial point, since the Hessian matrix at (0, -2) is not positive-definite.
- Computing the Hessian matrix and its inverse is expensive. In addition, a target function is often too complex to compute its Hessian matrix analytically in practical cases.
Quasi-Newton method
Sample program: examples/optimization/quasi_newton.ml
Quasi-Newton method is like Newton method, but it uses an approximated inverse Hessian matrix. Several approximation approaches have been proposed; we only introduce the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm in this section.
Let $\bm{H}_t$ be an approximated inverse Hessian matrix at time step $t$; then the iteration of Quasi-Newton method is defined as
$$ \bm{x}_{t+1} = \bm{x}_t - \alpha_t \bm{H}_t \bm{\nabla} f(\bm{x}_t). $$
The inverse Hessian approximated by BFGS is computed by
$$ \bm{H}_{t+1} = \left( \bm{I} - \rho_t \bm{s}_t \bm{y}_t^\top \right) \bm{H}_t \left( \bm{I} - \rho_t \bm{y}_t \bm{s}_t^\top \right) + \rho_t \bm{s}_t \bm{s}_t^\top, \quad \rho_t = \frac{1}{\bm{y}_t^\top \bm{s}_t}, $$
where $\bm{s}_t=\bm{x}_{t+1}-\bm{x}_t$ and $\bm{y}_t=\bm{\nabla} f(\bm{x}_{t+1})-\bm{\nabla} f(\bm{x}_t)$. $\bm{H}_0$ is an identity matrix. The learning rate $\alpha_t$ at each time step is chosen by the Wolfe conditions to keep $\bm{H}_t$ positive-definite.
The following function update_h takes $\bm{H}_t$, $\bm{y}_t$ and $\bm{s}_t$, and destructively assigns $\bm{H}_{t+1}$ into the memory of argument $\bm{H}_t$ (parameter ?up specifies whether the upper or the lower triangular part of $\bm{H}_t$ is used).
Using update_h, Quasi-Newton method is implemented as follows:
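A sketch built on the wolfe_search and update_h above (the printed output and the ~loops parameter are illustrative):

let quasi_newton ~loops df f x =
  let h = Mat.identity (Vec.dim x) in           (* H_0 = I *)
  let g = ref (df x) in
  for i = 1 to loops do
    let p = symv ~alpha:(-1.0) h !g in          (* search direction p = -H grad f(x) *)
    let alpha = wolfe_search df f p x in
    let s = copy p in
    scal alpha s;                               (* s = alpha * p = x_{t+1} - x_t *)
    axpy s x;                                   (* x <- x + s *)
    let g' = df x in
    update_h h (Vec.sub g' !g) s;               (* y = grad_{t+1} - grad_t *)
    g := g';
    Format.printf "%d: f(x) = %g@." i (f x)
  done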
Quasi-Newton method converges in only 7 iterations:
The search direction in the early steps of Quasi-Newton method is inaccurate because $\bm{H}_t$ has not converged yet. However, after some steps, $\bm{H}_t$ approximates the inverse Hessian well, and the convergence gets faster.
Visualization tips
Sample program: examples/optimization/visualization.ml
Figures in this article are plotted by gnuplot, an awesome graphing tool. In this section, we introduce how to draw figures like Fig 2, 3, 4, and 5 (i.e., visualization of convergence) with SLAP and gnuplot. SLAP provides no interface to gnuplot, but the visualization is not difficult.
Note: the OCaml library Gnuplot-OCaml is an interface to gnuplot from OCaml, but it does not support 3D graphs, so we do not use it in this section.
Using gnuplot from OCaml
Gnuplot is usually used on the console (or called from a shell script):
$ gnuplot
gnuplot> plot sin(x) # Plot a sine curve
The above command shows a figure of a sine curve on a window. You can use gnuplot from OCaml by using pipes:
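A minimal sketch (the helper name gnuplot and its callback interface are illustrative; it requires the unix library for Unix.open_process_out):

let gnuplot f =
  let oc = Unix.open_process_out "gnuplot" in
  let ppf = Format.formatter_of_out_channel oc in
  f ppf;                             (* the caller prints gnuplot commands to ppf *)
  Format.pp_print_flush ppf ();
  ignore (Unix.close_process_out oc)

(* Example: plot a sine curve from OCaml. *)
let () = gnuplot (fun ppf -> Format.fprintf ppf "plot sin(x)@.")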
Note: if you pass the command-line option -persist to gnuplot (i.e., start the process with "gnuplot -persist" instead of "gnuplot"), the window is kept open after gnuplot exits.
3D graph
plot
command (in the previous section) plots 2D graphs,
and splot
command draws 3D graphs.
The following example draws a 3D graph like Fig 1 (without contour):
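One possible version, reusing the gnuplot helper sketched above (the isosamples setting and the axis ranges are illustrative):

let () = gnuplot (fun ppf ->
    Format.fprintf ppf "set isosamples 50@.";  (* finer sampling of the surface *)
    Format.fprintf ppf
      "splot [-4:6][-2:8] -exp((-0.1*(x-1)**2 - 0.2*(y-3)**2 + 0.2*(x-1)*(y-3)) / 2.0)@.")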
The drawn function -exp((-0.1*(x-1)**2 - 0.2*(y-3)**2 + 0.2*(x-1)*(y-3)) / 2.0) is the same as the target function in this article.
However, we want to plot the function gauss (defined in OCaml) directly, for simplicity and maintainability of code. The key idea for drawing an OCaml function is to pass data points to the splot command: splot can plot not only functions defined in gnuplot but also data points like:
gnuplot> splot '-' with lines # Plot data points given from stdin
# x y z
1 1 1
1 2 2
1 3 3
2 1 2
2 2 4
2 3 6
3 1 3
3 2 6
3 3 9
e
We can plot OCaml functions by conversion into data points:
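A sketch of such a conversion: sample the function on an n-by-n grid and feed the points to splot '-' (the default option, n, and ranges are illustrative):

let splot_fun ?(option = "with lines") ?(n = 50)
    ?(x1 = -5.0) ?(x2 = 5.0) ?(y1 = -5.0) ?(y2 = 5.0) ppf f =
  Format.fprintf ppf "splot '-' %s@." option;
  for i = 0 to n - 1 do
    let x = x1 +. (x2 -. x1) *. float_of_int i /. float_of_int (n - 1) in
    for j = 0 to n - 1 do
      let y = y1 +. (y2 -. y1) *. float_of_int j /. float_of_int (n - 1) in
      Format.fprintf ppf "%g %g %g@." x y (f x y)
    done;
    Format.fprintf ppf "@."   (* blank line between scans of x *)
  done;
  Format.fprintf ppf "e@."    (* terminate the inline data *)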
splot_fun ?option ?n ?x1 ?x2 ?y1 ?y2 ppf f draws the 3D graph of an OCaml function f, where
- ?option is a string of options for the splot command,
- ?n is the number of points,
- ?x1 and ?x2 specify the range of x coordinates, and
- ?y1 and ?y2 specify the range of y coordinates.
Using splot_fun, gauss can be plotted as follows:
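For example, reusing the definitions sketched above (a and b are the matrix and vector from the Preliminary section; the ranges are illustrative):

let () = gnuplot (fun ppf ->
    splot_fun ~x1:(-4.0) ~x2:6.0 ~y1:(-2.0) ~y2:8.0 ppf
      (fun x y ->
         gauss a b (Vec.init Slap.Size.two
                      (fun i -> if i = 1 then x else y))))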
Plotting contour
You can plot contours with the set contour command.
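For example, in an interactive gnuplot session (the same commands can also be sent through the pipe; set contour draws the contour lines on the base of the 3D plot by default):

gnuplot> set contour
gnuplot> replot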
Plotting a line graph of convergence
Line graphs of convergence (like Fig 2, 3, 4, and 5) are 2D, but we draw them as 3D graphs because we will overlay a line graph on a contour graph.
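One simple way, again a sketch: print each iterate as a data point "x y z" and feed the whole run to splot '-' with lines (the interface mirrors the illustrative steepest_descent above; z is set to the function value):

let splot_descent ppf ~loops ~eta f df x =
  Format.fprintf ppf "splot '-' with lines@.";
  for _ = 1 to loops do
    axpy ~alpha:(~-. eta) (df x) x;                (* one steepest-descent step *)
    Format.fprintf ppf "%g %g %g@."
      (Vec.get_dyn x 1) (Vec.get_dyn x 2) (f x)
  done;
  Format.fprintf ppf "e@."                         (* terminate the inline data *)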
Overlaying a line graph on a contour graph
Overlaying two graphs is achieved by set multiplot.
The following code displays a graph similar to Fig 2; we used a few additional commands to improve the visualization, and the sample program examples/optimization/visualization.ml contains those commands and shows the same graph as Fig 2.
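A rough sketch of the overlay, reusing the helpers sketched above (without the additional commands, such as fixing the axis ranges of both layers, the two plots may not line up exactly):

let () = gnuplot (fun ppf ->
    Format.fprintf ppf "set contour base@.";
    Format.fprintf ppf "set multiplot@.";
    (* first layer: the target function with contours *)
    splot_fun ~x1:(-4.0) ~x2:6.0 ~y1:(-2.0) ~y2:8.0 ppf
      (fun x y ->
         gauss a b (Vec.init Slap.Size.two
                      (fun i -> if i = 1 then x else y)));
    (* second layer: the convergence path of steepest descent *)
    splot_descent ppf ~loops:100 ~eta:2.0 (gauss a b) (dgauss a b)
      (Vec.make Slap.Size.two 0.0);
    Format.fprintf ppf "unset multiplot@.")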