## Numerical non-linear inversion techniques

In most cases of non-linear inversion, we need to look for numerical solutions to the problem of finding the minima of our defined error functions. Computationally, these need to be as efficient and robust as possible i.e. using the optimum amount of information and iterations. There is no 'best' solution - be aware that different techniques can be used in an attempt to find the optimal solution.

In inverting canopy reflectance models against reflectance data, we are usually interested in constrained optimisation - a special case of the general optimisation problem where some variables may be fixed, or lie between fixed limits. Figure 1 shows some of the difficulties this may pose in the 1-D case (from Numerical Recipes in C, Press et al., p 395).

Figure 1 - points A, C and E are local, but not global maxima. Points B and F are local, but not global minima. The global maximum occurs at G, which is on the boudary of the interval so that the derivative of the function need not vanish there. At E, derivatives higher than the first vanish, which can cause difficulties for some algorithms. The points X, Y and Z are said to bracket the point F since Y is less than both X and Z i.e. Y is a "first guess" at a minima in this region. It is important to note that in the general case of N-dimensional optimisation (particularly if not using derivatives), the solution is reduced to a series of N (iteratively improving) minimisations in 1-D. As a result, nearly all the algorithms we will see can be reduced to the problem of finding the extrema of a 1-D function.

How do we choose which methods we use? We need to decide between methods that need only to evaluate the function, against those which calculate (partial) derivatives. Derivative methods may often be more powerful in terms of finding global minima (assuming functions are well-behaved and differentiable). Derivative methods such as the conjugate gradient algorithms of Fletcher-Reeves and Polak-Ribiere, and the well-known quasi-Newton algorithms of Davidon-Fletcher-Powell (DFP) and Broyden-Fletcher-Goldfarb-Shanno (BFGS) are widely used in practice due to their robustness. Derivative methods however require many more function evaluations than is desirable for rapid inversion and also require a priori knowledge of the behaviour of the function. Because of this, we will concentrate on methods not requiring calculation of derivatives. Another computational issue is that of storage: we need to decide between methods requiring storage of order N (N is the number of dimensions of the minimisation problem), and order N2.

## Golden Section Search

This method is one of the simplest for determining extrema, and describes the successive bracketing of a minimum of a function f to provide an improving 'guess' at the 'real' value of the minimum.

Figure 2 (see Press et al., p 398) - successive bracketing of a minimum. The minimum is originally bracketed by points 1,3, and 2. The function is evaluated at 4, which replaces 2; than at 5, which replaces 1; then at 6, which replaces 4. The rule at each stage is to keep a center point that is lower than the two outside points. After the steps shown, the minimum is bracketed by points 5,3, and 6.

A minimum is bracketed when there is a triplet of points a < b < c (or c < b < a) such that f(b) < f(a) and f(c). In Figure 2, points 1,3 and 2 equate to points a, b and c i.e. f(3) < f(1) and f(2). We can successively improve this bracketing by performing the following steps:

• Evaluate f at some new point x, either between 1 and 3, or between 3 and 4.
• Choosing a point between 3 and 4 say, we now evaluate f(4).
• f(3) < f(4), so the new bracketing triplet of points is 1,3 and 4.
• Choose another point at 5, say, and evaluate f(5).
• f(3) < f(5), so the new bracketing triplet is 5, 3 and 4.

• In each case, the middle point of the new triplet is the best minimum achieved so far. This process is repeated until the distance between the two outer points becomes sufficiently small (i.e. approaches some set limit) - where sufficiently small means approximately the square root of machine floating point precision.

An important question you should be asking is "How do we choose the new point x each time?" ......

Figure 3

If we say that b is some fraction w of the way between a and c then w = b-a/c-a, and, consequently, 1-w = c-b/c-a. If the new point, x, is some distance z beyond b, then z = x-b/c-a. The next bracketing segment will therefore either be of length w+z relative to the current one, or of length 1-w. To minimise the worst case, then we can choose z to make these equal i.e. z = 1 - 2w. This gives us the symmetric point to b in the interval a to c i.e. |b - a| = |x - c|. This implies that x lies in the larger of the two segments ( z > 0 only if w < 1/2).

How does the value of w get chosen? It must come presumably from the previous stage of applying the same strategy, so if z is chosen to be optimal, then so was w. This scale similarity implies that z should be the same fraction of the way from b to c (if that is the bigger segment) as was b from a to c i.e. w = z/(1-w). We can use these two equations in w and z to form the quadratic equation w^2 - 3w +1 = 0 yielding w ~ 0.38197, and 1-w ~0.61803. Therefore the optimal bracketing interval (a,b,c) has its middle point b a fractional distance 0.38197 from one end, and 0.61803 from the other. These are the so-called golden mean fractions.

In each step of our golden section search therefore, we select a point a fraction 0.38197 into the larger of the two intervals (from the central point of the triplet). An additional feature of this method is that if our initial triplet has segments not in the golden ratios, the procedure outlined above will quickly converge to the proper self-replicating ratios.

## Parabolic Interpolation and Brent's Method in One Dimension

The golden section search is a method designed to handle the least favourable of function minimisations: no assumptions are made about the shape of the function in the region of the minimum, and the minimum is bracketed with greater and greater precision with each iteration. However, if the function is smooth near the minimum, which is likely for most 'well-behaved' functions, then a parabola fitted through any three points near to the minimum should take us very close to the actual minimum in one step. Figure 4 illustrates this case.

Figure 4 - convergence by inverse parabolic interpolation (see Press et al., p 403)

This method is known as inverse parabolic interpolation, as we want to find a point x at which our function y=f(x) is a minimum (not just a value of y). The formula for the abscissa x that is the minimum of a parabola through three points f(a), f(b) and f(c) is

```x = b - 1/2 * (b-a)^2*(f(b) - f(c)) - (b-c)^2*(f(b)-f(a))
-------------------------------------------------
(b-a)*(f(b)-f(c)) - (b-c)*(f(b)-f(a))```
This equation fails only if the three points are collinear. However, it is just as likely to jump to a maximum as a minimum, and somehow this tendency has to be overcome.

A technique likely to be both stable and efficient is one which will use a slow-but-sure method such as the golden section search in unfavourable areas, but can also switch to parabolic interpolation when it comes sufficiently close to a minimum. In this way, we can be sure we are apporaching a minimum rather than a maximum (or vice versa) before switching to the more efficient parabolic method. The main difficulties of this approach are in finding a quick, robust scheme for deciding which of the two techniques is likely to be favourable, in addition to being careful not to get into problems when we are very near to a solution and the function is being evaluated close to the machine precision mentioned earlier.

Brent's method is such a scheme. At each stage, it keeps track of six function points, a, b, u, v, w, and x. The minimum is bracketed between a and b; x is the point with the least function value found so far (or the most recent in the case of a tie); w is the point with the second least function value; v is the previous value of w; u is the point at which the function was evaluated most recently.

Parabolic interpolation is tried through the points x, v, and w. To be accepted as a successful move, the parabolic step must fall between a and b and imply a movement from the best current value, x, that is less than half the movement of the step before last. This second critetion insures that the parabolic steps are actually converging to something. In the worst possible case, where the parabolic steps are acceptable but useless, the method will approximately alternate between parabolic steps and golden sections, converging in due course by virtue of the latter. Press et. al. explain the reason for choosing the step before last (as opposed to the previous step) as purely heuristic i.e. through experience, it seems better not to 'punish' the algorithm for taking one bad step if it can make it up on the next one.

## Multidimensional minimisation methods

The minimisation problem becomes far more complex in the multidimensional case, which (not surprisingly!) is of most interest to us. Non-linear canopy reflectance models are likely to be functions of many parameters (the Kuusk model, for example, has 15 separate parameters). The next section describes some methods that we can use in these cases. An excellent starting point to look for material regarding optimisation problems, with material on a huge range of optimisation techniques and associated literature are http://solon.cma.univie.ac.at/%7Eneum/glopt.html and http://www.csc.fi/math_topics/Movies/Opt.html. They include some excellent visualisations and animations of optimisation problems (GA).

Downhill Simplex Method (aka amoeba or Nelder-Mead)

This is the only multidimensional minimisation method we will come across here which does not make use of a 1-D minimisation algorithm like the ones described above. The downhill simplex method (due to Nelder and Mead), is a method requiring only function evaluations, not derivatives. It is a relatively simple method, and although not particularly efficient (in comparison to Powell, which we will come across later), it can often be the most robust.

A simplex is the geometrical figure consisting, in N dimensions, of N+1 vertices and all their interconnecting line segments, polygonal faces etc. In 2-D this is a triangle, in 3-D a tetrahedron. If any point of the simplex is taken as the origin, then the other N points define vector directions that span the N-dimensional vector space. There is no analogous bracketing method (as we saw in 1-D) in multidimensions. All we can do is give the algorithm a starting guess i.e. an N-vector of independent variables to try as the first point. The algorithm is then supposed to make its own way "downhill" (in function space) until it finds a minimum. Figure 5 illustrates this idea.

Figure 5 - possible outcomes for a step in the downhill simplex method (Press et al., p 409).

If we consider the starting point, in N dimensions we must define N+1 points, which define the initial simplex. If we define one of these points as Po then the other N points are simply

Pi = Po + kei

where the ei's are N unit vectors, and where k is a constant which is our guess of the problem's characteristic length scale. We proceed by taking a series of steps away from the starting point, and seeing if these steps take us to a more favourable position (closer to a minimum). The steps available are (a) a reflection away from the high point (b) a reflection and expansion away from the high point (c) a contraction along one dimension from the high point or (d) a contraction along all dimensions towards the low point. When it can, the algorithm expands the simplex in one or other of the directions, to take larger steps. When it reaches a "valley floor" (think of a long, steep-sided valley sloping away from us), it contracts itself in the transverse (cross-valley) direction in order to ooze down the valley. If these steps are not available, the simplex can contract around its lowest (best) point.

How do we know when we have reached a minimum? This can be a difficult question in any multidimensional minimisation algorithm. The sensible thing to do is to see whether one of two criteria are satisfied, namely (i) identify one "step" of the algorithm - when the vector distance moved in one step is less than some tolerance, then stop, or (ii) require that the decrease in the function value in one step be fractionally smaller than some defined tolerance.

A final warning is that both these termination criteria can be fooled by a single, anomalous, step which doesn't go anywhere. To avoid this, it is a good idea to restart the any multidimensional minimisation algorithm from the point where it claims to have found a minimum. If it "really" has, this will not be a costly move, and if it hasn't, well ......

## Direction Set (Powell's) Method

Another technique for multidimensional minimisation is Powell's conjugate set method. This is another method not requiring the caclulation of gradients, however this time it is a method which utilises the 1-D line minimisation techniques we saw earlier. If we start at a point P in N-dimensional space, and proceed from there in some vector direction n, then any function of N variables f(P) can be minimised along the line n by using the 1-D methods. Different methods differ only in the way they choose the next direction, n, to try.

Let us look at the simplest direction set method. If we take the unit vectors e1, e2,....eN as a set of directions, we should be able to cycle through the directions, using our 1-D line minimisation methods to minimise along each direction in turn, before going on to the next. This is a good method in some cases, but figure 6 shows where this method struggles.

Figure 6 (see Press et al. p. 414)

In the case of a long, narrow valley, not aligned with our chosen axes (X and Y in this case), this method has to follow our base axes, and so must take many small steps to crawl towards the minimum. In this case, a more sensible choice of directions is needed. All direction set methods consist of recipes for updating the set of directions as the method progresses, attempting to come up with a set which either (i) includes some very good directions that will take us fast along narrow valleys (e.g. at 45o to X and Y in figure 6), or else (ii) includes some number of "non-interfering" directions with the special property that minimisation along one is not "spoiled" by subsequent minimisation along another, so that we do not have to contiually cycle through all directions.

Powell's method does just this. These "non-interfering" directions are more formally defined as conjugate directions, and Powell's method produces N mutually conjugate directions along which to minimise. First, we initialise a set of directions ui = ei where i=1,....,N. Next we repeat the following steps:

• Save the starting position as Po
• For i=1,....,N, move Pi-1 to the minimum along direction ui and call this point Pi
• For i=1,...,N-1, set ui = ui+1
• Set uN = PN - Po
• Move PN to the minimum along direction uN and call this point Po

• This method is illustrated in figure 7.

Figure 7 - Powell's conjugate direction set method. The first steps move to a new point, P, and line minimisation continues from P along the new set of directions x' and y which are x' is the resultant direction after our first step. In this case, it is clear that ideally we should be minimising along the axes xideal and yideal, but even after one step, our new set of directions is far closer to these than originally, which is promising.

It can be demonstrated that N iterations of the basic procedure i.e. N(N+1) line minimisations will exactly minimise a quadratic form. There is, however, one major drawback to Powell as stated - the procedure of throwing away, at each stage, u1 in favour of PN - Po tends to produce sets of directions that "fold up on each other" i.e. they all start to point in the same way! If this happens, then the algorithm converges only to a minimum of the function over some subsapce of the full N-dimensional case, giving the wrong answer. To avoid this, we can try a number of things:

• Reinitalise the set of directions ui to the basis vectors ei after every N or N+1 iterations. This is OK if it is important to retain the quadratic convergence property, which may be the case if our function is close to quadratic.
• Brent states that the direction set can just as easily be reset to the columns of any orthogonal matrix. Rather than throw away the information on conjugate directions already built up, he suggests reconfiguring the direction set to the calculated principal directions of this matrix.
• Powell himself suggested that we might sacrifice the quadratic convergence property in favour of a more heuristic scheme which attempts to find a few "good" directions along narrow valleys instead of N necessarily cojugate directions.

• In practice, any of these adjustments to the basic Powell method may be used, with each having slight differences in performance under different situations.

## Simulated Annealing

We will now briefly look at a few completely different and novel minimisation algorithms that move away from the idea of N-dimensional minimisation as a combination of a few relatively simple techniques such as line minimisation in 1-D and derviative methods. The first of these is simulated annealing. The minimisation algorithms we have looked at up to now all roughly follow the same principles - have a first guess at a solution, minimise in some linear direction, and keep going downhill (in function space) until we reach a minimum. These methods are all very well, but their common flaw is that they immediately head for the fastest downhill route, and will often find a local minima, as opposed to a global one - if we want to avoid this, we generally have to restart the whole thing from another location and see if it converges to the same solution. In a situation where the function has only a few distinct minima this may be sufficient, but in the case where the function has many small, poorly defined minima, this may be of litle use. Figure 8 illustrates this problem.

Figure 8 - schematic representation of a function with several local minima, A, B, C, and D, and a global minima at E.

Simulated annealing is so named because it has a direct thermodynamic analogy with the annealing, or slow cooling, of metals, or the crystallisation of liquids. At high temperatures, the molecules of a liquid move freely with respect to each other. If the liquid is cooled slowly, thermal mobility is lost, and the atoms are often able to line themselves up and form a pure crystal that is completely ordered over a distance of up to billions of times the size of an individual atom in all directions. This crystalline state is the state of minimum energy of the system. For slowly cooled systems, Nature is able to find this state every time. If a liquid metal is quenched, or cooled quickly, it ends up in some polycrystalline or amorphous state having much higher energy (this situation is partly analogous to the other minimisation processes we have considered). How is Nature able to solve such an incredibly complex optimisation problem, seemingly with no effort, when we are struggling to solve problems requiring minimsation of only a few parameters?

The basic process is one of slow cooling, allowing time for the atoms to redistribute themselves as they lose mobility (energy). The annealing process is based on the very important Boltzmann probability distribution

### Prob(E) ~ exp(-E/kT)

where k is a constant relating temperature to energy. This equation simply states that a system in thermal equilibrium at temperature T has its energy probabilistically distributed among all different energy states E. That is, all energy states are possible, but some are far more likely than others. Even at low T there is a small probability that the system may be in a high energy state. There is, therefore, a corresponding chance for the system to get out of a local energy minimum in favour of finding a better, more global one. This means our system can go uphill occasionally, as well as downhill, although as the temperature falls, it is less and less likely that it will go uphill.

The algorithm for simulated annealing is implemented in the following way. Offered a succession of options, a simulated thermodynamic system is assumed to change its configuration from energy E1 to E2 with probablity P = exp[-(E2 - E1)/kT]. If E2 < E1, P > 1. In this case, P is arbitrarily assigned the value 1 i.e. the system always took such an option. However, if E2 > E1 i.e. the option is to go to a higher energy state, P is postitive, but less than 1 i.e. there is a small but finite probability that the system may take this option. Hence our system will always take a downhill step that is offered to it, but will sometimes take an uphill one if offered that.

In practice, we need to provide the following elements if we wish to use this algorithm:

• A description of possible system configurations i.e. available states that the system "could" be in.
• A generator of random changes in the configurations - these changes are the "options" presented to the system.
• An objective function E (analog of energy) whose minimisation is the goal of the procedure.
• A control parameter T (analog of temperature) and an annealing schedule which tells how it is lowered from high to low values e.g. after how many random changes in configuration is each downward step in T taken, and how large is that step.

• Figure 9 illustrates this procedure.

Figure 9 - a 1-D function with 6 separate energy maxima and minima. In the annealing process, going from E1 to E2, E3 to E4 and E5 to E6 are all downhill steps, and so have a probability of 1. Going from E2 to E3, and from E4 to E5 are =uphill steps, and consequently have a small but finite probability of occurring. This small probability is what controls the possibilty of going from the local minima at E2 and E4, to the global minima at E6. Other minimisation algorithms that we have looked at do not possess this ability.

The simulated annealing idea is applicable to the optimisation of continuous N-dimensional functions, f(x), where x is an N-dimsensional vector. The four elements required by the above procedure are now as follows: the value of f is the objective function. The system state is the point x. The control parameter T is, as before something like a temperature, with an annealing schedule by which it is gradually reduced. There must also be a generator of random changes in the configuration i.e. a procedure for taking a random step from x to x + dx, which are the new "options" available to the system. In practice, this may be the most difficult of the steps to implement, and a number of solutions have been proposed for this (see Press et al., p 451).

All applications of simulated annealing suffer from the difficulty of defining the rate of cooling. How slow is "sufficiently slow"? The success of the approach may be controlled by this factor. There are a number of possibilities worth trying:

• Reduce T to (1-e)T after every m moves, where e/m is determined by experiment.
• Budget a total of K moves, and reduce T after every m moves to a value T = To(1-k/K)a, where k is the cumulative number of moves thus far, and a is a constant, say 1,2 or 4. The optimal value of a depends on the statistical distribution of relative minima of various depths. Larger values of a spend more iterations at lower temperature.
• After every m moves, set T to b times f1-fb where b is an experimentally determined constant of order 1, f1 is the current lowest function value available, and fb is the "best-ever" function value encountered.

• Simulated annealing has been applied in practice to solve a classic minimsation problem known as the "Travelling Salesman" problem (see Press et al., p 445-450). The aim is to find the shortest possible route for a travelling salesman visiting N arbitrarily located cities in turn before returning to the starting point (visiting each city only once). This kind of problem is known as an NP-complete problem, and computation time for an exact solution rises proportionally to ekN where k is some constant. As N increases, the time taken to try all solutions rapidly becomes prohibitive. Simulated annealing solves this with the number of calculations only scaling as a small power of N. This has made simulated annealing particularly useful in specialised problems such as that of locating components during chip design in order to minimise heat generation and interference.

There are a variety of other intriguing methods of multidimensional function optimisation that have arisen from i) the need to find sophisticated methods to solve a huge range of optimisation problems in a whole range of mathematics, physics, engineering and modelling applications, many of which do not surrender easily to the "standard" methods mentioned above ii) increases in computing power that have pushed development of more sophisticated algorithms iii) the observation of the way Nature has created very elegant ways of solving seemingly impossible optimisation tasks.

These methods are based on the observation that bioligical neural networks (i.e. brains) are exceptionally good at solving incredibly complex non-linear optimisation problems. They can do this by "training", system responses to behave according to previously experienced situations, and adapting these responses to new situations.

There is no universally accepted definition of an ANN, but a general definition is that an ANN is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") which usually carry numeric (as opposed to symbolic) data, encoded by any of various means. The units operate only on their local data and on the inputs they receive via the connections. The restriction to local operations is often relaxed during training.

This figure schematically illustrates the structure of a simple feed-forward neural net with only one hidden layer, and one output (the figure comes from http://www.cbs.dtu.dk/services/SignalP/methods_nn.html).

The neurons have several inputs but only one output. The output from a neuron is calculated as a function of a weighted sum of the inputs. A threshold can be set so that the neuron can have a non-linear output response - if the signal is less than the threshold, the output is zero, greater than the threshold and the ouput is 1, or the weighted sum of the inputs. In an ANN, neurons are connected (with the output of one neuron being the input of another neuron). When the ANN is presented with an input (e.g. a set of reflectance data) it will calculate an output that can be interpreted as a classification of the input (e.g. the model parameter values that result in this reflectance). It is possible to ``teach'' a neural network how to make a classification by repeatedly presenting it with a set of known inputs (the training set - e.g. directional reflectance data sets) and simultaneously modifying the weights associated with each input in such a way that the difference between the desired output (e.g. measured reflectance) and the actual output (e.g. modelled reflectance) is minimised according to some measure such as RMSE. The ability of the ANN to change the output to reflect the input is important, this ability gives the neuron the capability to learn. The more (distinct) training data we can throw at the ANN, the more refined the behaviour of the ANN should become i.e. better able to distinguish between finer variations in the input data. This is achieved via back propagation training.

A general Neural Network algorithm is as follows:

• Set the inital weights of each node
• Repeat
• Show input example and desired output
• Work out actual output
• Update weights
• Until no more examples

ANNs have a number of advantage of solving the types of non-linear optimistion problems we have been considering: they can be used where we can't formulate an algorithmic solution, where we can get lots of examples of the behaviour we require or where we need to pick out the structure from existing data. A major drawback is that when we change the nature of the training data e.g. different angular or spectral sampling, then the ANN must be retrained. It suggests that ANN inversion of multispectral data might be a possibility for fixed geometry, multiple waveband sensors (e.g. TM) but is not so useful in the general case (arbitrary viewing and illumination geometry). There are a number of examples of the application of ANNs to the problem of inverting biophysical parameters from optical and microwave data of vegetation canopies e.g. Chuah, 1993; Pierce et al., 1994; Gopal and Woodcock, 1996; Jin and Liu, 1997; Abuelgasim et al., 1998. In addition, ANNs have proved very promising for the classification of EO data, where supervised methods (requiring training data) are already the norm e.g. Foody, 1997a and 1997b.

An excellent introduction to ANNs is at Stirling University with many external links. For more information on neural nets see the page mentioned above, or the Neural Net FAQ .

It has long been noted that Nature manages to solve THE optimisation problem, i.e. maximisation of species survival and propagation, through genetic mutation and the subsequent favouring of a few of these mutations in the long term. If we can phrase our optimisation problems in terms of some non-linear "fitness-for-survival" function, genetic mutation and propagation models can be applied to these problems in order to find optimal solutions. Effectively a GA is a method by which some description of the function state (known as a "string" - equivalent to a gene pattern) is changed in some way according to a probabilistic model (typically crossover, mutation or inversion). In the case of a BRDF model, this would be the N-D vector representing the current state of the model parameters. This mutation is then propagated to the subsequent generation in some predefined way, such as "mating" - the recombination of two parent strings to create an offspring string. Finally, the new generation "mutation" is tested to see if it better fulfils our survival criteria (or has a lower RMSE in terms of model fit). In this way we can see that the GA progresses (as in Nature) through mutation and selection via sexual reproduction, characterised by recombining two parent strings into an offspring.

• A general Genetic Algorithm is as follows (see http://osiris.sunderland.ac.uk/cbowww/AI/TEXTS/ML2/sect4.htm) for an illustrated example of this):

• Populate a set of chromosomes
• Repeat
• Determine fitness of chromosomes
• Choose best chromosomes
• Evolve chosen chromosomes using crossover, mutation or inversion
• Until a chromosome is found of a suitable fitness

GAs differ from normal optimisation methods in four ways:

1. They work with a coding of the parameter set, not the parameters themselves
2. They search from a population of points, not a single one
3. They use 'payoff' information (some objective function) - not derivatives or other auxilliary knowledge.
4. GAs use probabilistic transition rules, not deterministic ones (c.f. simulated annealing).
The keys to GA are the inital coding of the description of the function state (into strings), and the way in which these strings are changed at each step according to some rule (e.g. crossover or mutation, mentioned above)

The most basic GA is as follows:

• STEP 0: Define a genetic representation of the problem
• STEP 1: Create an initial population P(0) = x10, ...., xN0. Set t=0.
• STEP 2: Compute the average fitness f(t) = SUMiNf(xi)/N. Assign each individual the normalised fitness value f(xi)/f(t).
• STEP 3: Assign each xi a probability p(xi, t) proportional to its normalised fitness. Using this distribution, select N vectors from P(t). This gives the selected set of parents.
• STEP 4: Pair all parents at random, forming N/2 pairs. Apply crossover with a certain probability to each pair and other genetic operators such as mutation, froming a new population P(t+1).
• STEP 5: Set t = t + 1, return to STEP 2.

• Some advantage of GAs are their flexibility and power - they may be able to solve problems characterised by very many small, ill-defined minima, the sort of problem that is typically almost impossible to solve by standard methods. Another major advantage is that they are intuitively understandable. The difficulty of using GAs practically is that they combine two different search strategies: random search by mutation and a biased search by recombination of the strings contained in a population, and defining the balance between them is by no means straightforward. More relevant to problems in CR model inversion is the fact that it may take a huge number of generations (iterations) for such a system to converge to a solution.

I have donwloaded several useful references to GAs, of which there are copies in the RSU. Also, see Goldberg (1988), MATHS NA20 GOL in the DMS Watson library.

Knowledge-based systems (KBS)

KBS comprise a variety of methods seeking to solve complex computational problems using sources of information external to the problem itself. In CR model inversion, work done on KBS attempts to use an understanding of the inversion problem to drive the optimisation process. An example in vegetation modelling is the work of Kimes and Harrison (1990) and Kimes et al. (1991) who developed VEG, a system for inferring physical and biological surface properties from reflectance data sets. VEG integrates the traditional spectral (directional) data with diverse knowledge bases derived from literature, field-measured directional reflectance data sets and information from human experts into an efficient system for inferring biophysical parameters from reflectance data. Significant devlopmental work was done on KBS in the late 80s - early 90s but the major problem of such systems, namely how the knowledge is defined and input into the system, has never been overcome satisfactorily. This approach has been largely taken over by ANNs, where the decisions as to what response to a particular input is used is concealed in the 'black box' hidden layer(s) of neurons. The drawback of this method of course is that no knowledge of how the problem is being solved within the black box is available - something that is made explicit in the KBS approach. A more promising approach for CR model inversion than either ANN (limited by training data) or KBS (how do we define knowledge data?) is that of look-up tables (LUT).

Look up table approaches (LUT)

The concept behind the LUT approach as applied to CR model inversion is to separate the calculation of canopy reflectance from the problem of finding the optimum solution (model parameter set) to a specific model inversion. A set of discrete CR scenarios can be pre-computed e.g. according to some geometrical/spectral criteria (sampling regime), and the corresponding parameter values are then stored in the LUT. The optimisation task is then to find the required values in the LUT as quickly as possible, given a reflectance data set. Knowledge of external parameters e.g. VIs, cover type, viewing/illumination geometry etc. are then used to limit the search space in as many ways as possible. Closest matching parameters can then be pulled out of the LUT yielding the inverted model parameter set. The use of external information (knowledge) to limit the search space implies that LUT methods are closely related to KBS methods. LUT methods have a simpler requirement for the use of external data however, and can be applied quite simply in practice e.g. MODIS LAI/FPAR algorithm (see MOD15 LAI/FPAR ATBD, (.pdf version - view using acroread)). An important advantage of the LUT approach is that all the computational donkey work i.e. the repeated running of the forward model is done prior to parameter retrieval and is therefore not a run-time operation. In addition, as developments and improvements are made to the underlying CR model, the LUT can be recomputed and updated accordingly.

## Summary

A number of minimisation (more generally, optimisation) techniques that can be used to derive biophysical parameter information via the inversion of CR models against reflectance data have been introduced. A brief summary of these, including some of the pros and cons in each case is given below.

• most commonly used (ignoring derivative information) are Powell (conjugate direction set) and Nelder and Mead (amoeba). Relatively fast and robust, but tend to converge to local minima -> need to run several times from different starting positions.

• Simulated annealing - probabilistic partition of energy states of a system - allows finite probability of system going from low -> higher energy state so big advantage is ability to get out of local minima. Drawback is maybe slower, and how to define annealing schedule.

• ANNs - network trained using "known" reflectance data, to derive model parameters from measured data. Very fast once trained. Limitation is need for training data sets to be of same nature (i.e. viewing, illum. geometry, spectral bands) as measured data. Has been applied to EO data. Also good for classification.

• GAs - very novel approach to problem of N-D optimisation - mutation and reproduction, survival-of-the-fittest. Advantage is long, gradual change may allow for very subtle solutions but likely to be very slow - few examples of applications in EO (eg. Clark and Canas, 1993), but many in engineering.

• KBS - use additional information to refine process of finding solution e.g. Kimes et al. VEG model. Advantages, maximise use of data, potentially more robust inversion. Disadvantage of how additional information is defined and used.

• LUT - subset of KBS (?) - Precalculate many CR scenarios e.g. many viewing/illum geom., spectral bands, biome types etc. prior to run-time. At run-time, follow a search path (tree) using additional data (e.g. biome type), to drastically reduce search space before searching for best-fit solution. Fast once LUT is computed and can just recompute LUT if model is altered. Disadvantage is defining search path and additional data to be used.

Abuelgasim, A. A., Gopal, S. and Strahler, A. H. (1998) Forward and inverse modelling of canopy directional reflectance using a neural network, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1998, Vol.19, No.3, pp.453-471.

Brent, R. P., 1973, Algorithms for Minimisation without Derivatives (Englewood Cliffs, NJ: Prentice-Hall).

Bunday, Brian D. (1984) Basic Optimisation Methods (Edward Arnold, London).

Chuah, H. T. (1993) An artificial neural-network for inversion of vegetation parameters from radar backscatter coefficients, JOURNAL OF ELECTROMAGNETIC WAVES AND APPLICATIONS, 1993, Vol.7, No.8, pp.1075-1092.

Clark, C. and Canas, T. (1993) A genetic algorithm approach approach to spectral identification, Proc. RSS '93, Chester.

Foody, G. M. (1997a) Fully fuzzy supervised classification of land cover from remotely sensed imagery with an artificial neural network, NEURAL COMPUTING & APPLICATIONS, 1997, Vol.5, No.4, pp.238-247.

Foody, G. M. (1997b) An evaluation of some factors affecting the accuracy of classification by an artificial neural network, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1997, Vol.18, No.4, pp.799-810.

Gopal, S. and Woodcock, C. (1996) Remote sensing of forest change using artificial neural networks, IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 1996, Vol.34, No.2, pp.398-404.

Goldberg, D. E. (1988) Genetic algorithms in search, optimization and machine learning, Addison-Wesley, 1988. - 0-201-15767-5.

Jin, Y. Q. and Liu C. (1997) Biomass retrieval from high-dimensional active/passive remote sensing data by using artificial neural networks, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1997, Vol.18, No.4, pp.971-979.

Kimes, D. S., Harrison, P. R. (1990) Inferring hemispherical directional reflectance using a knowledge based system, Proc. IGARSS '90, Washington, p. 1759-1764.

Kimes, D. S., Harrison, P. R. and Ratcliffe, P. A. (1991) A knowledge-based expert system for inferring vegetation characteristics, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1991, Vol.12, No.10,pp.1987-2020.

Kimes, D. C. and Nelson, R. F. (1998) Attributes of neural networks for extracting continuous vegetation, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1998, Vol.19, No.14, pp.2639-2663.

Pierce, L. E., Sarabandi, K. and Ulaby, F. T. (1994) Application of an artificial neural-network in canopy scattering inversion, INTERNATIONAL JOURNAL OF REMOTE SENSING, 1994, Vol.15, No.16, pp.3263-3270.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P., (1994) Numerical Recipes in C (CUP, 2nd ed., pp. 994).

Privette, J. L. (1994) An efficient strategy for the inversion of bidirectional reflectance models with satellite remote sensing data, Ph.D. thesis, University of Colorado (postscript version - view using pageview, 3.12MB).

Take me home..