# Advanced Numerical Methods

### Lecture 1: Introduction to Differential equations via Infinitesimal differences

Mathematical models are essential in solving many Engineering problems and these models can lead to the need for mathematical procedures that include differentiation, non-linear equations, integration, simultaneous linear equations and differential equations etc. These mathematical procedures can be used to solve many kinds of problems exactly if you understand how to apply them however numerical methods are more likely to be used to solve them approximately. [1]

### For example:

DCF is a method that is used for valuing a project, company or asset using the idea of the time value of money. The future cash flows are basically estimated and then they are discounted in order to find the present values. The DCF method is usually used in investment, real estate development and many other things. [2]

To calculate the discounted cash flow the following formula is used:

And so

Where

DPV is the discounted present value of the future cash flow

FV is the nominal value of a cash flow amount in a future period

i is the interest rate

d is the discount rate

n is the number of years before the cash flow occurs

The example used to explain how to use the formula in the lecture was:

John Doe buys a house for $100,000, three years later he expects to be able to sell the house for $150,000.

The interest rate needed in order for him to achieve $150,000 after three years would be:

So it can be seen that the interest rate value that he needs in order to achieve the property value that he wants in three years is 14.47%.

But since three years has passed since the purchase and the sale of the house it means that the cash flow from the sale has to also be discounted. So if when he bought the house the 3 year US Treasury Note rate is 5% per annum the present value would actually be:[3]

There are some risks involved with doing this like we are assuming that the $150,000 is John's best estimate but as it's about house prices nothing can be guaranteed and so the outcome could be different than what is expected.

When DCF analysis is usually done is the net present value after discounting the future cash flows is positive. If it is negative it would actually mean money is being lost even though there would appear to be some nominal profit being made.

Methods

There are usually four DCF methods that are used today:

* Equity- Approach

o Flows to equity approach

* Entity- Approach

o Adjusted present value

o Weighted average cost of capital

o Total cash flow

Time is money

Interest: “The cost of borrowing money, usually expressed as a percentage per year.”[4]

Interest can be applied to situations like if a person takes a loan from a bank, or if someone opens a savings account the bank will pay interest. However due to the economy at the moment it may not have a major effect on the amount as the interest rate is 0.5%. This is the lowest it has been in a long time.

When money is borrowed from a bank lender interest has to be paid which is basically a fee for borrowing the money. And so by the end of one year it will be:[3]

P is the principle; the money borrowed

r is the interest rate

For n years it would be:

This is similar to the Binomial expansion; in mathematics, the binomial theorem is an important formula giving the expansion of powers of sums.

So using the binomial series the expansion of the series above is

[5]

For example if P=100, r=8%, and n=2 years by inserting the numbers into both formulas we should receive the same value.[3]

Using MathCAD

This is the formula for calculating compound interest and it can be applied to a situation of when you deposit money in a savings account as you are lending the bank money and so interest will have to be paid to you.

So it can be said that money is time dependant.

Frequent Compounding:

The usual assumption is that interest is paid once a year but interest can be paid more frequently. It can make a big difference on the outcome:[3]

For Example:

If a £1000 is invested at 8% after one year:

But for quarterly compounding:

And if it's done monthly:

For the £1000 value used there really isn't much difference between the yearly and weekly compounding only a few pounds. However if the original value was bigger if it was in the hundred thousands or millions then whether the compounding is done weekly or yearly would make a huge difference. [6]

It can be seen that if the compounding frequency is less frequent than the payment frequency then it has little or no practical value. As the payment has already been made and the balance is reduced and there is nothing to compound. That's why the graph below is asymptotic and can be seen to start levelling off at 1083.287. [6]

Plotting the values from the frequent compounding formula in MathCAD the results can be seen below.

Also if the compounding is more frequent than the payment then so does the effective rate but more slowly.

What happens to k ∞?

As k tends towards infinity the effective rate will tend toward zero.

### Continuous compounding:

Continuous compounding is when the compounding period is infinitesimally small so that the limit of n tends towards infinity. [1, 2]

The sequence:

as k ∞

Deriving the equation from first principles:

Let the equation above be equal to L

Taking the natural log of both sides

Then

If log of a product is the sum of the logs:

So

When k gets large r/k gets smaller so ln(1+h)≈h can be used

Taking the Exp of both sides

How many days in a year

Some countries use a 360 day interest year like in the US. However Britain and the UK use a 365 day interest year even in a leap year. But some do use the 366 day interest year for a leap year. For the matter of daily compounding it doesn't make that much of a difference if the year is taken at 360, 365, or 366 days. If taken at 8% for a million pounds there is a difference of only 16p between a leap year and a normal year.[2]

### Double your money

How long does it take to double money invested in a savings account at 5% and at 7%?

A rule of thumb says that at x% money will double in 70/x years and 50/x years. So at 5% it would take 14 years and at 7% it would take 10 years to double the amount of money.

This comes from the ‘Rule of 72,' ‘Rule of 70' and the ‘Rule of 69'. It can be seen the number in the title is divided by the interest percentage so that the approximate number of periods can be obtained that is needed for doubling the amount of money. The rules are usually applied to exponential growth and so for compound interest. [2, 3]

### Deriving the rule:

A is invested, r is the interest rate and n is to be found

Taking the logs of both sides:

So

The A's cancel leaving

The log function can be used to any base by applying the formula

But for this the natural log can used instead

Natural logs are used so that the Maclaurin series can be applied

The Maclaurin series is

So taking

The Maclaurin's series is just a special case of the Taylor Series so when r is very small: so as |r|<1 for this example the terms of the Maclaurin series can be ignored. So [4]

If we round 0.6931 to 0.7

So using Excel it can be seen below that the different percentage rates how long it takes for money to be doubled.

x%=r

Rule of 70/x

ln2/ln(1+r)

1

70

69.66071689

2

35

35.00278878

5

14

14.20669908

7

10

10.24476835

10

7

7.272540897

The rule of 70 was derived under the assumption of annual compounding:

If

So it is much simpler as the Maclaurin series is not needed.

Regular investment pays

If at the beginning of each year P amount is invested into a savings account which pays interest at a rate of r. The amount of money that will end up in the account after n years can be represented by [2]

FV of first value

FV of second value

For n payments

Early Investment Pays

The best way to explain this is to simply solve an example.

Alice and Bob were 19 years old when they started new jobs; Alice started a savings programme investing £1000 at the beginning of each year at 10% per annum. At the end of 7 years she has £10435.89. Bob has not saved any money until now. He starts to investing like Alice, but Alice has stopped adding money to her account and is letting it grow at 10% p/a. How long will it take for Bob to overtake Alice?[2, 5]

### The calculations can be seen on the next two pages:

Year

Alice

0

10,435.89

1

11479.48

2

12627.43

3

13890.17

4

15279.19

5

16807.11

6

18487.82

7

20336.6

8

22370.26

9

24607.28

10

27068.01

11

29774.81

12

32752.29

13

36027.52

14

39630.27

15

43593.3

16

47952.63

17

52747.9

18

58022.69

19

63824.95

20

70207.45

21

77228.19

22

84951.01

23

93446.12

24

102790.7

25

113069.8

26

124376.8

27

136814.5

28

150495.9

29

165545.5

30

182100

31

200310

32

220341.1

For Bob:

Analysing the results it can be seen that it takes 32 years for him to save more than Alice and he has had to invest much more than she had to. She just invested £7000 in the first 7 years and then let it grow but he has had to invest £1000 every year until he overtakes her.

This shows that by investing a little bit of money early on can be more beneficial than investing a lot of money for a short time later on.

### Repayment of Loans

When a house or car is bought it may be necessary to borrow the purchase price from a bank and then the money can be paid back in regular equal instalments over a period of time. If the amount of money borrowed is A, the interest rate is r and the number of repayments is n then the amount needed to be repaid can be found using: [2, 6]

The main point is to look at the problem from the banks point of view: So A must be balanced by the present value of the repayments.

The present value of the first repayment is

The second repayment is

So for the nth repayment is

The total repayment value of the repayments is going to be

If a loan is taken out and then it is paid back in full after 10 years this is quite easy to do. However if the loan is paid back every month then interest has to be added on. What happens to the repayments can be seen below. [2]

### Example:

A loan of £100,000 is going to be repaid over 25 years in equal monthly instalments. If the annual interest rate is 7% what is the monthly repayment?[2]

A=100,000, r=0.07/12 because we are looking at the monthly repayments, n=25x12=300.

R= 706.78

The repayments are made up of two components interest on the outstanding balance and an amount of capital repayment.[6]

The conclusion that can be drawn from this is that if someone takes out a mortgage and keeps it for two or three years and then they sell it, they would have paid back nine or ten thousand ponds but this will have mainly been the repayment of the interest and not much of the actual capital repayment. But if the mortgage is sold after about 20 years then the repayments will mainly be the capital repayment. This shows that there really is no advantage to paying off a mortgage early.[5]

### Compound Interest Deception

The way that compound interest works can be used by dishonest advertisers to fool customers.

### Example

A loan of £2000 is offered at only 5% to be repaid over a year in monthly instalments.[2]

A=2000, R=175 and n=12. The monthly rate is the unknown r. So letting (1+r) =x

### Solving in Matlab

e=[80,-87,0,0,0,0,0,0,0,0,0,0,0,7];

r=roots(e)

r =

1.0076

1.0000

0.7026 + 0.4786i

0.7026 - 0.4786i

0.3600 + 0.7303i

0.3600 - 0.7303i

-0.0406 + 0.7936i

-0.0406 - 0.7936i

-0.7750

-0.6796 + 0.3767i

-0.6796 - 0.3767i

-0.4149 + 0.6643i

-0.4149 - 0.6643i

The root is 1.0076

1+r=x

1+r=1.0076

r=0.0076

r=9.12%

So looking at this it can be seen that the actual rate is not 5% per annum but actually 9%.

### Lecture 3: Quantitative Methods: Monte Carlo Simulation

The Monte Carlo method is used for numerical evaluation for integrals and is based on the use of random numbers. One of the reasons that the Monte Carlo method is popular is that it can be implemented on integrals of higher dimensions. But it does have its drawbacks one of the main ones being that it has a very slow order of convergence and it can be difficult to implement on integrals with complicated boundaries.[1]

The first part of the lecture was to show working things out numerically works just as well as using rules to integrate or differentiate a function for example:

With certain functions like f(x) =sin(x) we all know that the derivative is f'(x) = cos(x) and this can be proved as shown below. [2]

Let f(x) =y

### Applying the trig formula

When

If

It can be seen that it is very complicated and easy to make mistakes when done like this. However this can be proved much more simply than having to go through the process of deriving and the way that this can be done is numerically:

The numerical method is about finding out the relationship between the differential equations and difference equations. The two theories are related so that the coordinates only have discrete values, and the link involves values of the unknown function and values at coordinates that are close. A lot of methods like this are used to work out numerical solutions to differential problems and so find the approximation of the answer of a differential equation by the solution of a corresponding difference equation.[3]

Example to prove that f(x) =sin(x) so f'(x) = cos(x): MathCAD

The green line was just the plot of the sin wave, then the red line was the plot of the cos wave. The aim was to find the relationship between the difference equation and the differential of sin(x). Once the difference equation was plotted it could be seen that it fits almost perfectly over the cos wave and this shows that the differential of sin(x) =cos(x).

Looking at a different example f(x) =ex

Deriving this first shows that

Differentiating each power on the RHS

If, then

Now proving this numerically

Looking at the two graphs it can be seen that the plot of Et is the blue dashed line and then once the difference equation is plotted it fits over the first plot proving that the differential of ex is ex.

Dealing with simple functions like the ones above it is very easy to work it out analytically but when dealing with complicated integrals then it can be easier to apply the Monte Carlo method. [4]

To understand the Monte Carlo integration it's necessary to understand how normal integration works: If the area A is what is required within the interval [a, b].

If it is very simple then it can be achieved by summing over N points at a regular interval Δx.

So

This takes the value of f from the midpoint of each interval.

In M dimensions the interval is taken as ([a1, b1], [a2, b2],.. [aM, bM])

So

Therefore

V(M) = M dimensional volume defining the integration “area”.

N= Total number of Points

It can be said that V(M+1)= VM x average of f

So avg

Sampling for the Monte Carlo Integration is very similar to what was seen above however instead of sampling at regular intervals Δx random points are used and then the average is taken. So say that N points xi picked in the interval [a, b]:[4]

So thats how normal integration and the Monte Carlo Integration are similar and what the differences are:[4]

If you want to integrate a function where f(x) is a complicated and difficult the Monte Carlo method allows you to look at the problem as (b-a) multiplied by the average of f(x). So it is really looking at the relationship that has to do with probabilities and integrals. So x is picked at random f(x) is found and then averaged. This method can be applied to functions with many dimensions. [4]

So if the function can be easily evaluated but is difficult to integrate analytically then the Monte Carlo technique is very useful. It involves selecting a region like A and it includes the region B the area that represents the integral of the function. So Random points within region A are generated and the number of the points that fall into Region B are calculated. So the integral is basically the product of the integral of A and the fraction of randomly sampled points that are found in the Region B [4]

### Examples:

The integral below is quite simple to calculate and entering it into MathCAD returns:

If we were to use the Monte Carlo theory and work it out by:

r 0 returns the result

The for loop is for generating the 10000 random numbers

Next we set the range in which we want the random numbers as the limits are 0 and pi we can set for between 0 and 1.

Then finally we divide r by 10000

And the result can be seen below as 2.

Example 2: Taking a very simple example again it can be seen that using normal integration the correct value is easily returned when entered into software like MathCAD.

Looking at it numerically:

This time using 50000 random numbers, and setting the range for the random number between the two limits 2 and 5 and then dividing r by 50000 returns a value 39.113 that is very close to the original answer 39.

Example 3: In Matlab

y=x^(2-x)+1;

a=0;

b=10;

n=50000;

sum = 0;

for i=1:n

x=rand(a,b);

y =x;

sum=y+sum;

end

Integral = (b-a)*(sum/n)

Integral = 12.40

### Lecture 4: Eigenvalues and Eigenvectors

This lecture focused on matrices and how to solve problems with many variables using the Eigen system.

Looking at the following example:

134px-Right_angle We know that this is 90 degrees but why is it 90 degrees and why is that important? The reason that it is 90 degrees is because of the independence of x and y. So if the value of y changes then x is not affected and the same for if the value of x changes then y is not affected. And in many situations the independence of the two values is very important. Like when looking into the Theory of Induction and the dimensions of space it is important to understand the need for variables to be orthogonal.

A matrix is a set of real or complex numbers arranged in row and columns to form a rectangular array. A matrix does not have any rows but only columns. A matrix can be written as a row only when the Transpose of the matrix is taken. A matrix also can be thought of as vectors which are represented by the columns which define the space in which vectors exist. [1]

So if we were to solve a set of simultaneous equations like:

This can be solved the normal way by multiplying the equations by a number so that either x or y are the same in both equations and then can be eliminated to give the other variables solution, which can then be subbed back into the equation and solved.

However a much easier way to do this would be to simply use a matrix:

A u v

Matrices can be used to solve an equation with thousands of variables and can be applied to any mathematical problem and is a very powerful technique.

All the equations must be linearly independent and so:

M E Λ E-1

Diagonal

Where E is the Eigen vectors

Λ is the Eigen value

M is a real square matrix the symmetric Eigen problem is to find the Eigen values λ and the corresponding Eigen vectors e≠0 so[2]

The Eigen values λ are all real and the Eigen vectors can be chosen to be mutually orthogonal. So

Where Λ is a real diagonal matrix whose diagonal elements λ are the Eigen values and E is a real orthogonal matrix whose columns ei are the Eigen vectors so eiT is the transpose:

The equation can be rewritten as

Known as the Eigen decomposition or the spectral factorization of M, (Discussed in more detail below).[2]

Eigen values of a symmetrical matrix are not really that sensitive to disturbances in the original matrix M. The sensitivity depends on how small the gap is between the Eigenvalue and any other Eigenvalue so the smaller the gap the more sensitive it is. [2]

An Eigen value basically shows the proportion of total variability in a matrix linked to its corresponding eigenvector. And so the eigenvector that is related to the highest Eigen value basically shows the dimension that generates the most amount of individual variability in the variables. And so the next eigenvector will be a dimension perpendicular to the first that relates to the second largest amount of variability. [2]

The reason that it is a system that is very powerful is because when measuring a real physical system you cannot know how many physical variables there will be and this system will allow you to deal with it no matter how many variables there are. And so it can be applied to any mathematical problem.

### Eigenvalues and Eigenvectors:

The trace of a matrix is used for only square matrices and is equal to the sum of the diagonal elements of the matrix. And only square matrices are orthogonal and satisfy the equation:

So an orthogonal matrix is simply the transpose of that matrix. And the matrices of eigenvectors are orthogonal matrices and are very important.

The Eigen values and eigenvectors of a matrix are important in things like multivariate analysis. And it applies to correlation matrices and covariance matrices.

In some applications of matrices and in things like solving technological problems involving coupled oscillations and vibrations equations can sometimes take the form of:

### Proof:

When A= [aij] is a square matrix and λ is a number. So x=0 is a solution but is not helpful and so for non trivial solutions x ≠0 the values of λ called the Eigen values of the matrix A and the corresponding solutions of the equation are called the Eigen vectors.[3]

Expressing them as a set of separate equations:

So:

So if the right hand side terms are brought to the left hand side:

And so (A – λI) x = 0

And for linear equations to have a non-trivial solution:

This is known as the characteristic determinant of A and |A + λI|= 0 is the characteristic equation.[3]

And by expanding it can be seen that one term in the equation is given by and that all the other terms contain the (n − 2) power. Hence the equation can be written as:[4]

So this equation establishes two relationships between the Eigen values, the trace and determinant of a matrix. Taking an arbitrary 2×2 matrix A below, we have that:[4]

* The trace of matrix A is basically the sum of the diagonal values of the matrix above and is equal to the sum of the Eigen values.

* The determinant is basically equal to the product of the Eigen values. [4]

### Diagonal form of a matrix:

This is the process of transforming matrix A into a similar matrix which will only have non-zero elements in the diagonal. It is assumed that the roots of the characteristic equation of A are real. Then A can be put into a diagonal form as shown below.[4]

If R is a matrix with columns of Eigen vectors that are n independent of A

R={v1|v2 |...|vn]

And since the vectors are linearly independent

AR=RD where

So A is diagonalizable. If all the Eigen values are different then the Eigen vectors are linearly independent which is very important. And so in this case it can be easily seen that the matrix is diagonal. [4]

Example: In class

If you implement the formula discussed above M=EΛE-1

It can be seen that the you get the original matrix again.

And if you subract the original matrix you can see the results below:

Example: using a 3x3 matrix

Example: Trying to use a 3x2 matrix in MathCAD:

This shows that orthogaonality is very important in Eigen values and Eigenvectors.

### Lecture 5 and 6: Mini-project on Pisarenko Algorithm

### INSERT

For this example we want to find the frequencies ω1=0.3 and ω2=0.4

Using the following equation:

And plotting the results:

The area we interested in for this example is the part indicated by the arrow so:

So if we only look at the first 6 values for this:

Plotting the Circulant matrix B using the six values we saved above:

It can be seen that it is simply the phase that is being shifted and so plotting each column as below:

So we have a matrix above but we want to find the Eigen system but the matrix is not square so it can't be done when it is in this format. So by multiplying the matrix by it transpose we can find the square matrix and so go on find the Eigen system.

What can be seen from the data above is that the imaginary part of the natural logarithm of the polynomial roots of the smallest eigenvector of the covariance matrix of the circulant of the sampled time series is equal to the original frequencies. This piece of work will focus on investigating why this is true?

Example 2: Using different frequencies and different sample size:

This time we are trying to find the frequencies ω1=0.8 and ω2=0.5

This time a much larger sample size was used to see the effect it would have on the results

And again to find the Eigen values the covariance matrix had to be calculated:

And applying the same rules as before it can be seen that the original frequencies are returned.

### Circulant Matrices

One of the key parts of the process as can be seen above is constructing the circulant matrix and it is related to some aspects of Discrete Fourier Transform which is key to finding the frequencies. The importance of DFT really is from mathematical principles that relate to harmonic analysis. [1]

### The convolution formula:

And so it can be presented as:

Where:

This matrix above is a circulant matrix and is the same type used in the examples above. “The right eigenvectors are sampled harmonic signals and the left eigenvectors are the sampled conjugates”.[1]

And all circulant matrices can be represented as:

Where:

This can be used to find (f x g):

Which is the convolution theorem.

The Retrieval of Harmonics from a Covariance Function by V.F.Pisarenko [2]

### PROOF: Taken from the original Pisarenko Paper:

There are many applications that are related to sinusoidal frequency estimation like communications, speech analysis, instrumentation and measurement and for many other things. The Pisarenko's Harmonic retrieval method is one of the first Eigen structure based spectral estimation technique.[2, 3]

This method involves the covariance matrix being used to retrieve harmonics. There are many new methods that have taken this method and improved it. But this method is looking at the theorem of Caratheodory related to the trigonometrical moment problem.[2]

1. If is a complex stationary process involving a randomly generated sequence of observations and each one can be thought of as a sample of one element form a probability distribution. One of the assumptions made is that the mean value of the process will be zero. And represents the spectral function of the process and is linked to the covariance function.[2]

And if is considered a real process: Then the spectral density (depending on if is continuous):

And so if has a jump at at the frequency the equation above is still valid if are added so has a component of and also if also has a harmonic of .[2]

And then if we let

The assumption that the values are absolute and has no noise or error is made.

So the next step that we can take is to let the segment of B(k) be so that the covariance matrix B is non-degenerative:

So this means that there will be many spectral functions that will be related to the function B(k) when |k| m. [2]

In order to prove this we can let MEE:

Where is the inverse of the matrix B. It is also called the maximum entropy estimator.

Among the spectral density functions that satisfy:

And since the assumptions of B above is that it is positive and definite the matrix will also be positive definite for small values of μ. In the equation below represents the identity matrix of order m+1. [2]

By looking at the inverse of by then:

As it can be seen that this corresponds to the given segment of the covariance function so the second component of the equation above represents the in the covariance function and so the proof works.

This comes from the Caratheodory theory and if the spectral function of discrete time is based on a finite segment of the covariance function then it must be considered as even if the values are known there are still infinitely many corresponding spectral frequencies. There is no upper limit for these at any frequency. However in some situations the upper and lower limits are required and can be found for some spectral function that has been smoothed by a spectral window.[2]

### This introduces the next theory:

If is a spectral window for a real periodic function λ having a period of 2π so:

K is a constant. So the smoothed spectrum related to the original equation stated at the beginning of the paper is:

The upper and lower limit exists based on the finite segment of the covariance function:

represents the Fourier coefficients of the function :

So this can be used to find the power contribution of the continuous and also the discrete spectral components that have been removed by the spectral window.

2. A new method for finding the harmonics from a finite segment of the covariance function:

This part of the paper looks at the theorem of Caratheodory.

If are complex numbers and the assumption that then there will be some integer that is r that will exist, and real numbers where j=1,...r so that [2]

The constants have to be found uniquely

Now looking at the segment where and entering it into the theory above:

If k is set to zero:

The equality sign will only apply if the matrix is degenerative and if it isn't then the strict inequality holds. So:

And if B(k) = B*(k) then:

Where for non degenerative matrices, so overall the equation above is a representation for the segment of the function B(k) in terms of a sum of harmonics. A true covariance function will contains r harmonics and also maybe a function too. So the representation then produces the true frequencies since they are found uniquely. But in some cases there can be situations where the frequencies and the amplitudes found are not the real ones. So if B(k) is real the complex harmonics in the equation above must be even in number and must occur in complex conjugates to produce real cosine harmonics:[2]

So:

Where, if the true number of cosine harmonics in a real covariance function is not more than m/2 then they can be recovered from a segment of length 2m of the function.

The next part is looking at how the numbers are found:

Algorithm for working out the numbers discussed above:

1. The first step to doing this is to find the minimum eiegenvalue of the matrix B (order m+1). The number is equal to the amplitude for

So if then the in the equation is no longer applicable. So there are two possibilities: v=1 or v>1. I f v=1 then the matrix B- is considered. If it is the second case then the principle minor of the matrix is considered with an order of r+1 where (r=m-v):[2]

The rank of the matrix is r and the order is (r+1).

2. The eigenvector of the matrix above corresponding to the unique zero eigenvalue and the components need to be found.

3. The roots of the polynomial are evaluated:

The roots of which are represented by They are all different and of the modulus unity.

The numbers are the frequencies.

4. The amplitudes can be expressed by . So to make it easier for numerical calculations another derivation is done:

In order to find then are all different then using the difference equations:

When

If are not equal to zero or pi then the imaginary part of the equations will have to be considered.

The first matrix and the equation above determine the amplitudes uniquely. So if then the real parts can be:

The determinant of this is the second matrix above and so is not equal to zero.

The next assumption that is made is that amongst there will be one pair with identical modulus so assume and the others are different. So[2]

If

Then:

In the systems above all the moduli are different and can be found from:

So the system above is used to work out the amplitudes. For a real covariance function the equation below is used:

The moduli again are all different so the equation above can be taken for k=0,1,.... And so are uniquely found. So the algorithm proved above can be used to determine both the frequency and amplitude.

The next part is to find the connection between and the harmonics.

If the assumption that the minimum eigenvalue is simple is made then r = m and the polynomial can be found as the determinant:[2]

For μ < μo the MEE for the matrix B-μI it can be denoted by and the first row of the matrix (B-μI)-1can be denoted by :

From a previous equation it can be seen that the coefficients of can be seen as :

If the det>0 then the roots of the polynomial:

Will lie inside the unit circle in the complex z plane:

So for μ < μo:

From this we can see the roots of tend toward the roots of the polynomial as μ increases μ0.

And letting

Using the theory of complex residues:

If k<0 the inequalities are changed to the related complex conjugates and as μ increases μ0:

The conclusion drawn is that

This really describes some of the properties of the amplitude and also can be used to explain why this theory works in finding the harmonics from any white noise.

The overall conclusion for this method is that it can be used to retrieve the harmonics from a finite segment of the covariance function and can be applied to non-ergodic processes which contain spectral functions that have jumps. It is a method that can be applied to many different situations especially in practise and usually a time series that has some deterministic harmonics where the amplitudes and the frequencies are unknown.

One of the other theories used also is Euler's Formula that is quite significant to the Pisarenko proof shown above. It is the relationship between the formula and trigonometric expressions. [4]

It states that:

So it can be derived from:

And they can be found by adding or subtracting Euler's Formula:

This can then be used to work out sine or cosine.[4]

Looking at the method that we have implemented in the examples carried out in class, there seems to have been a number of methods involved in actually working out the frequencies at the end. It seems that the theory behind circulant matrices and their relationship with DFT played a major role and the finding the covariance matrix meant that we could calculate the eigenvalues and eigenvectors. Then the theory that was found by Pisarenko played the most fundamental part (proof shown above) in finding the frequencies as well as the Euler's formula.

### Lecture 7: Singular Value Decomposition

Looking at SVD is as important as looking at the previous route of using the covariance matrix to find information. SVD is just as powerful because unlike the covariance route there is no information that is lost. With the covariance method when looking at the eigensystem the left and the right hand side vectors are the same so when we go through the process one set of that information is lost permanently and that's why it's important for the two sides to be identical. However with the SVD method no information is lost at all and also this method allows you to break down non-square matrices.

If there is a matrix A which is an m x n matrix then SVD is:

Where U is the an m x m unitary matrix

Σ is an m x n diagonal matrix with real, +ve numbers on the diagonal

And V* is the conjugate transpose of an n x n unitary matrix.

This is known as the singular value decomposition of the matrix A.

The diagonal values of Σ are known as the singular values and can be thought of as Scalar controls that affect the inputs which are multiplied by it to give the output. And looking at V the columns of this form a set of orthonormal input vector for A and this is related to the A*A for which this is the eigenvectors of. As for U the columns of this are also orthonormal but instead of forming the input actually form the output vector directions for A and this is related to AA* for which this is the eigenvectors of. And for Σ the diagonal represent the square roots of the eigenvalues related to the same columns in the other two matrices. [1]

SVD Proof:[2]

This means that is the part that is the eigenvalues and its V that has the eigenvectors. So to find V the first thing that has to be done is the the eigenvalues have to be put in decresing order so:[2]

So singular values of A are:

The eigenvectors are:

If are the eigenvectors linked to the non-zero eigenvalues

And if are the eigenvectors related to the zero eigenvalues

So this leads to:

So to find Σ which is the diagonal matrix with the singular values along the diagonal:

The singualar values are arragnged in decreasing order and the zero values are placed at the end of the end of the diagonal.

To find U:

It can be said that:

So if j= 1,...,r

is a scalar and and are column vectors .

So:

The equation above can be used to find the column vectors of U and if

Where:

Then:

So what can be found now is that:

And that forms a basis for N(A).

And the orthogonal part of N(A) that R(AT). So to find U2 . And

And since U is orthogonal the vectors that form U2 are in N(AT). So since matrix V, matrix Σ and matrix U haev been found the check left is to see if it does diagonalize and equal the matrix A. [2]

Proof: Implementing the approach described above:[3]

Taking a simple example and calculating the SVD from the beginning:

Taking a simple matrix like A:

The first step is to find the eigenvalues and put in descending order and when the square root is taken the singular values are left:

So the equation is:

Solving the quadratic equation we get: c1= 40 or 10 these are the eigenvalues.

Taking the square roots to find the singular values:

The next stage is to calculate Σ using the values above:[3]

Using the eigenvalues find the eigenvectors and then use them to find the eigenvectors in order to find V*.

So substituting the two values of c1 back into the ATA matrix:

For c1 = 40

Solving the equations in the matrix we get

Dividing through by the length:

Doing the same but using c1 = 10

Solving for x2 = x1

Dividing through by the length:

The final few steps are to find U and to complete the proof:

Now to find U=AVΣ-1

So using the original formula that and using the values of U, Σ, and V* that has been calculated above we get:

And as -2x10-3 is basically equal to 0, we have actually found the original matrix A which was:

Which shows that the proof works.

Example in MathCAD:

### Lecture 8: DDR

Dimension reduction is a process of reducing the number of variables under consideration and there are two ways of doing this feature extraction or feature selection.

Feature selection is about choosing a subset of the feature there are so choosing the ones that are more informative and is used in machine learning for building robust learning models.[1]

Feature extraction is about creating a subset of new features by combining the features that already exist.[1]

Example in class:

If there is a set of data like in the graph below and it has been heavily corrupted by noise and there is really only one set of data that is of interest to us then the decorrelation and dimension reduction is one of the best ways to find the one signal that you want, as will be seen in the example:

MathCAD:

Looking at the data it's not possible to tell which signal is of interest to us.

So we first find the covariance matrix: To find the eigenvalues and the eigenvectors,

The next step is the most important:

It is the original matrix being multiplied by the eigenvectors so this results in:

So from the graph it can be seen that one signal in particular is the most clear and is separate from all the rest. This is the principle vector and it is the same size as the original matrix and the rest of them are just noise.

This can be shown by printing the mean of each of the signals most of them will be close to 0 and the one that is the real signal will be different from the others:

It can be seen that only one of the answers is really different to the rest and is much larger and this represents the real signal that is seen on the graph. This also shows how effective that this method is it has removed nearly all the noise.

This works because when looking at physical measurements you can only really see the shadow of what is actually happening and in 2D of a generating process we are actually working in a much higher number of dimensions that can't be seen. So what this method does is reprocess the higher dimensions that generate the shadow and then analyse the result.

### The key stage mentioned above:

Is actually a projection so the result is being projected onto the 8 orthogonal axes.

### Data Dimension Reduction:

There are many methods for being able to carry out dimension reduction the method above seems to be one of the most simple and effective ways of doing this.

Some other methods that have been looked at involve dimension reduction by random projection and Latent Semantic Indexing or projection based techniques to reduce the dimension like PCA.

The need for the retrieval of multidimensional data has been increasing over time and techniques to retrieve the information as accurately and as efficiently as possible have been investigated. And ideally what would happen is that the data could be reduced efficiently into a lower dimension and it should have all the properties of the original data. However in reality this does not happen and information is lost so to develop a technique that can ensure that this won't happen is very important. And one of the methods that can do this is like the one used in the example where the data is projected into the dimensions. But looking at some other methods: [2]

Singular Value Decomposition is a good and quite often used technique in dimension reduction and so in SVD as discussed in the previous section a matrix A of m by n can be decomposed into three matrices of[2]

And U and V are orthogonal which have the right and left vectors of A and Σ is the diagonal matrix that has the singular values of A and if these singular values are arranged in a way so that they are in descending order the SVD can be used to project the data onto a lower k dimensional space determined by the singular vectors relating to the k largest singular values. So it becomes:[2]

The matrix in the equation above ends up being an approximation of the original A with a rank of k. And SVD is basically rotating the axes to maximize the variance in the dimensions which contain the most important information about the data.[2]

Latent Semantic Indexing uses a similar idea but what it actually does is keep the first largest singular value and get rid of the rest and it's because this method can keep some of the similarities with a small k in comparison to the original dimensionality. [2]

If the vectors are orthogonal then the similarities between the original vectors can be preserved exactly.

### Principle Components Analysis:

PCA is the oldest method in multivariate analysis and can be known as the Karhunen-Loève transform and it went through a lot changes before finally being generalised by Loève.[3]

PCA uses the eigenvectors of the covariance matrix to find the independent axes of the data using Gaussian assumption or in some situations it can just de-correlate the axes but can face some problems as it does not take into account class separation but just carries a coordinate rotation.[3]

The aim of dimension reduction really is to map the higher dimensional data on to a lower dimensional space so that it has no losses. The PCA method is a linear technique that tries to keep the variance and relies on the solution of an eigenvalue problem involving the covariance matrix. PCA is good for keeping the original structure but it cannot save the information that keeps locality of the data.[3]

So what it does is if there is a set of data the dimensionality reduction will try to produce some data that is as close to the data X as it can be but will have a lower dimension. This method involves finding the covariance matrix of the data and then the next stage is to calculate the eigenvalues and the eigenvectors. The covariance matrix if the data is plotted will provide information on the pattern of the data and if the eigenvectors are plotted it should show how they are linked. So completing this stage what happens is you are simply finding what characterises the data. The eigenvector that has the highest eigenvalue is the most important component of the data. Then they can be ordered from the highest to the lowest then the values that are not of any importance can then be ignored so much data is not lost: So if there are n dimensions originally and n eigenvectors and eigenvalues are found then only the first p eigenvectors can be chosen and the final data set will end up having p dimensions.[3]

The next stage to this method is to find the feature vector and this is basically found by looking at the eigenvectors that is of interest to you and then a matrix is formed with these.

The last step in this method is to take the transpose of the feature matrix and multiply on the left of the original data set. This will basically give the original data but depending on the vectors that were kept later on. And if the data was orthogoanl then this will mean that the result is a lot more efficient. So the new data is set in the terms of the eigenvectors chosen and this means the dimensions have been reduced. So really the result is the original set of data that has been rotated with eigenvectors are the axes now. And this why no data is lost in the decomposition.[3]

Deriving PCA proof: [4]

A is a L dimensional vector presented as a column vector and this method is about trying to find an L x L orthonormal transformation matrix Q so:

And Cov(Y) is a diagonal marix

So then it can be seen that:

And so of Q is a L x 1 column vector so that

And cov(Y) is:

### And substituting the equations

And looking at the equation, represents the eigenvector of the covariance matrix and by finding the eigenvectors the projection of the matrix Q can be found. [4]

The Karhunen-Loève transform seems to be the closest method to the one which we have implemented in the examples above what is happening is that it basically decorrelates the components. And the proof for it can be seen below:

KLT Proof:[5]

If there is an (M + 1) stationary mean signal vector with the covariance matrix being: is defined as:[5]

This is the KLT transform

And V is a unitary matrix of eigenvectors of R and the formula for the eigenvalue is:

is the diagonal

The components of are the principle components and can be shown as:

This is kind of filtering y(n) and so the vectors vi can be called eigenfilters. The principle components are orthogonal.

So what's happening is that the KLT decorrelates the components of the vector y (n) and the eigenvalues of R are the variances and so the total variance is:

So using the trace property:

The inverse KLT is performed and so:

And it can be written as the sum of the individual principle components:

And the first few components are the ones that account for the total variance and so:

And if the eigenvalues that are ignored are small then the result should be the same as the original signal.[5]

### Week 9: Neural Networks

Neural networks usually are referring to a network or circuit of biological neurons but are also used to refer artificial neural networks made of nodes. Artificial intelligence and tries to simulate the properties of neural networks as closely as they can however are more about building mathematical models.[1]

Usually modelling of equipment and algorithms consists of computer programs that can be very complicated designed for specific applications but an alternative method is to use a neural network it is a branch of artificial intelligence and are non-linear computer algorithms that have feedback and can model the behaviour of complex nonlinear processes.[1]

These models are simplified versions of the actual networks and usually consist of an input layer, a hidden layer and an output layer:

These systems can be used to calculate any value of x and y even if the system is complicated and horrible it does the things the brain is good at doing and even though it's not really for maths it is a very effective method.

The layers indicated above are key, so there is an input layer such x and y. The output layer is what sums everything up through some weighting function and releases it:

The output is linear

The hidden layer is the most interesting in that it feeds the inputs after summing through some non-linear function:

Most of the units in the networks transform the inputs using functions that are called activation functions and some of the functions that are commonly used are:

Hyperbolic functions like tanh(x) which is

Using the Taylor series to expand this is:

Where the O is added on in the order of to the power of ten

Graph of tanh(x):

Another function that is mainly used is the sigmoid function. A sigmoid function basically produces a sigmoid curve: which is an "S" shape. These functions relate to the special case of the logistic function and include the hyperbolic tangent, arc-tangent and the error function. [2]

Sigmoid functions are often used in neural networks to introduce nonlinearity in the model or to make sure that certain signals remain within a specified range. [2]

A reason for its popularity in neural networks is because the sigmoid function satisfies this property that is

Because

Inputs that are entering get multiplied by weights and also be the characteristic equation of the neuron and the function seen above is a non-linear transfer function that helps with making the outputs more reasonable. So the non-linearity is quite important as if it wasn't then the inputs would get multiplied by the same proportions and this would mean problems for the whole system as it could cause it to drift and the non-linearity helps to isolate the input pathways.[2]

Expanding the function using the Maclaurin series:[3]

The Maclaurin series is

Linear Relationship

If you have a graph like the one below:

And the red dots represent apples and the orange dots represent oranges and what we would like to know whether we have an apple or an orange the way to do this is by just by simply separating the information using a line and with the information above this is easily done:

And for situations like this it is not necessary to use neutral networks it is too simple to apply to something like this. Even though neural Networks are not used for systems like this it could be implemented if required so only one weight would be required along with a bias weight with an output neuron.

However if the situation changed and was like the one below:

And to try and separate this is a lot more difficult the line required is shown below:

It will require a non-linear function to be able to separate the information and it is in situations like this that neural networks can work most efficiently and accurately.

This will work in any dimension also, so for 2D it can be seen that a line will separate the required information, in a 3D situation there will have to be a plane between the solutions, for a 4D there will have to be a hyper plane and for 5D a hyper hyper plane and so on.

### Example:

We need the NN to implement

Using tanh(x)

This is the non-linear function we will be using. The expansion can be seen above and we want the term and the series in infinitely long so we need as many neurons as there are terms in the series. However for this example we only require two neurons.

The two input weights will be {a, b} the outputs will be weighted by {A, B} and they will be summed before reaching the output.

The next part is to calculate the Vandermondian matrix:

The comes about from the fact that we only the term so the first term in the tanh(x) series is put to zero and the second term is multiplied to cancel out the -1/3 so that we are only left with the term we want.

The Vandermondian matrix is a matrix in which the first row has all values of 1, and entries in the ith row is the corresponding entry in the second row to the (i-1) power. This kind of matrix can be used to evaluate a polynomial such as (a0 + a1x +a2x2 +...+an-1xn-1) at a set of points and it actually transforms the coefficients of the polynomial to values the polynomial takes at the point of ai.[4]

Plotting the curves of that we want and also of the equation worked out using the matrix above the results show that the curves are identical showing that the neural network is accurate.

Picking two values for a and b:

Example 2: If we however want to find using the tanh(x) then:

Looking at the expansion of tanh(x) it can be seen that there is no so to actually find the term we put the first term to zero and simply divide the term by so to do this as can be seen above is formed and what's actually happening here is that we are effectively differentiating.

So

And as can be seen from the graph plotting and the function they fit perfectly over each other proving they are the same and so the NN function implemented actually does work.

Example 3:

For the NN diagram is:

Example 4:

NN diagram for

Homework Example 5:

Draw a NN diagram for and solve for all variables:

Using

Looking at f(x) to form the NN circuit five neurons will have to be used. Two will be needed for and three for. So it will look like:

So:

### In MathCAD:

Neural networks are mainly used for things like classification, modelling non-linear equipment and for many other processes. They are basically computer algorithms that can learn patterns and it is this that makes them well suited for modelling complex and complicated non-linear processes. They are used a lot in mathematics and can have many different architectures designed to solve particular problems. So it is a very powerful mechanism to compute any function and it opens up a huge amount of flexibility.

### Lecture 10: Universal Non-Linear Approximation by Neural networks

Any operation can be written as a polynomial and so by taking a few terms you can get a good fit by taking a pade approximate.

Pade approximate is basically the ‘best' approximation if a function by a rational function of a certain order, and they are used a lot in computer calculations.

This should link with f(x) and to the highest order which:

If the function of x is a horrible function then what can be done is so dividing two polynomials you can get a pretty good approximation and then create an infinite series with it.

So if there is a nasty function it can be measured and fit with a high order polynomial or it can just be fed into a simple regime and approximated as and then implemented into a Neural Network.

So if using then it's known that there will be four neurons required and if it's to the term then the derivative of will have to be found which is the third term and so you need three neurons.

If the function is continuous then only one hidden layer is required a neural network is a pattern classifier so you can put in a load of numbers and the NN will have a binary output. So for example if it is applied to a stock market situation a NN would tell you whether to buy or sell. Some situations will have multiple hidden layers however most situations do NOT require more than two hidden layers.

If you have a situation like the one below:

Taking the example of the stock market, the ‘buy' choice is represented by the red dots and the ‘sell' choice is represented by green dots:

Then it can be seen quite simply that the function that we want is:

So all that has happened is that we have constructed a polygon to separate the two different sets of data, so what would be happening in this is that all the lines that are intersecting and forming the polygon are the solutions that we want. So this will require just one hidden layer because it is continuous and from the shape above the solution would be to ‘buy'.

However if the function is discontinuous like a gearbox of an automatic car which is perfect for a NN system then it will require a second hidden layer that takes in the features and maps them.

An example of a discontinuous function is shown above

However if you have a situation where you do not know whether the function is continuous or discontinuous then the process that is required is that you set up a Hopfield network where every node is connected to every other node and then you add more nodes than required and the things that you do know are the number of inputs and what the outputs are. What we don't know is how many layers are required and how many neurons are within those layers. So you set up the Hopfiels network then you can use a training algorithm and use a back propagation learning algorithm so you have a big set of examples and then if you feed into it and put in the answers we require and the network is told how many inputs and outputs so the backbone adjusts the weights and then if you keep entering new data so the NN calculates again the weights you carry on entering information end up with thousands of sets of data and the NN will then adjust the weights as optimally as it can and simply remove the weights that's are nearly zero and so you will be left with the information you require.

So the input layer: maps the feature vectors to hypercube

The hidden layer 1: defines the hyper planes

The hidden layer 2: takes the hyper plane and puts together with the polyhedral

The output layer: gathers together the polyhedral and defines the parameters.

There are many different learning algorithms and most the one mainly used is the back propagation algorithm. The NN is fed data so that it can implement non-linear mapping with the pre-specified wanted output in the data. The back propagation algorithm tries to minimise the error and uses optimisation like gradient decent. So when the network is learning the weights are moved in a direction that is opposite to the gradient and by adding new sets of data so there the NN has several different inputs and outputs to investigate will improve the performance of the training process.[1]

### References:

### Lecture 1:

[1] A. Kaw. Introduction to Numerical Methods. Available: http://numericalmethods.eng.usf.edu

[2] Wikipedia. Discounted Cash Flow. Available: http://en.wikipedia.org/wiki/Discounted_cash_flow

[3] J. H. Webb. Have we caught your interest? Available: http://143.53.36.235:8080/LectureNotes/Interest.pdf

[4] I. dictionary. Interest. Available: http://www.investordictionary.com/definition/interest.aspx

[5] Wikipedia. Binomial Theoram. Available: http://en.wikipedia.org/wiki/Binomial_theorem#cite_ref-0

[6] D. University. The Maths Forum. Available: http://mathforum.org/dr.math/

### Lecture 2:

[1] R. Waterman. The continuous compounding formula derivation. Available: http://www-stat.wharton.upenn.edu/~waterman/Teaching/IntroMath99/Class04/Notes/node13.htm

[2] J. H. Webb. Have we caught your interest? Available: http://143.53.36.235:8080/LectureNotes/Interest.pdf

[3] Wikipedia. Rule of 72. Available: http://en.wikipedia.org/wiki/Rule_of_72

[4] Wikipedia. Taylor Series. Available: http://en.wikipedia.org/wiki/Taylor_series

[5] M. F. Cooperation. Advantages of Early investment. Available: http://www.mackenziefinancial.com/calc/jsp/EarlyInvesting/EarlyInvesting1.jsp

[6] Hardsoftware. Excel UDF to calculate UK Repayment Mortgages. Available: http://haresoftware.com/xlent_udf_repaymentmortgage.htm

### Lecture 3:

[1] K. Nordlund. (2006, Basics of Monte Carlo simulations. Available: http://www.docstoc.com/docs/8386111/Monte-Carlo-Simulation-Basics

[2] Stroud, et al., Engineering Mathematics, 4th ed.: MacMillan Press LTD.

[3] Wikipedia. Differential equation. Available: http://en.wikipedia.org/wiki/Differential_equation

[4] H. College, "Monte Carlo Techniques."

### Lecture 4:

[1] Wikipedia. Matrix (mathematics).

[2] Unknown. Introduction to matrix Algebra. Available: http://www.vanderveld.nl/teaching/2005-2006%20RM%20MVA/02.%20Introduction%20to%20Matrix%20Algebra.doc.

[3] Stroud, et al., Engineering Mathematics, 4th ed.: MacMillan Press LTD.

[4] Thomas and Emma. Matrix Transformations. Available: http://www.maths.surrey.ac.uk/explore/emmaspages/option3.html

### Lecture 5 and 6:

[1] Unknown(William). The Discrete Fourier Transform. Available: http://www.cs.unm.edu/~williams/cs530/dft.pdf

[2] Pisarenko, et al., "The Retrieval of Harmonies from a Covariance Function," Geophys. J. R. ustr. SOC, pp. 347-366.

[3] A. Dictionary, "stochastic," ed.

[4] Wikipedia. Euler's formula.

### Lecture 7:

[1] Unknown. Available: http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

[2] J. S. Hourigan and L. V. McIndoo. The Singular Value Decomposition. Available: http://online.redwoods.cc.ca.us/instruct/darnold/laproj/fall98/jodlynn/report2.pdf

[3] D. E. Garcia. Singular Value Decomposition (SVD)A Fast Track Tutorial.

### Lecture 8:

[1] R. Gutierrez-Osuna. Introduction to Pattern Recognition.

[2] J. Lin and D. Gunopulos. Dimensionality Reduction by Random Projection and Latent SemanticIndexing.

[3] L. I. Smith. (February 26, 2002, A tutorial on Principal Components Analysis. Available: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[4] Wikipedia. Principal component analysis. Available: http://en.wikipedia.org/wiki/Principal_component_analysis#Derivation_of_PCA_using_the_covariance_method

[5] S. J. Orfanidis. SVD, PCA, KLT, CCA, and All That.

### Lecture 9:

[1] Wikipedia. Neural Networks. Available: http://en.wikipedia.org/wiki/Neural_network

[2] S. Kudesia. (12/10/2005, Artificial Intelligence - sigmoid function. Available: http://en.allexperts.com/q/Artificial-Intelligence-2506/sigmoid-function.htm

[3] Wikipedia. Taylor Series. Available: http://en.wikipedia.org/wiki/Taylor_series

[4] Wikipedia. Vandermonde matrix. Available: http://en.wikipedia.org/wiki/Vandermonde_matrix

### Lecture 10:

[1] A. Booth. Using Neural Networks to Improve Behavioural Realism inDriving Simulation Scenarios