Advice: Roundoff error

"Real" numbers are generally represented in computers using floating-point arithmetic, which is similar to so-called "scientific notation". While most computers use a base-2 system in their "hardware floats", Maple uses base 10. Thus a nonzero floating-point number could be represented in the form (+ or -) r*10^k where k , the exponent, is an integer and r , the mantissa, is in the interval .1 <= r*`<`*1 and has some number d of decimal digits. Actually Maple's internal representation of the number is Float(a,b) where a = r*10^d and b = k-d are integers. If the number is produced by floating-point arithmetic, then d is at most the value of the environment variable Digits at that time.

Roundoff error is caused by the fact that numbers are generally not representable exactly with Digits digits, and thus must be approximated by numbers that are represented in this way. The procedure that is used, known as rounding, determines the closest number that can be represented. In particular, the result of an arithmetic operation on floating-point numbers generally requires rounding. Suppose, for example, you were using Digits=4 , and you wanted to multiply pi (represented to 4 digits as .3142*10^1 ) by 76.54 ( .7654*10^2 ). You calculate .3142*.7654=.24048868 , and round this to the four-digit number .2405 . Thus the result, after taking care of the powers of 10, is 240.5 ( .2405*10^3 ).

> Digits:= 4: evalf(Pi*76.54);


Maple's built-in functions are also subject to roundoff error when they operate on floating-point numbers. In most cases what is computed is an approximation to the function (using a few more digits than Digits), and then the result is rounded. This usually results in the correct rounded value of the function, but once in a while it may be off by a digit or two. In the example below, the correct value with Digits=14 would end in 27, but the computed value ends in 31.

> x := 1.5707963267949:


> evalf(tan(x),20); evalf(%,14);



The error in a calculation can be measured in two ways: the absolute error is the magnitude of the difference between the true value and the approximation, while the relative error is the absolute error divided by the magnitude of the true value. In many cases, the relative error is the more useful (of course, it is not much use when the true value is 0).

Multiplication and division do not cause too many problems of accuracy, because they have little effect on relative error. For addition and subtraction, on the other hand, there are two important problems to note.

The first is that when two numbers of very different magnitudes are added, many of the digits of the smaller one will have no effect on the result. This means that calculating a sum of very many, very small quantities can lead to very inaccurate results.

> Digits:= 4: add(4./9000,i=1..900);


The correct result, of course, would be .4 . But when the total gets to .1, each addition of 4./9000 = .000444... with Digits = 4 has only the effect of adding .0004. Even more strikingly, when the total gets to 1, further additions have no effect at all.

> add(4./9000,i=1..9000);


The second problem is that subtraction of two nearly equal quantities will result in a large relative error. For example,still with Digits = 4 :

> 4.322-4.321;


This in itself is a completely correct calculation, with no roundoff. But if one or both of the numbers 4.322 and 4.321 are approximations for other numbers, the relative error in the difference may be very large. For example, the true values might be 4.3216 and 4.3214 respectively, so the true difference is .0002, for a huge relative error of 4. One place where this can cause unexpected problems is in the standard formula for solving a quadratic equation:
r[`-`] = (-b-sqrt(b^2-4*a*c))/(2*a), r[`+`] = (-b+s...

Suppose b is positive, and large compared to a and c . Then sqrt(b^2-4*a*c) will be nearly b . Although it's fine for r[`-`] , the formula can produce a relatively very inaccurate r[`+`] . We'll try it for a = c `` = 1 and b = 130, 13000, 1300000 with Digits = 10 .

> Digits:= 10:
r[`-`]:= b -> (-b - sqrt(b^2-4))/2:
r[`+`]:= b -> (-b + sqrt(b^2-4))/2:
blist:= [130., 13000., 1300000.]:

Here are the results for r[`-`] at these b values:

> map(r[`-`],blist);

[-129.9923072, -12999.99992, -1300000.000]

Using fsolve with Digits = 20 confirms that these are accurate to nearly 10 digits (the final digits of r[`-`](130) and r[`-`](13000) are off by 1).

> Digits:= 20: map(b -> min(fsolve(t^2+b*t+1,t)),blist);
Digits:= 10:

[-129.99230723708768288, -12999.999923076922622, -1...
[-129.99230723708768288, -12999.999923076922622, -1...
[-129.99230723708768288, -12999.999923076922622, -1...

However, the r[`+`] results are much poorer in relative accuracy.

> map(r[`+`],blist);

[-.769275e-2, -.75e-4, 0.]

According to fsolve , we have five correct digits for b = 130 , one for b = 13000 and none for b = 1300000 .

> Digits:= 20: map(b -> max(fsolve(t^2+b*t+1,t)),blist);
Digits:= 10:

[-.76927629123171163209e-2, -.76923077378243064103e...
[-.76927629123171163209e-2, -.76923077378243064103e...

A better formula for r[`+`] in this case exploits the fact that the product of the two roots is 1.

> s:= unapply(normal(1/r[`-`](b)),b);

s := proc (b) options operator, arrow; -2*1/(b+sqrt...

> map(s,blist);

[-.7692762912e-2, -.7692307736e-4, -.7692307692e-6]...

There is no subtraction of nearly equal quantities here, and the results have 10, 9 and 10 correct digits respectively.

See also: Automatic simplification and evalf , Digits , evalf , Exact vs. floating-point computations , Forcing floating-point arithmetic , The meaning of Digits

Maple Advisor Database, R. Israel 1997