**Advice: **
Roundoff error

"Real" numbers are generally represented in computers using floating-point arithmetic, which is similar to so-called "scientific notation". While most computers use a base-2 system in their "hardware floats", Maple uses base 10. Thus a nonzero floating-point number could be represented in the form
**(+ or -) r*10^k**
where
, the exponent, is an integer and
, the mantissa, is in the interval
and has some number
of decimal digits. Actually Maple's internal representation of the number is
**Float(a,b)**
where
and
are integers. If the number is produced by floating-point arithmetic, then
is at most the value of the environment variable
**Digits**
at that time.

Roundoff error is caused by the fact that numbers are generally not representable exactly with
**Digits**
digits, and thus must be approximated by numbers that are represented in this way. The procedure that is used, known as rounding, determines the closest number that can be represented. In particular, the result of an arithmetic operation on floating-point numbers generally requires rounding. Suppose, for example, you were using
**Digits=4**
, and you wanted to multiply
(represented to 4 digits as
**.3142*10^1**
) by 76.54 (
**.7654*10^2**
). You calculate
**.3142*.7654=.24048868**
, and round this to the four-digit number
**.2405**
. Thus the result, after taking care of the powers of 10, is 240.5 (
**.2405*10^3**
).

`> `
**Digits:= 4: evalf(Pi*76.54);**

Maple's built-in functions are also subject to roundoff error when they operate on floating-point numbers. In most cases what is computed is an approximation to the function (using a few more digits than Digits), and then the result is rounded. This usually results in the correct rounded value of the function, but once in a while it may be off by a digit or two. In the example below, the correct value with Digits=14 would end in 27, but the computed value ends in 31.

`> `
**x := 1.5707963267949:
evalf(tan(x),14);**

`> `
**evalf(tan(x),20); evalf(",14);**

The error in a calculation can be measured in two ways: the absolute error is the magnitude of the difference between the true value and the approximation, while the relative error is the absolute error divided by the magnitude of the true value. In many cases, the relative error is the more useful (of course, it is not much use when the true value is 0).

Multiplication and division do not cause too many problems of accuracy, because they have little effect on relative error. For addition and subtraction, on the other hand, there are two important problems to note.

The first is that when two numbers of very different magnitudes are added, many of the digits of the smaller one will have no effect on the result. This means that calculating a sum of very many, very small quantities can lead to very inaccurate results.

`> `
**Digits:= 4: add(4./9000,i=1..900);**

The correct result, of course, would be
. But when the total gets to .1, each addition of 4./9000 = .000444... with
**Digits = 4**
has only the effect of adding .0004. Even more strikingly, when the total gets to 1, further additions have no effect at all.

`> `
**add(4./9000,i=1..9000);**

The second problem is that subtraction of two nearly equal quantities will result in a large relative error. For example,still with
**Digits = 4**
:

`> `
**4.322-4.321;**

This in itself is a completely correct calculation, with no roundoff. But if one or both of the numbers 4.322 and 4.321 are approximations for other numbers, the relative error in the difference may be very large. For example, the true values might be 4.3216 and 4.3214 respectively, so the true difference is .0002, for a huge relative error of 4. One place where this can cause unexpected problems is in the standard formula for solving a quadratic equation:

Suppose
is positive, and large compared to
and
. Then
will be nearly
. Although it's fine for
, the formula can produce a relatively very inaccurate
. We'll try it for
and
with
**Digits = 10**
.

`> `
**Digits:= 10:
r[`-`]:= b -> (-b - sqrt(b^2-4))/2:
r[`+`]:= b -> (-b + sqrt(b^2-4))/2:
blist:= [130., 13000., 1300000.]:**

Here are the results for at these values:

`> `
**map(r[`-`],blist);**

Using
**fsolve**
with
**Digits = 20**
confirms that these are accurate to nearly 10 digits (the final digits of
and
are off by 1).

`> `
**Digits:= 20: map(b -> min(fsolve(x^2+b*x+1,x)),blist);
Digits:= 10:**

However, the results are much poorer in relative accuracy.

`> `
**map(r[`+`],blist);**

According to
**fsolve**
, we have five correct digits for
, one for
and none for
.

`> `
**Digits:= 20: map(b -> max(fsolve(x^2+b*x+1,x)),blist);
Digits:= 10:**

A better formula for in this case exploits the fact that the product of the two roots is 1.

`> `
**s:= unapply(normal(1/r[`-`](b)),b);**

`> `
**map(s,blist);**

There is no subtraction of nearly equal quantities here, and the results have 10, 9 and 10 correct digits respectively.

**See also:**
Automatic simplification and evalf
,
__Digits__
,
__evalf__
,
Exact vs. floating-point computations
,
Forcing floating-point arithmetic
,
__The meaning of Digits__

**Maple Advisor Database, **
R. Israel 1997