Floating point data type

The floating point data type allows you to store numbers in floating point form.

This means that the numbers behave like numbers written in scientific notation: each value is made up of a significand (the digits of the number), a base, and an exponent.

Floating point numbers are particularly appropriate for physics and astronomy calculations, where results can be either very small or very large.

Formatting a floating point value lets you control the number of digits in the output and pad the output with spaces on the right or on the left.

BCD numbers are a better choice than floating point numbers for most applications. There are three principal differences between the two types:

Base 2 versus base 10

Floating point numbers are represented internally as binary (base 2) numbers. They provide precise representation of fractional numbers that are powers of 2 (1/2, 1/4, 1/8, 1/16, and so forth), but they do not provide precise representation of fractions that are powers of 10 (1/10, 1/100, 1/1000). Any fraction that can be precisely represented in base 2 can be precisely represented in base 10, but not vice versa. (There are, of course, many fractions that cannot be precisely represented in either base 2 or base 10—1/3 for example.)
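
The difference can be checked in any language that uses binary floating point. The following is a minimal sketch in Python rather than OmniMark. It assumes that OmniMark floats behave like ordinary binary floating point values, and it uses Python's decimal module only as a stand-in for a base 10 type such as BCD. All variable names are illustrative.

```python
from decimal import Decimal
from fractions import Fraction

# Fractions that are powers of 2 are stored exactly in a binary float.
binary_sum = 0.5 + 0.25 + 0.125
binary_exact = (binary_sum == 0.875)                # True

# 1/10 has no finite base 2 expansion, so 0.1 is only approximated.
decimal_sum = 0.1 + 0.2
decimal_exact = (decimal_sum == 0.3)                # False

# A base 10 type holds the same values exactly.
bcd_like_sum = Decimal("0.1") + Decimal("0.2")
bcd_like_exact = (bcd_like_sum == Decimal("0.3"))   # True

# The value actually stored for 0.1, as an exact ratio of integers.
stored_tenth = Fraction(0.1)

print(binary_exact, decimal_exact, bcd_like_exact)
print(stored_tenth == Fraction(1, 10))
```

The last comparison is false because the stored value is the nearest fraction whose denominator is a power of 2, not 1/10 itself.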

Limited size versus unlimited size

Floating point numbers are of a limited size and are represented by a fixed number of bytes of memory. BCD numbers, as implemented by the OmniMark BCD library, are of unlimited size.
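
As a rough illustration of what a fixed size implies, the following Python sketch (not OmniMark) assumes an 8-byte double, which is the usual floating point format. Python's decimal module again stands in for BCD. It is not truly unlimited, but its range is far wider than a double's.

```python
import struct
from decimal import Decimal

# A double always occupies 8 bytes, whatever value it holds.
small_size = len(struct.pack("d", 1.0))
large_size = len(struct.pack("d", 1.0e300))

# Past the largest representable double, the result overflows to infinity.
float_result = 1.0e308 * 10.0

# A decimal type with room to grow keeps the exact value.
decimal_result = Decimal("1e308") * 10

print(small_size, large_size)
print(float_result == float("inf"))
print(decimal_result == Decimal("1e309"))
```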

Floating point versus fixed point

Floating point numbers are limited in precision. As their name implies, the point that separates the whole number portion from the fractional portion floats: a fixed number of significant bits is distributed between the two portions. The larger the whole number portion of the number, the fewer bits are available for the fractional portion.
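
The trade-off between the whole number portion and the fractional portion can be observed directly. This Python sketch (not OmniMark) assumes a double with 53 significant bits. The specific values are chosen only for illustration.

```python
# Near 1.0 almost all of the significant bits serve the fractional portion.
small = 1.0
small_keeps_fraction = (small + 0.001 != small)     # True

# Near 1.0e15 only a few bits are left for the fraction,
# so a small fractional increment is lost.
medium = 1.0e15
medium_loses_fraction = (medium + 0.001 == medium)  # True

# At 2**53 every bit is used by the whole number portion,
# so even adding 1 falls below the resolution.
large = float(2 ** 53)
large_loses_one = (large + 1.0 == large)            # True

print(small_keeps_fraction, medium_loses_fraction, large_loses_one)
```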

Mixing floating point and integer values

You can mix integer variables and floating point variables in mathematical expressions. Thus, you can write:

  import "omfloat.xmd" unprefixed
  
  process
     local float   price initial { 6.37 * float 10 ** 3 }
     local float   total
     local integer quantity initial { 3 }
  
     set total to quantity * price
     output "Total = " || "d" % total || "%n"
  ;Output: "Total = 19110"

Note that if you perform an operation on two integers and assign the result to a floating point number, the operation will be done as an integer operation and only the result will be coerced to a float. Thus the following code produces an incorrect result, even though a float can hold the result of 1000000 * 2000000:

  import "omfloat.xmd" unprefixed
  
  process
     local integer large   initial { 1000000 }
     local integer larger  initial { 2000000 }
     local float   largest
  
     set largest to large * larger
     output "Largest = " || "d" % largest || "%n"
  ;Output: "Largest = -1454759936" (This is incorrect.)

In this case, the result of the integer operation large * larger will overflow before the coercion to a floating point number. The correct way to code this operation is to force one of the operands to float before the operation is performed. This causes the operation to be performed as a floating point operation, returning a floating point value:

  import "omfloat.xmd" unprefixed
  
  process
     local integer large   initial { 1000000 }
     local integer larger  initial { 2000000 }
     local float   largest
  
     set largest to float large * larger
     output "Largest = " || "d" % largest || "%n"
  ;Output: "Largest = 2000000000000" (This is correct.)
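
The incorrect value shown earlier is what you get when the product wraps around in a signed 32-bit integer. The following Python sketch (not OmniMark) reproduces both results. Python integers do not overflow, so the wraparound is simulated by wrap_int32, a hypothetical helper written for this example. The 32-bit integer size is an assumption inferred from the output shown above.

```python
def wrap_int32(value: int) -> int:
    """Reduce a Python integer to a signed 32-bit value (hypothetical helper)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

large = 1000000
larger = 2000000

# Integer multiply first, then convert: the product has already wrapped.
integer_first = float(wrap_int32(large * larger))

# Convert one operand first: the multiply happens in floating point.
float_first = float(large) * larger

print(integer_first)   # -1454759936.0
print(float_first)     # 2000000000000.0
```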

Supported operators

You can use the following operators with floating point numbers:

+
-
*
/
modulo
abs
ceiling
floor
round
truncate
<
>
<=
>=
=
!=
%

Handling floating point errors

In the event of an error in a calculation, the Floating Point library will return NaN. NaN means "Not a Number".

  import "omfloat.xmd" unprefixed
  
  process
     local float  total initial { 2.2 }
     local string foo   initial { "foo" }
  
     set total to total + foo
     output "Total = " || "d" % total || "%n"
  ; Output: "Total = NaN"
  ;   Note: "NaN" means "Not a Number"
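
A NaN value has two properties that matter when checking results. The following Python sketch (not OmniMark) shows them. It assumes that OmniMark's NaN follows the usual floating point rules, and it does not imply that OmniMark provides a function named isnan.

```python
import math

nan = float("nan")

# NaN propagates: arithmetic that involves it yields NaN again.
propagated = nan + 2.2

# NaN never compares equal, not even to itself, so an equality test cannot find it.
equal_to_itself = (nan == nan)          # False

# A dedicated test is needed instead.
detected = math.isnan(propagated)       # True

print(equal_to_itself, detected)
```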
