Representing integers and real numbers


Q. How many computer scientists does it take to change a lightbulb?
A. 1.99999999. One to change it and one to fix the rounding bug.

Q. How many computer scientists does it take to change 127 lightbulbs?
A. -128. 127 to change the bulbs and 1 to confirm there are no problems with the code.

Representing Whole Numbers (Integers)

See also discussion on Wikipedia Unsigned Integers

are simply positive whole numbers (0,1,2,3,4....). They are great for quantities, enumeration, counting and such like. And representing them is as easy as pie now we have our binary representation. The only thing to think about is how many bits we need to use..... and this depends on how big an integer we want to represent. Here are some numbers of bits and the largest number you can represent (we're starting at zero). The numbers should look familiar - it's the number of "things" we can represent (2^bits) - 1 (since we also need to represent zero):
Largest Unsigned Integer
In practice, most systems use 32 bit or 64 bit integers nowadays (but watch for the gotcha..... 4 billion isn't that much when you're counting memory addresses or budget deficits!)
Think about (in decimal) telephone numbers, IP addresses .... we constantly underestimate the capacity for growth especially in technology

Signed Integers get just a teeny bit more complicated. How do we represent -1, -2, -3.....? The obvious way is to "steal" a bit to use to signify sign. So here is a suggestion:
 1 00000001
 2 00000010
-1 10000001
-2 10000010
Well that's one way to do it. Note that we chop in half the number of positive integers we can represent now. What are the lowest and highest numbers you can represent this way, in 8 bits? What is the funny quirk in this system regarding zero? It turns out there are a few more problems with this way of representing signed integers, once we start adding and subtracting. Let's add two numbers in binary (decimal shown too):
  00001010       10
+ 00000011     +  3
  --------       --
  00001101       13
But what if we need to add a positive and negative integer?
  00001010       10
+ 10000011     + -3
  --------       --
  10001101      -13
Oops. That's why we need something else - and it turns out the something else is the 2's complement. To calculate the 2's complement of a number, we subtract it from 2^n, where n is the number of bits we are using. This makes the weight of the sign bit -2^n-1, rather than 2^n-1 for positive integers, or a non-valued -1 in 1's complement. A simple way to calculate the two's complement of a binary number is toflip all the bits and add 1. So to calculate our two's complement of 3:
  00000011   3
  11111100   BITS FLIPPED
  11111101   ADD 1
And this now becomes your representation for 3. How does this help us? Well, now our addition works!
  00001010       10
+ 11111101     + -3
  --------       --
  00000111        7
Note the discarded carryover. Note:
  • 2'sC of 00000000 is 00000000 (ignoring carryover)
  • The leftmost bit is still a sign bit; but watch what happens at +127, +128, +129...
+127  01111111
+128  10000000
+129  10000001
0     00000000
-1    11111111
-2    11111110
-3    11111101
-4    11111100
-127  10000001
-128  10000000
-129  01111111
So we "overlap" at +/- 128. By convention then, we "overflow" at +128, leaving -128 as a valid number. So we can now represent -128...+127.
What is 10000000 2's C?

Also note that if you add two numbers such as 63 + 65, you get the following:

  00111111       63
+ 01000011     + 67
  --------      ---
  10000010      130
But, in 2C 10000010 represents -126 and not 130. How does the adder in the hardware know whether the answer is indeed -126 or that something has gone horribly wrong (i.e., an overflow as occurred)? Notice the sign bits. We have added two positive numbers (which start with a 0) and obtained a negative number (which starts with a 1). That tells the adder that an overflow has occurred. A similar thing can happen while adding two negative numbers. If the adder notices the addition of two negative numbers (which start with a 1) has yielded a positive number (which starts with a 0), it realizes an overflow has occurred here as well.

So that's it.... just one gotcha: look out or terminology that might not be consistent (word, long, longword, longint, integer, etc) in terms of the number of bits

Here are some exercises.... figure out the signed integer (16 bit) representations for:
  • -14 (a very cold day)
  • 10 (the number of grants I'd like to win next year)
  • -40 (the temperature at which Fahrenheit and Celsuis are the same)

Representing Fractional Numbers (Real Numbers)

Here is some C code that assigns the number 0.37 to a variable, then tried to print it to 20 decimal places:
#include <stdio.h>
#include <math.h>
int main()
    float f = .37;
    printf("%.20f\n", f);
So, it should output, the following, right?
But here is the real output:
Here's some more code in C
#include <stdio.h>
#include <math.h>
int main()
    float index;
    for (index = 0.0; index != 1.0; index += 0.1)
        printf("%f\n", index);
What does it do??? What should it do????

We'll come to what is going on later, but let's start to look at what real numbers are, and how we represent them.

A real number can be an integer or a number which is between integers - i.e. it has a decimal point with digits after it. Real numbers can be rational (fractions such as 3/8 or 1/4, and integers) or irrational (can't be expressed as a simple integer fraction, such as pi). Both kinds of numbers can have infinite numbers of digits. Here are some real numbers:
Clearly this poses some challenges for representation (how do we represent an infinite number of digits???). Obviously we're going to have to live with approximation. The simplest way of representing these numbers is just to extend our binary representation to handle negative exponents (just like in Decimal - the digit positions represent 10^-1, 10^-2, and so on. So let's say we take an 8 bit (unsigned for simplicity) binary representation and extend it by an extra 3 bits to represent fractions. Now our bit positions will be:

But the problem with this is that with this fixed point binary representation, we're pre-determining the allocation of bits to after and before the decimal point. This isn't a generalizable system - we might want to represent 100000000000.1 or 0.000000000000001. The practical solution that is almost always used is to represent real numbers as floating point numbers that represent a given number of digits, but vary as to where the decimal point is placed, with an exponent multiplier being used to recover the original number. For example, if we were using a decimal representation with a fixed number of 15 digits, we might think of the above numbers as:
0.25000000000000  =  0.25000000000000 X 10^0
0.33333333333333  =  0.33333333333333 X 10^0
3.14159265358979  =  0.31459265358979 X 10^1
10356425.3281942  =  1.03564253281942 X 10^7
This is the basis of a tool available in mathematics called scientific notation. For example, the following are numbers represented in scientific notation:
Ordinary decimal notation
Scientific notation (normalised)
−0.000 000 006 1

Some real life examples:
  • Estimate of the number of stars in the observable universe: 9.0 × 10^21
  • Size of the universe: 8.8 × 10^26 meters
  • Radius of a Hydrogen atom: 2.5 × 10^-11 meters

This is effectively how we represent real numbers on computer, it's just we have to do it in binary. So in binary, we reserve a given number of bits to represent a "normalized" number (known as the mantissa, and incorporating potentially both integer and fractional position values) and more bits to represent an exponent that gives the original number when applied to the mantissa. Normally, the mantissa is normalized so that the most significant bit (i.e. leftmost bit) is a "1", meaning the decimal value of the mantissa is between 1 and 2 (IEEE standard). This number is just like regular binary, but most of the positions represent the values 2^-1, 2^-2 and so on. The only difference is that the base is 2, and the exponent is modified (with a bias) to enable positive and negative exponents. Also, since we can no longer do straight binary arithmetic, the requirement to use 2's complement is removed, so most floating point representations use a simple sign bit (although not all). So in general we have (where the sign value is -1 or +1):

value = (sign value ) x mantissa x 2 ^ (exponent - bias)

To make matters even more complex, when storing floating point numbers we use an implicit "leading most significant bit" of 1, as by the normalization method the first bit of the mantissa will always be 1. Also, we have some special cases where the exponent is all "1's" to indicate special conditions such as Infinity, -Infinity or NaN (not-a-number)

There are some standards for representing 32 bit ("single precision") and 64 bit ("double precision") numbers:

bits in sign
bits in exponent
bits in mantissa
total bits
So let's try a couple of examples: we'll work out the floating point representation (32 bit) for the following numbers. To make it easy we'll use the online converter.
  • 1, 2, 4, 8
  • 0.25, 0.5, 0.75, 2.0625
  • 0.1, 0.2, 0.3
What's going on?????? well fractions in base 2 are different than fractions in base 10!


Well you're probably very confused by now. But here are the pieces we just put together:
  • Real (fractional) numbers + positional representation = approximation for some numbers (ones composed of powers of the base are generally OK)
  • Real numbers in binary are done in the same way as decimal, except the column values after the decimal point are 2^-1, 2^-2... instead of 10^-1, 10^-2, etc
  • The numbers that can be represented exactly in decimal are different than those that can be represented exactly in binary (e.g. 0.1)
  • Using scientific notation makes representing big/small real numbers much more efficient (mantissa * base ^ exponent)
  • For IEEE floating point binary numbers, we always put in a 1.xxxx * 2^yyyy format, So we don't need to represent the "1." explicitly. The exponent representation adds a bias (2^n-1)-1 to enable simple representation of negative exponents

Floating point roundoff errors are something you will have to live with when working with Excel documents. Review Microsoft's instructions on workarounds.

Can we solve the riddle of the errors of the above code now? What should we have done? See the Golden Rules below.

There is a way of solving this problem: but it's really just too inefficient (forcing the computer to think like us!)

Revisiting Ariane 5

From the full report on the Ariane 5 explosion:

  • The launcher started to disintegrate at about H0 + 39 seconds because of high aerodynamic loads due to an angle of attack of more than 20 degrees that led to separation of the boosters from the main stage, in turn triggering the self-destruct system of the launcher.
  • This angle of attack was caused by full nozzle deflections of the solid boosters and the Vulcain main engine.
  • These nozzle deflections were commanded by the On-Board Computer (OBC) software on the basis of data transmitted by the active Inertial Reference System (SRI 2). Part of these data at that time did not contain proper flight data, but showed a diagnostic bit pattern of the computer of the SRI 2, which was interpreted as flight data.
  • The reason why the active SRI 2 did not send correct attitude data was that the unit had declared a failure due to a software exception.
  • The OBC could not switch to the back-up SRI 1 because that unit had already ceased to function during the previous data cycle (72 milliseconds period) for the same reason as SRI 2.
  • The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.
  • The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose.
  • The alignment function is operative for 50 seconds after starting of the Flight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5. Consequently, when lift-off occurs, the function continues for approx. 40 seconds of flight. This time sequence is based on a requirement of Ariane 4 and is not required for Ariane 5.
  • The Operand Error occurred due to an unexpected high value of an internal alignment function result called BH, Horizontal Bias, related to the horizontal velocity sensed by the platform. This value is calculated as an indicator for alignment precision over time.
  • The value of BH was much higher than expected because the early part of the trajectory of Ariane 5 differs from that of Ariane 4 and results in considerably higher horizontal velocity values.

The Patriot Missile Failure

On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dharan, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soliders. A report of the General Accounting office, GAO/IMTEC-92-26, entitled Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia reported on the cause of the failure. It turns out that the cause was an inaccurate calculation of the time since boot due to computer arithmetic errors. Specifically, the time in tenths of second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, lead to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds. (The number 1/10 equals 1/24+1/25+1/28+1/29+1/212+1/213+.... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100.... Now the 24 bit register in the Patriot stored instead 0.00011001100110011001100 introducing an error of 0.0000000000000000000000011001100... binary, or about 0.000000095 decimal. Multiplying by the number of tenths of a second in 100 hours gives 0.000000095×100×60×60×10=0.34.) A Scud travels at about 1,676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
The following paragraph is excerpted from the GAO report.

The range gate's prediction of where the Scud will next appear is a function of the Scud's known velocity and the time of the last radar detection. Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per hour). Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the system has been running, the larger the number representing time. To predict where the Scud will next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a less accurate time calculation. The effect of this inaccuracy on the range gate's calculation is directly proportional to the target's velocity and the length of the the system has been running. Consequently, performing the conversion after the Patriot has been running continuously for extended periods causes the range gate to shift away from the center of the target, making it less likely that the target, in this case a Scud, will be successfully intercepted.

See the link for this


  • Get the right size. Always check how many bits a given kind of variable (long, int, double, float, etc) uses in a particular language / operating system combination, and make sure that maximum and minimum numbers you want to represent are within the bounds of the variable type (using this table).
  • Use integers instead of floating points whenever possible (e.g. represent money in cents rather than dollars). If you have to use floats, use double precision.
  • Use floating point numbers very carefully. Use the best precision for the situation. Never use a floating point number as a loop counter. Avoid repeatedly adding or subtracting fractions a large number of times
  • Avoid mixing number types (signed, unsigned; integer, floating point; single precision, double precision) where possible. If you have to convert, do an explicit conversion even when it seems redundant (e.g. an abstract type such as "color")
  • Check for overflow situations. Always throw an exception in case of overflow (manual check if needed); otherwise everything can go really badly wrong (including divide by zero!)

More details

Floating point (Princeton)
Floating point basics
The trouble with rounding floating point numbers