[ Home : Programs | libCVD | Hardware hacks | Publications | Teaching | TooN | Research ]

Floating point arithmetic in C++ templates.


The template mechanism in C++ allows classes and functions to be templated on arbitrary types and integer values. It does not allow templating on floating point values. There are many good reasons for this, and they generally relate to the lack of guarantees about floating point arithmetic. As a result, subtle differences (for instance in the compiler version, the CPU settings, etc) could result in different classes being instantiated and different code being compiled. Even worse, CPUs such as the x86 have one part internally which works in 80 bit precision (the x87 unit), and another part which works internally in 64bit precision (the SSE2 unit). If the compiler were to use both units (gcc can compile code to use both units), then even the same operations in the same bit of code could produce different results, and cause different templates to be instantiated. There are probably other good reasons too.

But sometimes, you want something baaaaad. Sometimes you want floating point templates, and all the hassles that might come with them. I did. And now you too can experience all the hassles too (actually there are rather fewer hassles, since the implementation of the arithmetic is guaranteed to be the same, whatever the compiler).

And, of course, there's the main reason: for fun!

Floating point representation

In my scheme, a floating point number is represented by a class with the following members:

struct this_is_a_floating_point_number
    static const unsigned int mant = ???;
    static const int          expo = ???;
    static const bool         posi = ???;
And the resulting number is (2*posi-1) * mant * 2 expo. Note that IEEE floating point numbers are stored at (1.)xxxxxxxxx (the 1. is implied), where as in my scheme, this is stored explicitly as an integer. I did also experiment with having a signed mantissa, but that ended up being more trouble than it was worth. Further, there is no current representation of additional values, such as inf and nan. These could easily be added to the floating point representation, but extra logic would have to be added to every operation.

Making floating point types

Floating point types can be constructed by hand, or they can be generated by several types templated on several integers. These are: There are also several predefined values in the Nums namespace:

Simple unary operations

In general, multiplication by an integer power of 2 is easy and accurate (only addition to the exponent is required). This result is used extensively in later algorithms.

Binary numerical operations


A normalized number is one where the highest bit of the mantissa is set. Most operations normalize numbers before operating on them. Normalization allows the maximum precision to be used in computations.


This is the most complex of the basic operations. After normalization, the number with the smallest exponent is denormalized so that the exponents of the two numbers are the same. Then the mantissa's are added, taking in to account the sign bit. All the cases of sign and relative sized of numbers have to be taken in to account. There's also a bunch of extra logic in there to prevent shifts that are too large, otherwise the compiler prints out a large number of warnings.

Division, reciprocation



Division is performed using reciprocation and multiplication. Reciprocation works by first denormalizing the reciprocee so that the mantissa is only takes up half of the integer used to represent it. Then integer division is performed between the mantissa of a normalized one and the denormalized mantissa. This gives a (quite poor) approximation of the reciprocal (with suitable arithmetic on the exponent). Then, some steps of Newton's method are used to refine the answer.

If we wish to find x such that 1/x - a = 0, then by Newton's method:

xn+1 = 2xn - ax2

Which conveniently does not require any more divisions. The final result seems accurate enough.

Error checking

The template code is structured in such a way that a sensible (ish) error message is produced with gcc, at any rate. It works like this: As a result, for the expression:
print<divide<fp<1>, fp<0> > >();
the error messages will look like this (highlighted for clarity):
divide.h: In instantiation of 'FPTemplate::divide_error<FPTemplate::fp<1, 0, true>, FPTemplate::fp<0, 0, true>, 0>':
fp_template.h:58:   instantiated from FPTemplate::divide<FPTemplate::fp<1, 0, true>, FPTemplate::fp<0, 0, true> >
fp_template.h:58:   instantiated from void FPTemplate::print() [with T = FPTemplate::divide<FPTemplate::fp<1, 0, true>, FPTemplate::fp<0, 0, true> >]
ex2.cxx:13:   instantiated from here
divide.h:76: error: no type named 'divide_by_zero' in 'struct 
   FPTemplate::error_check<FPTemplate::fp<1, 0, true> >'

Binary tests

All these tests have a static boolean member, val with the result in. The tests are:

Elementary functions

Exponentiation: exponentiate<num>

This function works using a (rather inaccurate) algorithm, and is more proof-of-concept than anything else. This function uses the well known Taylor series for calculating ex:

ex = 1 + x + x2/2! + x3/3! + ...

A Taylor series with a fixed number of terms is only accurate close to the point of expansion. In order to ensure x is small, I use the following result:

ex = (...(((ex/2n)2)2)...)2

conveniently Where n is an integer. Dividing by 2n is easy and accurate. Repeatedly squaring the result makes any errors much worse.

Square roots: square_root<num>

sqrt(a) is done using Newton's method:

xn+1 = ( xn + a/xn)/2

To get a good starting point, I use the following result:

x = sqrt(b * 2m)
x = sqrt(b) * 2m/2
x = sqrt(2*b) * 2(m-1)/2

Where 1 < b < 2. The approximation:

sqrt(a) = 2m/2
sqrt(a) = 2(m-1)/2

is accurate to within a factor of 2 and can be easily be computed. This serves as a good enough starting point that convergence requires only a few iterations.

Error checking for negative roots is performed in a similar manner to division. A typical error looks like:

sqrt.h: In instantiation of 'FPTemplate::sqrt_error<FPTemplate::fps<-1>, 0>':
fp_template.h:58:   instantiated from FPTemplate::square_root<FPTemplate::fps<-1> >
fp_template.h:58:   instantiated from void FPTemplate::print() [with T = FPTemplate::square_root<FPTemplate::fps<-1> >]
ex2.cxx:16:   instantiated from here
sqrt.h:53: error: no type named 'attempting_negative_square_root' in 'struct 
   FPTemplate::error_check<FPTemplate::fps<-1> >'

Interfacing with run-time

The most useful interfacing is done with the TO_FLOAT and TO_DOUBLE macros. In a positive exponent, a set first bit corresponds to a multiplication by 221, the second bit by 222, the third by 223, etc... The macro evaluates to an expression which computes this. Because all the values are determined statically, compilers (gcc, for instance) will compute the expression at compile time, resulting in a single floating point literal in the final compiled code. In certain tasks (for instance convolution), having the constants embedded in the code yields a large speed increase on modern processors, since the constants can come out of the code cache, and the data can come out of the data cache. This then effectively doubles the bandwidth in to the processor. Speed increases of nearly a factor of two have been observed.

The function print<num>() is also provided. This prints out details of the contents of the argument.


Compile time speed

In a word, atrocious. I mean bad. Really bad. So bad, that it gets utterly destroyed by a BBC micro. I know interpreted languages (and especially functional ones) are not always known for speed, but even so. This gets about 8 flops on an Opteron 2.6. That's 8 FLOPS, not MFLOPS or even KFLOPS, but real, old fashioned, single digit flops. I think that even the Z3 would give this system a run for its money.

In practice, the system is really too slow to be useful, unless you don't mind half hour compile times. Debugging under those circumstances becomes a little tedious.

Run time speed

Fantastic. This code (conv_kernel.cxx) generates a function which performs one element of a Gaussian convolution. The kernel is computed entirely at compile time, and the template recursive function calls all become inline code. The convolution is done with integer arithmetic for speed. In the annotated assembly code (gcc -O3, v. 4.0.0, Xeon), it can be seen that the convolution results in very efficient code, the data is being pulled from the data cache, and the convolution kernel comes from the code cache. Further, because the compiler knows the multiplications at compile time, it can determine whether it is faster to use the imull op-code or abuse the leal op-code in the addressing unit.

It is instructive to compare the assembly output of hand_coded.c to the output of conv_kernel.cxx, side by side. Note that the template generated kernel produces shorter code than the hand coded example.

The two (numerically identical) convolutions were tested for speed on a 2.4 GHz Xeon, with an array length of 1000, 100,000 convolutions per run, results averaged over 100 runs.
Template basedHand coded
Mean time (s)1.8281.844
Standard deviation0.01670.0243
In other words, the template generated code is about 1% faster.

The computed convolution parameters are:

3 12 36 88 181 318 474 603 653 603 474 318 181 88 36 12 3

which can be seen graphically here. The code to generate the convolution kernel takes 7 minutes to compile on a Xeon 2.4.


Not stellar, but I haven't used the best algorithms. This is more of a proof of concept than anything else. The algorithms could be replaced with better ones. Also, several algorithms are missing, such as log and the trigonometric functions. These could easily be implemented using similar techniques

Although the precision of some of operations is not all that good, the range of numbers is very large, since the exponent is a 32 bit signed integer. This allows numbers up to (232-1) * 2231-1.

Source code

What use would all this be without the code? Well that can be downloaded here and browsed here. example.cxx contains examples of all the features provided. conv_kernel.cxx is an example which contains some considerably more complex expressions.

Updated July 14th 2011, 10:52