Archive for category For Fun
For Fun: Software Square Root
Source Code
ffi.sqrt.cpp
You can find a paper on this subject in the Programming Library which can be reached from the front page of IcarusIndie.com. What the paper doesn’t cover is the relavence of this function today. Quake 3 uses software to calculate the square root rather than using the sqrt of sqrtf function provided by C++. Three versions of the square root function are given in the provided FFI code:
1 2 3 4 | float Sqrt(float x) { return sqrtf(x); } |
DevPartner Profiler doesn’t want to profile the sqrtf function directly so we stuff it in a function so it gets profiled. This technically slows the function down a bit as calling a function isn’t a free operation. The stack is accessed when the function is entered, when sqrtf is entered and twice when the functions end. You’ll see later that this isn’t a close call anyway so it doesn’t matter.
1 2 3 4 5 6 7 8 9 10 11 12 13 | float Faster_Sqrtf(float f) { float result; _asm { mov eax, f sub eax, 0x3f800000 sar eax, 1 add eax, 0x3f800000 mov result, eax } return result; } |
This code is found in a Tricks of the 3D Programming Gurus book. SAR is “Shift Arithmetic Right” which is another way of saying it does a signed division by a power of two. It basically does the following:
1 2 3 4 5 6 | eax = f eax = eax - 0x3f800000 eax = eax / 2 eax = eax + 0x3f800000 result = eax return eax |
0×3f800000 is essentially a “magic number.” Why it works isn’t entirely understood. It just works. Well, kinda but we’ll cover that a little later. Next up we have what Carmack used in Quake 3 and what other games of that time used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | float xhalf; int i; float InvSqrt(float x) { xhalf = 0.5f*x; i = *(int*)&x; // get bits for floating value i = 0x5f3759df - (i>>1); // gives initial guess y0 x = *(float*)&i; // convert bits back to float x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy // x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy // x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy return 1.0f/x; } |
You’ll have to read the paper if you want a detailed analysis of this function. In my own testing I found that repeating the last line a couple times was as accurate as it was going to get. Since we only have 32 bits to work with, eventually you just run out of accuracy. With only one Newton step you were likely to get no decimal accuracy. With two, you would get maybe one or two decimals of accuracy. And with three you would get at most 6 decimals of accuracy. And of course with each additional step you end up taking longer to complete the function. I do the 1/x at the end so we return the actual square root and not the inverse. This makes it easier to compare results between sqrtf and this function. It also doesn’t add a significant amount to the cycle count to complete the function.
Quake 3: Arena was released on November 30th, 1999 which is nearly 5 years ago at the time of this writing. The minimum specs listed on ID’s web-site for Quake 3 are a “Pentium II® 300 Mhz or AMD® 350 Mhz K6®-2 processor or Athlon® processor.” It wasn’t until the Pentium III that SSE was introduced. Which was the same year that Quake 3 was released but it was bleeding edge. It wasn’t until later processors that things like the square root became more efficiently calculated in hardware. Until then, it was actually faster to use the Inverse Square Root function.
If you’re interested in seeing what functions instruction sets various CPUs added contain you can visit CPU ID. You can see that AMD added hardware square root with 3DNow which was created in response to Intel’s SSE which also was the first to have a hardware square root. SSE 2 has a double precision square root function.
Let’s take a look at the numbers collected on a Duron 1.2 Ghz system which has 3DNow.
1 2 3 4 5 6 | First the FasterSqrtf function used by the Tricks book. Sqrt - 330 FasterSqrtf - 426 The numbers are CPU cycles. The lower the number, the faster it is. |
As you can see, the Faster_Sqrtf is actually quite a bit slower.
1 2 3 4 5 | InvSqrt Sqrtf high accuracy 448 187 medium accuracy 386 247 low accuracy 349 226 low accuracy inv 337 226 |
The last entry is doesn’t do the 1/x. You can see that not many cycles were gained by returning the inverse of the square root. The built in sqrtf was also significantly faster either way. One final test was then run in Release Mode:
1 2 3 4 | InvSqrt 428 cycles InvSqrt* 321 cycles Faster_Sqrtf 267 cycles Sqrtf 107 cycles |
Finding the square root in software just slows things down now. The InvSqrt* indicates the test used the code as is. Only one Newton step and the result wasn’t inversed before being returned. That shaved off a number of cycles but still took three times the cycles of the built in sqrtf function.
Faster_Sqrtf is so inaccurate as it is that I can’t recommend using it anywhere for any reason. InvSqrt is still a relavent function if you’re not programming on a system which has the square root function built in hardware. But, if you’re programming on a modern PC (SSE or 3DNow! minimum) then you’re better off just using the built in sqrtf function and not worrying about a faster way to do it. It quite simply doesn’t exist.
As you can see in the code, all the tests were done by generating 10,000 random floating point numbers and passing the same numbers into all the square root functions. The results are then the average cycle counts to complete the function.
And that concludes this look at different ways to find the square root of a number.
For Fun: Binary Addition
Source Code
ffi.binaryadd.cpp
Functionally speaking, this is a pointless exercise. All we’re doing is adding two numbers without using + or -. Conceptually speaking, this is an excellent way to learn about binary operators which you will be using in your code. C++ has 6 binary operators:
1 2 3 4 5 6 | & - AND | - OR ^ - XOR ~ - NOT >> - shift right << - shift left |
And each of these has a logical counterpart
1 2 3 4 5 6 | && - AND || - OR ^ - XOR ! - NOT > - greater than < - less than |
Binary operators work on the bit level while logical operators compare expressions.
1 2 3 4 5 6 7 8 9 10 11 12 | //bitwise operator char a,b,c; a = 0; b = 1; c = a | b; //c will equal 1 //logical operator if(a || b) c=1; else c=0; |
We get the same result. We just go about it a different way. The first thing we do in our code is define OR, AND, XOR and NOT using only true and false. We could just use the given C++ binary operators but that’s not the point of this exercise. The point is to show you how these functions work under the surface from a logical standpoint.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | bool OR(bool a, bool b) { if(a) return true; if(b) return true; return false; } bool AND(bool a, bool b) { if(a) if(b) return true; return false; } bool XOR(bool a, bool b) { if(a) { if(b) return false; return true; } if(b) return true; return false; } bool NOT(bool a) { if(a) return false; return true; } |
I actually had to implement this in Scheme for a college course so don’t act too surprised if you find this on a homework assignment for a Computer Science course. Now that we have our “logic gates” (as they’re called in digital circuits) it’s time to do something with them.
1 2 3 4 | char Half_Adder(bool a, bool b) { return (XOR(a,b)<<1 | AND(a,b)); } |
A “half adder” adds two bit together and returns a carry bit and a sum bit. If you add 1 and 1 in binary you get 10 which is 2 in decimal. 1 is carried over to the next bit and the sum is 0 for the current bit. The reason we’re using a char is because we need to return 2 bits and the easiest way to do that is to just set two bit in a byte to our value. The integer value of this function will either be 0, 1, 2 or 3. We use the C++ binary operator OR to combine the two bit values. If we were to use & instead we would only be able to return 0 or 3 since both bits would have to be 1 to get a 1, otherwise it would be 0. We used this idea in the JavaScript tutorials when packing pixels into a byte. That’s an immediate example of why this is good stuff to know.
1 2 3 4 | char Full_Adder(bool a, bool b, bool c) { return (XOR(XOR(a,b),c))<<1 | (OR(OR(AND(a,b),AND(c,a)),AND(c,b))); } |
You can see that the full adder is essentially the same but the number of operations increases dramatically. This is because binary systems can only operate on two values at once so we have to use the output from one system as an input to another when operating on more than two values. If you’d like to see the circuit diagram for this function you can go to the OpenTama section of Homebrew which is accessible from the IcarusIndie.com front page. The schemetics for OpenTama contain a full adder circuit diagram.
Our first section of code in the main() function passes all 8 possible binary combinations into the Full_Adder function to test its functionality. Next we use the full adder to add two 32-bit numbers together.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | //num1, num2, pos and final must be the same //number of bits unsigned int num1 = 1234567; unsigned int num2 = 7654321; unsigned int final = 0; //must be zero or //answer will be wrong unsigned int pos = 1; //must be odd number to start //temp var to store result of adder char res = 0; //must be defaulted to zero //or answer will be off by res%2 while(pos!=0) { res = Full_Adder(num1 & 1, num2 & 1, res & 1); num1 >>= 1; num2 >>= 1; final = (final<<1) | (bool)(res & 2); pos <<=1; } final = RevBits(final); printf("%i\n", final); |
num1 and num2 are the two numbers we will be adding. final is the final result of the addition and pos is our counter variable. Since we can’t use + we can’t just use a for loop. So instead, we start pos off at 1 and then shift the bit left 1 each iteration. After 32 iterations the bit will be chopped off the number and pos will be 0 and so we know we’re done. While pos is being shifted left, num1 and num2 are being shifted right and we keep grabbing the least significant bit. We start res off at 0 which signifies no carry and no sum. We then pass the least significant value of both num1 and num2 along with the current carry value to the full adder. That gives us the sum and carry for those two bits. We append the sum bit to the final value and pass the carry into the next iteration. The final result is the two numbers added but the bits are reversed. The least significant bit of the answer is set as the most significant bit. So we call a RevBits function to swap it around for us.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | unsigned int RevBits ( unsigned int val ) { unsigned int ret = 0; unsigned int b = 1; while (b != 0) { ret = ( ret << 1 ) | ( val & 1 ); val >>= 1; b <<= 1; } return ret; } |
You can see we do the same loop counter trick and the same binary operations to get the bits swapped. We then return the result and we’re done.
And that concludes this FFI on binary addition. You may never have to create a binary adder in software but even if you don’t, you still need to know how these operations work.