Programming tips for efficiency and problem-avoidance
Many of these will seem quite simple, though do note that I wouldn't be mentioning them if I hadn't encountered some sort of problem to be solved or found myself picking through some spurious-looking code in the past. These are also almost entirely mathematical in nature.

1. Police all input values to any ArcCos or ArcSin function to be sure that they're definitely between -1 and +1.

If there's ever any doubt, by which I mean if it is anything less than mathematically certain that your input values will be in that range, clamp them so that they will be. Here, 'clamp' means 'if the value is greater than 1 or less than -1, set it to be 1 or -1'.

A good example of this would be a mobile application reading the axis information from the device's accelerometer. While I can't speak for Apple devices, the raw data on Android does appear to be trigonometric given a plot of it at known tilt angles. The only problem was that these 'trigonometric' values were breaching 1, reaching as much as +/- 1.2 for one axis.

The idea of applying a scale factor or mapping did occur to me, though that has a couple of issues:
• One can not assume that all Android devices will use that same scale factor or mapping, or even need it at all.
• Even with such present, one still needs to clamp the value since it is a mistake to rely on the hardware functioning correctly.

The second point there is important - one's app should not crash due to a few spurious axis readings. Literally anything else would be better - clamp the value into the correct range, choose some default behaviours if the value is out of range on either side, throw an exception such that the caller can decide what to do - anything except allow it to crash.

On a similar note...

2. If it makes sense mathematically to represent something as a floating-point number between -1 and 1 (or 0 and 1), do not use a larger range in an attempt to avoid 'wasting' possible values.

Numerically, that tiny range between -1 and 1 may seem like nothing compared to the vast region of possible values a floating-point number can take - ~10308 for double-precision numbers. However, considering how floating-point numbers are represented, restricting usage to the range between -1 and 1 actually only wastes one bit. Of the 64 bits used by a double-precision number, losing 1 bit is not too bad at all, especially considering how often a 32-bit integer is used to represent stuff which may never reach as much as 10 (this would be 28 bits wasted - there's a spin-off movie title for you). In return for your implementation making mathematical sense, it's a good trade.

To see why this is the case, note that the 64 bits of your number are split across the following duties:
• 1 bit for the overall sign of the number, positive or negative.
• A few bits for the (binary) exponent, the order of magnitude in terms of powers of 2. This is a mini-integer which can itself be positive or negative for representing large or small magnitudes.
• All remaining bits (i.e. the vast majority) for the mantissa (or 'significand'), which provides the precision in terms of binary 'significant figures'.

As should be obvious with powers, when the exponent portion is zero, we are representing a number around 1.0 (actually the range 1.0 exclusive to 2.0 inclusive). All values larger than 1.0 are handled by using a zero or positive exponent, and all values 1.0 or smaller (admitting special handling for 0.0) will have a negative exponent. Therefore, the act of scrapping the range of values larger than 1.0 is the same as using only negative exponents, which is the same as fixing the 'sign bit'1 of the exponent so that it's always negative. This one bit is wasted by being fixed, but that's it - despite the modest magnitude, half of the entire set of possible floating-point numbers is contained within -1 to 1.

Of course, a number between 0.0 and 1.0 effectively also wastes the overall sign bit as well. I don't dispute that perhaps wasting 2 bits of a 32-bit float may be a problem in the case of a low-memory device which handles a vast amount of them, but the point is there, and I'm saying to make a serious valuation of that memory before going for the memory efficiency overkill.

3. Do not rely on a unit vector (of any dimension) to remain length 1 under repeated rotations.

This is a simple issue of rounding errors, and it involves the unfortunately costly fix (in terms of CPU time) of floating-point division.

The case which is acceptable is where the final position is recomputed each time by starting with a fixed vector which is known to be close enough to length 1, and performing a few (e.g. 2 or 3) rotations on it. The result won't have any opportunity to stray, thus no expensive normalisation is required, however do consider that the fixed nature of the starting point can make this susceptible to the gimbal lock effect. In the case of animating or otherwise finding some intermediate point between fixed start and end orientations, look into quaternions and the use of Spherical Linear Interpolation, amusingly known as a Slerp.

Conversely, if each rotation depends on reading back the result of the previous and applying a new incremental adjustment, normalise it. Those rounding errors hate you and are out to get you, they will compound, and they can do so in either direction (towards zero or towards infinity) depending on the rotation. I have caused some very strange things to appear on my screen due to this problem; 'mindfuck' would be stating it lightly.

Perhaps if you have a massive amount of very small rotations it may be worth testing the squared length2 of the vector and normalising it when it strays too far from 1.0, or even just limiting blindly the normalisations to e.g. one per ten rotations performed, as long as something is done to keep those results in check. The former could actually be done quite efficiently; using a grossly large tolerance as an example, suppose we want to keep the vector length within 10% of 1:
• Pre-compute and/or store the static values 1.1 (i.e. 110% of 1), its inverse, and the squares of each of those.
• When the vector is rotated, compute its squared length (e.g. x2 + y2 + z2 for a 3D vector).
• If the result is greater than the square limit, apply the inverse limit as a scale factor.
• If the result is less than the inverse square limit, apply the limit as a scale factor.
No square-roots involved!

4. Any time you have a division by a square-rooted value, use Fast Inverse Square Root.

That lil' algorithm is completely insane, most likely a result of random experimentation and discovery rather than invention, but it works.

The basic idea is sensible:
• Do something inexpensive which happens to produce a reasonable approximation or first guess at the answer.
• Run a couple of iterations of Newton-Raphson to home in on the correct result.

As stated in the article, when the result doesn't matter too much (as in the case of lighting calculations for 3D graphics, one of the most prolific consumers of inverse square-roots, where the difference is barely visible) it is even fine to use just a single iteration.

This gem is part of my standard personal toolkit, and I've done a bit of testing for accuracy against efficiency (tested against a control case of using floating-point arithmetic). Here's what I found by implementing a double-precision with a magic constant of 0x5fe6eb50c7b537a9 (where 'SF' refers to decimal significant figures):
• With 1 iteration of Newton-Raphson, the result was accurate to ~3 SF, and took ~14% of the time.
• With 2 iterations, the result was accurate to ~5 SF, and took ~23% of the time.
• With 3 iterations, the result was accurate to ~8 SF, and took ~34% of the time.
• With 4 iterations, the result was accurate to ~14 SF, and took ~47% of the time.
Note that the precision of a 64-bit floating-point number is only 15 SF; there is effectively nothing to be gained by refining the result any further, and the computation is still just over twice as fast as with regular floating-point arithmetic.

5. When working with randomly-generated numbers, pay attention to the probability distribution.

Specifically, be sure that what you're doing achieves the intended distribution, be this uniform (each outcome has an equal chance) or otherwise. The correct approach here will be case-by-case depending upon what one is trying to accomplish, so I'll list a few examples. Fair warning; this topic interests me, and I have a lot to say about it.

If you're trying to simulate rolling some 6-sided dice, a simple uniform random number from 1 to 6 (for each roll) is correct.

If you want a single function which simulates rolling two dice, a uniform random number from 2 to 12 is most certainly not correct. The eleven possible outcomes would each have an equal chance of occurring (like a flat line at 1/11), whereas the distribution is actually more like a triangular hill, with the middle result of 7 peaking at 1/6 chance, and the extreme results of 2 and 12 each at only 1/36. As more dice are added to the sum, or indeed as any random variables in general are added (subject to a few obscure conditions), the shape of the distribution gets ever-closer to looking like the famous Normal Distribution (see this image in particular) to the point where it can become faster to use that distribution to approximate a good random result. This predictable nature of summing random variables is called the Central Limit Theorem.

It is also possible, particularly in the context of floating-point numbers, to have uniform random values give you a non-uniform end result. Suppose you want to generate a pair of X-Y coordinates which represent a random point somewhere on a circle (hereafter assumed to be the unit circle with radius 1); one strategy could be to generate the X value uniformly between -1 and 1 then calculate what the corresponding Y value would be (the square-root of 1 - x2, noting that the square-root means we also need a random choice of positive or negative to capture both halves of the circle). This does indeed give points on the unit circle and is indeed random, however it is not uniform in the circular sense - the results will tend to shy away from the left and right of the circle, bunching towards the top and bottom instead.

This is because what we're doing there amounts to drawing a circle and sweeping a vertical line from left to right across it (at a constant speed) while looking at where the line intersects the circle. Note that the two intersection points begin together but diverge rapidly at first, make a leisurely stop at the top and bottom of the circle, then start converging rapidly again. The reason that we get more points at the top/bottom is the same as why the intersection in that experiment spends more time there. Clearly we would need our initial random number to be non-uniform in some way as a means of smoothing out the final result, but what function or mapping should we use?

Before continuing down that path, you may have noticed that everything becomes a lot easier if we think about it not in terms of generating one of the coordinates directly, but simply generating a uniform angle between 0Â° and 360Â° (also known as 2π radians for the purpose of feeding the angle into trigonometric functions). By basic trigonometry and indeed by definition, if our angle is θ then our X value is cos θ and our Y value is sin θ, with the added bonus of covering the entire range of the circle in one go this time. In that respect, the 'mapping' we would have needed would be a sinusoidal shape.

That said, there was a purpose to describing the former method - stepping it up to generating a uniform random point on the unit sphere, also known as a 3D unit vector. These can be useful for animating random movement in 3D, or generating a random sky-like backdrop by placing items/features in random positions. Using the idea of rotation again (i.e. using a random point on the circle then spinning that along a circle in the XZ or YZ plane) does also work and gives an equivalent result, but I would like to stick with using a mapped sweep of the newly-introduced axis, for a reason which will become clear.

So, the idea is to generate a random depth (Z coordinate) between -1 and 1, slice a flat plane through the sphere at that position, and generate a random point on the circular intersection. Of course, this time we know that it can't be uniformly between -1 and 1 since the poles of the Z axis will be starved of points; as the flat plane sweeps across the sphere, it scans the surface area much more quickly around the poles, distributing fewer points there. Instead of generating the depth directly, we can generate another uniform angle φ, then take sin φ to be our depth. We still need our random point on the circle from before, but scaled down by cos φ to match the size of the intersecting circle. Our random point is now:
• x = cos φ . cos θ
• y = cos φ . sin θ
• z = sin φ

The formulae look a bit uneven, with Z having a seemingly different treatment than X and Y, but the results are pretty good.

Speaking of quaternions earlier, that same scheme can be used again to bump the 3D result up to a 4D result; great for uniform random rotations in 3D with the twist/roll element included. This is where it is useful to avoid having to think in terms of spinning the sphere into the 4th dimension; we just need to churn out a sinusoidal value for the 4th axis (W), and scale the rest of the sphere by the cosine. With a random uniform angle ψ:
• x = cos ψ . cos φ . cos θ
• y = cos ψ . cos φ . sin θ
• z = cos ψ . sin φ
• w = sin ψ

One could also un-nest the equations a bit there by considering a quaternion to be a sort of combination of two circles in XY and ZW, however it doesn't really save much on computation3 because you still need a third angle or 'phase' to decide the weightings of each circle on the final result (cos ψ of one and sin ψ of the other):
• x = cos ψ . cos θ
• y = cos ψ . sin θ
• z = sin ψ . cos φ
• w = sin ψ . sin φ

The last point I'd like to make about generating random N-dimensional vectors in general is to describe one other dubious thing I've seen in the past - generate values for each axis uniformly between -1 and 1, then normalise the result so that the overall vector has length 1. Once again, this is random but not uniform - for example, doing this in 3D will generate vectors which cluster around the diagonals and to a lesser extent the space between them, like eight fat corners of a cube with a thin cubic frame connecting them. This is because they literally are that - we're generating random points in a cube and then just squashing down to a sphere, so the areas of the sphere which correspond to the edges and corners of the original cube have had more volume and therefore more opportunity for points to be generated there.

The author of that scheme even added a cheeky attempt at a fix for this - during the normalisation, if the vector length is greater than 1 (i.e. it lies in some edge or corner of the cube which isn't part of the enclosed sphere) then discard it and generate another. To its credit, that version does actually achieve a uniform distribution, however it is horribly inefficient to the point that it can now no longer be called an algorithm. Specifically, one requirement for something to be defined as an 'algorithm' is that it is guaranteed to finish within some finite amount of time. Naturally, a scheme based around discarding and regenerating random numbers when the overall set is beyond the desired range has the potential to get stuck in a loop forever (despite having a finite expected time which is statistically known).

There. I must persist! If I keep writing it, it will become interesting!

Finally, and for a bit of a break from the mathematical stuff...

6. In a multithreaded situation, look for opportunities to avoid having to synchronise access to objects.

It sounds like a filthy, negligent thing to be doing, but in my mind it makes sense to reduce thread blockage due to overzealous synchronisation (or if you can't be arsed).

Consider a simple example - an object which stores a reference to a callback (to be performed when some other event happens), which may be null if there is no callback. Imagine that the program contains the following:
• One worker thread, responsible for running the callback when needed.
• Another worker thread, responsible for setting/updating what the callback will be.
• Synchronised access such that the callback is never updated while it is also currently being run.
• A main thread, which sometimes checks whether a callback exists or not, and fetches some information from the callback object itself (if it does exist).

In theory, the main thread should have its access synchronised with the second4 worker thread, since any change made while the information fields are being recorded will produce inconsistent results at best (if the callback was switched for a different one) or crash at worst (if the callback was removed). In practice, you can just make a temporary copy of the object reference then operate on that while ignoring synchronisation completely - the result will be the same, and it avoids blocking the worker threads for the sake of reading a few fields. The point is that it doesn't matter if the copy is made just before the main reference is updated or removed; it acts as a faithful snapshot which you know can not be affected by the other threads.

Here's a loose example (Java-style) of what I mean:

void printCallbackInfo1()
{
if (getCallback() != null)
{
getCallback().printInfo(); // May crash if the callback is set to null after the not-null test
getCallback().printStats(); // May be inconsistent if the callback is changed during the call to printInfo
}
}

void printCallbackInfo2()
{
CallbackObject obj = getCallback(); // Use one fetch only; the saved reference can not be changed by another thread
if (obj != null)
{
obj.printInfo();
obj.printStats();
}
}

If that suggestion makes you flinch, welcome to the dark side! In all fairness, I will reconsider if given a good reason not to do the above; I just haven't been able to think of one, and I haven't encountered any problems thus far.

*

1 Signed integers don't actually have something called a 'sign bit' directly, however their implementation is such that one bit does decide the sign. As it happens, these exponents use 'binary offset' rather than 2's complement, but the 'sign bit' behaviour is functionally the same (a direct inversion, in fact).

2 If a test involving the length of a vector can be performed just as well on the squared length, use the squared length every time. Simple examination of trigonometry (specifically Pythagoras) should tell you that finding the squared length of a vector is an intermediary step to finding the length itself, the only hurdle being an expensive square-root operation. For example, if you want to know whether two points in 3D space are within 4 units of each other:
• Use subtraction to find their difference in position.
• Compute the squared length of that result (x2 + y2 + z2). Use x*x, not pow(x, 2.0).
• Test to see if that value is less than 16.
Even if the 4 is made into a variable, it is still cheaper to square it for the purpose of performing the squared length test than it is to square-root the length.

3 It looks as if we now have 4 multiplications instead of 5, however note that cos ψ . cos φ appears twice in the former set of equations. Obviously, that result can be re-used at the cost of storing it temporarily, but this matches the fact that the latter version has to store the result of sin ψ.

4 At least, that is. I'm keeping the example somewhat realistic by allowing that the main thread can at least read information from the callback object while it is being called. If we demand that reading and running should be synchronized, that is best implemented separately as part of the callback class itself.
Comments
Comment thread »
No comments!