Fast inverse-trigonometric functions in SSE

The inverse-trigonometric functions, particularly on the Pentium 4, are extremely slow; the only one in the instruction set is ATAN, and CHECK LINK the P4 Optimisation Guide (an absolutely indispensable reference for all performance-concerned programmers) claims that this takes between 150 and 300 cycles. Admittedly, it's a very general-purpose instruction, and is guaranteed to give 64 binary places of accuracy for almost arbitrarily unpleasant input values: but often you don't need that generality or that accuracy.

arccos(x)

I decided to approximate arccos, since that was the function I needed for spherical texture mapping. From the graph to the left, you see that the arccos function is nearly-symmetrical around zero, and has infinite gradient at -1 and +1. Infinite gradient isn't a good sign for approximation algorithms, so, looking at the series expansion of arccos around 1, I decided to take advantage of the RSQRTPS instruction, and approximate the much more pleasant function sqrt(2x) arccos(1-x), whose graph you see below.

 

This function is nearly linear, so doesn't need too high a degree of polynomial approximant. I tried various degrees, constructing the approximation by a least-squares fit to the function measured at 100,000 points uniformly spaced between 0 and 1, and measuring the maximum difference between the approximation and the real function at rather more points. Degree 5 turned out to be adequate, and the relevant polynomial is

apx = -0.000007239283986332 +
2.000291665285952400 * x +
0.163910606547823220 * x2+
0.047654245891495528 * x3 -
0.005516443930088506 * x4 +
0.015098965761299077 * x5

sqrt(2*x)*arccos(1-x)