Use double precision internally to optimize cbrtf(), and change the
algorithm for the second step significantly to also get a perfectly
rounded result in round-to-nearest mode.  The resulting speedup
is about 25% on Athlon64s and 30% on Athlon XPs (about 25 cycles
out of 100 on the former).

Using extra precision, we don't need to do anything special to avoid
large rounding errors in the third step (Newton's method), so we can
regroup terms to avoid a division, increase clarity, and increase
opportunities for parallelism.  Rearrangement for parallelism loses
the increase in clarity.  We end up with the same number of operations
but with a division reduced to a multiplication.
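
The regrouping described above can be sketched as follows.  This is only
an illustration, not code from the commit: the helper names
step_two_divisions() and step_one_division() are hypothetical.  The first
mirrors the grouping of the old single-precision step, and the second
mirrors the new expression T*(x+x+r)/(x+r+r).  In exact arithmetic both
compute the same update, t*(2x + t^3)/(x + 2t^3); only the number of
divisions differs.

```c
#include <assert.h>

/*
 * Old grouping, as in the removed single-precision step: two divisions.
 * Hypothetical helper name, for illustration only.
 */
static float
step_two_divisions(float t, float x)
{
	float s, r, w;

	assert(t != 0.0f);
	s = t * t;
	r = x / s;		/* first division */
	w = t + t;
	r = (r - t) / (w + r);	/* second division */
	return (t + t * r);
}

/*
 * Regrouped form, t*(2x + t^3)/(x + 2t^3): one division.
 * Hypothetical helper name, for illustration only.
 */
static double
step_one_division(double T, double x)
{
	double r;

	r = T * T * T;
	assert(x + r + r != 0.0);
	return (T * (x + x + r) / (x + r + r));
}
```

For example, with x = 27 and a starting guess of 3.05, both helpers
return a value very close to 3 after a single step, and an exact cube
root is a fixed point of both.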

Using specifically double precision, there is enough extra precision
for the third step to give enough precision for perfect rounding to
float precision provided the previous steps are accurate to 16 bits.
(They were accurate to 12 bits, which was almost minimal for imperfect
rounding in the old version but would be more than enough for imperfect
rounding in this version, where 9 bits would suffice.)  I couldn't
find any significant time optimizations from optimizing the previous
steps, so I decided to optimize for accuracy instead.  The second step
still needed a division even though a previous commit had optimized it to
use a polynomial approximation for its main part, and this division dominated
the time for the second step.  Use the same Newton's method for the
second step as for the third step since this is insignificantly slower
than the division plus the polynomial (now that Newton's method only
needs 1 division), significantly more accurate, and simpler.  Single
precision would be precise enough for the second step, but doesn't
have enough exponent range to handle denormals without the special
grouping of terms (as in previous versions) that requires another
division, so we use double precision for both the second and third
steps.
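
A minimal sketch of the resulting two-step scheme, under stated
assumptions: the helper name cbrtf_sketch() is hypothetical, it accepts
only positive normal floats (the sign and denormal handling of the real
function are omitted), it uses memcpy() in place of the SET_FLOAT_WORD()
macro, and it converts x to double explicitly before the arithmetic.
B1 is the constant from the source file.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Magic constant from the source file, used for the rough estimate. */
static const uint32_t B1 = 709958130;

/* Hypothetical helper: positive normal inputs only. */
static float
cbrtf_sketch(float x)
{
	uint32_t bits;
	float t;
	double dx, r, T;

	assert(x >= FLT_MIN && x <= FLT_MAX);

	/* Rough estimate: divide the float's bit pattern by 3, add B1. */
	memcpy(&bits, &x, sizeof(bits));
	bits = bits / 3 + B1;
	memcpy(&t, &bits, sizeof(t));

	dx = x;

	/* First step in double precision (about 16 bits per the message). */
	T = t;
	r = T * T * T;
	T = T * (dx + dx + r) / (dx + r + r);

	/* Second step in double precision (about 47 bits per the message). */
	r = T * T * T;
	T = T * (dx + dx + r) / (dx + r + r);

	/* The final rounding to float happens in this conversion. */
	return ((float)T);
}
```

One way to exercise the sketch is to compare its result against
(float)cbrt((double)x) from <math.h> over a range of inputs; exact cubes
such as 8, 27 and 1000 make convenient spot checks.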
Bruce Evans 2006-01-05 07:57:31 +00:00
parent 5aedf535d3
commit 4bb9780353
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=154049

@@ -28,16 +28,11 @@ static const unsigned
 	B1 = 709958130, /* B1 = (127-127.0/3-0.03306235651)*2**23 */
 	B2 = 642849266; /* B2 = (127-127.0/3-24/3-0.03306235651)*2**23 */
-/* |1/cbrt(x) - p(x)| < 2**-14.5 (~[-4.37e-4, 4.366e-5]). */
-static const float
-P0 = 1.5586718321, /* 0x3fc7828f */
-P1 = -0.78271341324, /* -0xbf485fe8 */
-P2 = 0.22403796017; /* 0x3e656a35 */
 float
 cbrtf(float x)
 {
-	float r,s,t,w;
+	double r,T;
+	float t;
 	int32_t hx;
 	u_int32_t sign;
 	u_int32_t high;
@@ -58,27 +53,17 @@ cbrtf(float x)
 	} else
 		SET_FLOAT_WORD(t,sign|(hx/3+B1));
-	/* new cbrt to 14 bits */
-	r=(t*t)*(t/x);
-	t=t*((P0+r*P1)+(r*r)*P2);
+	/* first step Newton iteration (solving t*t-x/t == 0) to 16 bits */
+	/* in double precision to avoid problems with denormals */
+	T=t;
+	r=T*T*T;
+	T=T*(x+x+r)/(x+r+r);
-	/*
-	 * Round t away from zero to 12 bits (sloppily except for ensuring that
-	 * the result is larger in magnitude than cbrt(x) but not much more than
-	 * 1 12-bit ulp larger). With rounding towards zero, the error bound
-	 * would be ~5/6 instead of ~4/6, and with t 2 12-bit ulps larger the
-	 * infinite-precision error in the Newton approximation would affect
-	 * the second digit instead of the third digit of 4/6 = 0.666..., etc.
-	 */
-	GET_FLOAT_WORD(high,t);
-	SET_FLOAT_WORD(t,(high+0x1800)&0xfffff000);
+	/* second step Newton iteration to 47 bits */
+	/* in double precision for accuracy */
+	r=T*T*T;
+	T=T*(x+x+r)/(x+r+r);
-	/* one step Newton iteration to 24 bits with error < 0.669 ulps */
-	s=t*t; /* t*t is exact */
-	r=x/s; /* error <= 0.5 ulps; |r| < |t| */
-	w=t+t; /* t+t is exact */
-	r=(r-t)/(w+r); /* r-t is exact; w+r ~= 3*t */
-	t=t+t*r; /* error <= 0.5 + 0.5/3 + epsilon */
-	return(t);
+	/* rounding to 24 bits is perfect in round-to-nearest mode */
+	return(T);
 }