Fixed all 502518670 errors of more than 1 ulp for cbrtf() on amd64.

The maximum error was 3.56 ulps.

The bug was another translation error.  The double precision version
has a comment saying "new cbrt to 23 bits, may be implemented in
single precision".  This means exactly what it says -- that the 23 bit second
approximation for the double precision cbrt() may be implemented in
single (i.e., float) precision.  It doesn't mean what the translation
assumed -- that this approximation, when implemented in float precision,
is good enough for the final approximation in float precision.
First, float precision needs a 24 bit approximation.  The "23 bit"
approximation is actually good to 24 bits on float precision args, but
only if it is evaluated in double precision.  Second, the algorithm
requires a cleanup step to ensure its error bound.

In float precision, any reasonable algorithm works for the cleanup
step.  Use the same algorithm as for double precision anyway, although
this is much more than enough and is a significant pessimization.
Don't optimize or simplify anything by using double precision to
implement the float case, so that the whole double precision algorithm
can be verified in float precision.  A maximum error of 0.667 ulps is
claimed for cbrt(), and the max for cbrtf() using the same algorithm
shouldn't be different, but the actual max for cbrtf() on amd64 is now
0.9834 ulps.  (On i386 with -O1 the max is 0.5006 ulps, down from
< 0.7, due to extra precision.)
Bruce Evans 2005-12-11 13:22:01 +00:00
parent 1a787460ba
commit 6de073b4ef
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=153303


@@ -1,5 +1,6 @@
 /* s_cbrtf.c -- float version of s_cbrt.c.
  * Conversion to float by Ian Lance Taylor, Cygnus Support, ian@cygnus.com.
+ * Debugged by Bruce D. Evans.
  */
 /*
@@ -37,7 +38,7 @@ G = 3.5714286566e-01; /* 5/14 = 0x3eb6db6e */
 float
 cbrtf(float x)
 {
-	float r,s,t;
+	float r,s,t,w;
 	int32_t hx;
 	u_int32_t sign;
 	u_int32_t high;
@@ -64,6 +65,17 @@ cbrtf(float x)
 	s=C+r*t;
 	t*=G+F/(s+E+D/s);
+    /* chop t to 12 bits and make it larger than cbrt(x) */
+	GET_FLOAT_WORD(high,t);
+	SET_FLOAT_WORD(t,high+0x00001000);
+    /* one step Newton iteration to 24 bits with error less than 0.984 ulps */
+	s=t*t;		/* t*t is exact */
+	r=x/s;
+	w=t+t;
+	r=(r-t)/(w+r);	/* r-t is exact */
+	t=t+t*r;
     /* retore the sign bit */
 	GET_FLOAT_WORD(high,t);
 	SET_FLOAT_WORD(t,high|sign);
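
For readers without the FreeBSD tree at hand, the cleanup step added in
the last hunk can be tried in isolation with the sketch below.  It is
not the committed code: float_bits() and bits_float() are hypothetical
portable stand-ins for GET_FLOAT_WORD/SET_FLOAT_WORD, cbrt_cleanup() is
a made-up name, and the explicit mask is this sketch's assumption about
what "chop t to 12 bits" is meant to do.  It assumes x > 0, a normal
positive t, and plain float evaluation (as on amd64).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the GET_FLOAT_WORD/SET_FLOAT_WORD macros. */
static uint32_t
float_bits(float f)
{
	uint32_t u;

	memcpy(&u, &f, sizeof u);
	return u;
}

static float
bits_float(uint32_t u)
{
	float f;

	memcpy(&f, &u, sizeof f);
	return f;
}

/* Refine a rough approximation t of cbrt(x) with one iteration step. */
static float
cbrt_cleanup(float x, float t)
{
	float r, s, w;

	/*
	 * Assumption: keep 12 significant bits (the implicit 1 plus 11
	 * stored bits), then bump the last kept bit so t exceeds cbrt(x).
	 */
	t = bits_float((float_bits(t) & 0xfffff000u) + 0x00001000u);

	s = t * t;		/* 12 bits times 12 bits fits in 24 bits */
	r = x / s;
	w = t + t;
	r = (r - t) / (w + r);	/* algebraically (x - t^3) / (2*t^3 + x) */
	return t + t * r;
}
```

For example, cbrt_cleanup(27.0f, 3.1f) starts about 3% away from the
true root and returns a value within about 1e-4 of 3.  In the diff, the
step is applied to an approximation that is already good to about 23
bits, which is why it serves there as a cleanup step.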