Rearranged the polynomial evaluation some more to reduce dependencies.

Instead of echoing the code in a comment, try to describe why we split
up the evaluation in a special way.

The new optimization is mostly to move the evaluation of w = z*z later
so that everything else (except z = x*x) doesn't have to wait for w.
On Athlons, FP multiplication has a latency of 4 cycles so this
optimization saves 4 cycles per call provided no new dependencies are
introduced.  Tweaking the other terms to reduce dependencies saves a
couple more cycles in some cases (more on AXP than on A64; up to 8
cycles out of 56 altogether).  The previous version had
a similar optimization for s = z*x.  Special optimizations like these
probably have a larger effect than the simple 2-way vectorization
permitted (but not activated by gcc) in the old version, since 2-way
vectorization is not enough and the polynomial's degree is so small
in the float case that non-vectorizable dependencies dominate.

On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now
takes 34-55 cycles (was 39-59 cycles).
Bruce Evans 2005-11-28 11:46:20 +00:00
parent e972931825
commit 1dd21062e5
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=152881

@@ -39,17 +39,29 @@ extern inline
 float
 __kernel_tandf(double x, int iy)
 {
-	double z,r,w,s;
+	double z,r,w,s,t,u;
 	z = x*x;
-	w = z*z;
-	/* Break x^5*(T[1]+x^2*T[2]+...) into
-	 *	  x^5*(T[1]+x^4*T[3]+x^8*T[5]) +
-	 *	  x^5*(x^2*(T[2]+x^4*T[4]))
-	 */
-	r = (T[1]+w*(T[3]+w*T[5])) + z*(T[2]+w*T[4]);
+	/*
+	 * Split up the polynomial into small independent terms to give
+	 * opportunities for parallel evaluation.  The chosen splitting is
+	 * micro-optimized for Athlons (XP, X64).  It costs 2 multiplications
+	 * relative to Horner's method on sequential machines.
+	 *
+	 * We add the small terms from lowest degree up for efficiency on
+	 * non-sequential machines (the lowest degree terms tend to be ready
+	 * earlier).  Apart from this, we don't care about order of
+	 * operations, and don't need to care since we have precision to
+	 * spare.  However, the chosen splitting is good for accuracy too,
+	 * and would give results as accurate as Horner's method if the
+	 * small terms were added from highest degree down.
+	 */
+	r = T[4]+z*T[5];
+	t = T[2]+z*T[3];
+	w = z*z;
 	s = z*x;
-	r = (x+s*T[0])+(s*z)*r;
+	u = T[0]+z*T[1];
+	r = (x+s*u)+(s*w)*(t+w*r);
 	if(iy==1) return r;
 	else return -1.0/r;
 }