Optimizations for pow() with const non-integer exponent?

Accepted answer
Score: 34

Another answer, because this is very different from my previous answer, and this one is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebyshev terms. It's best to keep the order odd, as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.

#include <stdlib.h>
#include <math.h>

// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebyshev polynomial coefs.
double pow512norm (
   double x)
{
   enum { N = 8 };  // enum keeps N a compile-time constant in C as well as C++

   // Chebyshev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const double Cn[N] = { 
       1.1758200232996901923,
       0.16665763094889061230,
      -0.0083154894939042125035,
       0.00075187976780420279038,
      // Wolfram alpha doesn't want to compute the remaining terms
      // to more precision (it times out).
      -0.0000832402,
       0.0000102292,
      -1.3401e-6,
       1.83334e-7};

   double Tn[N];

   double u = 2.0*x - 3.0;

   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }   

   double y = 0.0;
   for (int ii = N-1; ii >= 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }   

   return y;
}


// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
   double x)
{
   // Table of 2^(5*i/12) for i = 0..11. Note: initializing a static
   // array with pow() calls requires C++; in plain C, replace these
   // with precomputed literal constants.
   static const double pow2_512[12] = {
      1.0,
      pow(2.0, 5.0/12.0),
      pow(4.0, 5.0/12.0),
      pow(8.0, 5.0/12.0),
      pow(16.0, 5.0/12.0),
      pow(32.0, 5.0/12.0),
      pow(64.0, 5.0/12.0),
      pow(128.0, 5.0/12.0),
      pow(256.0, 5.0/12.0),
      pow(512.0, 5.0/12.0),
      pow(1024.0, 5.0/12.0),
      pow(2048.0, 5.0/12.0)
   };

   double s;
   int iexp;

   s = frexp (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 12);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 12;
   }

   return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}

Addendum: What's going on here?
Per request, the following explains how the above code works.

Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)

The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts (a worked numeric example follows the list):

  1. x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
  2. s^(5/12) is calculated via pow512norm(s).
  3. (2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
  4. 2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
  5. (2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
  6. (2^r)^(5/12) is calculated by the lookup table pow2_512.
  7. Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
  8. That is exactly what the math library function ldexp does.
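
For concreteness, here is the decomposition traced by hand for x = 1000 (a numeric check I've added; it is not part of the original answer). frexp returns s = 0.9765625 with iexp = 10; after the adjustment, s = 1.953125 and n = 9, so div gives q = 0 and r = 9:

1000^(5/12) = 1.953125^(5/12) * (2^9)^(5/12)
            = pow512norm(1.953125) * pow2_512[9] * 2^(5*0)
            ≈ 1.32172 * 13.45434
            ≈ 17.7828, matching the exact value 1000^(5/12) = 10^1.25 ≈ 17.78279.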

Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.

I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak havoc on your retirement portfolio.

Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.

Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.

I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)

pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.

I used Wolfram Alpha to find the Chebyshev coefficients Cn for me. For example, this query calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram Alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to Mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
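
If you would rather compute the coefficients yourself instead of relying on Wolfram Alpha, a standard trick is Gauss-Chebyshev quadrature, which turns the integral in the code comment above into a plain cosine sum. This is my own sketch, not part of the original answer, and cheb_coef is a hypothetical helper:

#include <math.h>

// Approximates Cn[n] = (2/pi) * integral of T_n(u)*((u+3)/2)^(5/12)/sqrt(1-u^2)
// over [-1,1]. The substitution u = cos(theta) reduces it to
//   (2/K) * sum_k f((cos(theta_k)+3)/2) * cos(n*theta_k)
// at the midpoints theta_k = pi*(k+0.5)/K. The zeroth term uses 1/pi, so halve it.
double cheb_coef (int n, int K)
{
   const double pi = 3.14159265358979323846;
   double sum = 0.0;
   for (int k = 0; k < K; ++k) {
      double theta = pi * (k + 0.5) / K;
      double x = (cos(theta) + 3.0) / 2.0;   // map [-1,1] onto [1,2]
      sum += pow(x, 5.0/12.0) * cos(n * theta);
   }
   double c = 2.0 * sum / K;
   return (n == 0) ? 0.5*c : c;
}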

The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to the magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)

pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

Score: 24

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.

It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )

exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
        __m128 ret = arg;
//      std::printf( "arg = %,vg\n", ret );
        // Apply a constant pre-correction factor.
        ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
                * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//      std::printf( "scaled = %,vg\n", ret );
        // Reinterpret arg as integer to obtain logarithm.
        asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "log = %,vg\n", ret );
        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//      std::printf( "powered = %,vg\n", ret );
        // Convert back to "integer" to exponentiate.
        asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "result = %,vg\n", ret );
        return ret;
}

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

The key is to match the overestimate with an underestimate, and take the average.

  • Compute x^0.8: four instructions, error ~ +3%.
  • Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
  • Compute x^0.4: one mulps.
  • Compute x^-0.2: one rsqrtps.
  • Compute x^2: one mulps.
  • Compute x^3: one mulps.
  • x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
  • x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
  • Average the above: one addps, one mulps.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, so a^1.25 = 1.5 and a = 1.5^0.8 = 1.38316186.

The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

int main() {
        __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
        std::printf( "Input: %,vg\n", x0 );

        // Approx 5% accuracy from one call. Always an overestimate.
        __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
        std::printf( "Direct x^2.4: %,vg\n", x1 );

        // Lower exponents provide lower initial error, but too low causes overflow.
        __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
        std::printf( "1.38 x^0.8: %,vg\n", xf );

        // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
        __m128 xfm4 = _mm_rsqrt_ps( xf );
        __m128 xf4 = _mm_mul_ps( xf, xfm4 );

        // Precisely calculate x^2 and x^3
        __m128 x2 = _mm_mul_ps( x0, x0 );
        __m128 x3 = _mm_mul_ps( x2, x0 );

        // Overestimate of x^2 * x^0.4
        x2 = _mm_mul_ps( x2, xf4 );

        // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
        __m128 xfm2 = _mm_rsqrt_ps( xf4 );
        x3 = _mm_mul_ps( x3, xfm4 );
        x3 = _mm_mul_ps( x3, xfm2 );

        std::printf( "x^2 * x^0.4: %,vg\n", x2 );
        std::printf( "x^3 / x^0.6: %,vg\n", x3 );
        x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
        // Final accuracy about 0.015%, 200x better than x^0.8 calculation.
        std::printf( "average = %,vg\n", x2 );
}

Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .


Update with stats

I implemented a little test harness and two x^(5/12) cases corresponding to the above.

#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
    __m128 ret = arg;
//  std::printf( "arg = %,vg\n", ret );
    // Apply a constant pre-correction factor.
    ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
        * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//  std::printf( "scaled = %,vg\n", ret );
    // Reinterpret arg as integer to obtain logarithm.
    asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "log = %,vg\n", ret );
    // Multiply logarithm by power.
    ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//  std::printf( "powered = %,vg\n", ret );
    // Convert back to "integer" to exponentiate.
    asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "result = %,vg\n", ret );
    return ret;
}

__m128 pow125_4( __m128 arg ) {
    // Lower exponents provide lower initial error, but too low causes overflow.
    __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );

    // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
    __m128 xfm4 = _mm_rsqrt_ps( xf );
    __m128 xf4 = _mm_mul_ps( xf, xfm4 );

    // Precisely calculate x^2 and x^3
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 x3 = _mm_mul_ps( x2, arg );

    // Overestimate of x^2 * x^0.4
    x2 = _mm_mul_ps( x2, xf4 );

    // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
    __m128 xfm2 = _mm_rsqrt_ps( xf4 );
    x3 = _mm_mul_ps( x3, xfm4 );
    x3 = _mm_mul_ps( x3, xfm2 );

    return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}

__m128 pow512_2( __m128 arg ) {
    // 5/12 is too small, so compute the sqrt of 10/12 instead.
    __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
    return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}

__m128 pow512_4( __m128 arg ) {
    // 5/12 is too small, so compute the 4th root of 20/12 instead.
    // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
    // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
    __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
    __m128 xover = _mm_mul_ps( arg, xf );

    __m128 xfm1 = _mm_rsqrt_ps( xf );
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 xunder = _mm_mul_ps( x2, xfm1 );

    // sqrt2 * over + 2 * sqrt2 * under
    __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
                                _mm_add_ps( xover, xunder ) );

    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    return xavg;
}

__m128 mm_succ_ps( __m128 arg ) {
    return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}

void test_pow( double p, __m128 (*f)( __m128 ) ) {
    __m128 arg;

    for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
            ! isfinite( _mm_cvtss_f32( f( arg ) ) );
            arg = mm_succ_ps( arg ) ) ;

    for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
            arg = mm_succ_ps( arg ) ) ;

    std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );

    int n;
    int const bucket_size = 1 << 25;
    do {
        float max_error = 0;
        double total_error = 0, cum_error = 0;
        for ( n = 0; n != bucket_size; ++ n ) {
            float result = _mm_cvtss_f32( f( arg ) );

            if ( ! isfinite( result ) ) break;

            float actual = ::powf( _mm_cvtss_f32( arg ), p );

            float error = ( result - actual ) / actual;
            cum_error += error;
            error = std::abs( error );
            max_error = std::max( max_error, error );
            total_error += error;

            arg = mm_succ_ps( arg );
        }

        std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
                    max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
    } while ( n == bucket_size );
}

int main() {
    std::printf( "4 insn x^12/5:\n" );
    test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
    std::printf( "14 insn x^12/5:\n" );
    test_pow( 12./5, & pow125_4 );
    std::printf( "6 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_2 );
    std::printf( "14 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_4 );
}

Output:

4 insn x^12/5:
Domain from 1.36909e-23
error max =      inf    avg =      inf  |avg| =      inf    to 8.97249e-19
error max =  2267.14    avg =  139.175  |avg| =  139.193    to 5.88021e-14
error max = 0.123606    avg = -0.000102963  |avg| = 0.0371122   to 3.85365e-09
error max = 0.123607    avg = -0.000108978  |avg| = 0.0368548   to 0.000252553
error max =  0.12361    avg = 7.28909e-05   |avg| = 0.037507    to  16.5513
error max = 0.123612    avg = -0.000258619  |avg| = 0.0365618   to 1.08471e+06
error max = 0.123611    avg = 8.70966e-05   |avg| = 0.0374369   to 7.10874e+10
error max =  0.12361    avg = -0.000103047  |avg| = 0.0371122   to 4.65878e+15
error max = 0.123609    avg =      nan  |avg| =      nan    to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max =      inf    avg =      nan  |avg| =      nan    to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05    |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05   |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06    |avg| = 0.000129923 to  2.63411
error max = 0.000787589 avg = 1.20745e-05   |avg| = 0.000129347 to   172629
error max = 0.000786553 avg = 1.62351e-05   |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06   |avg| = 0.00013037  to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339   avg = 0.000441158   |avg| = 0.00967327  to 6.46235e-27
error max = 0.0284342   avg = -5.79938e-06  |avg| = 0.00897913  to 4.23516e-22
error max = 0.0284341   avg = -0.000140706  |avg| = 0.00897084  to 2.77556e-17
error max = 0.028434    avg = 0.000440504   |avg| = 0.00967325  to 1.81899e-12
error max = 0.0284339   avg = -6.11153e-06  |avg| = 0.00897915  to 1.19209e-07
error max = 0.0284298   avg = -0.000140597  |avg| = 0.00897084  to 0.0078125
error max = 0.0284371   avg = 0.000439748   |avg| = 0.00967319  to      512
error max = 0.028437    avg = -7.74294e-06  |avg| = 0.00897924  to 3.35544e+07
error max = 0.0284369   avg = -0.000142036  |avg| = 0.00897089  to 2.19902e+12
error max = 0.0284368   avg = 0.000439183   |avg| = 0.0096732   to 1.44115e+17
error max = 0.0284367   avg = -7.41244e-06  |avg| = 0.00897923  to 9.44473e+21
error max = 0.0284366   avg = -0.000141706  |avg| = 0.00897088  to 6.1897e+26
error max = 0.485129    avg = -0.0401671    |avg| = 0.048422    to 4.05648e+31
error max = 0.994932    avg = -0.891494 |avg| = 0.891494    to 2.65846e+36
error max = 0.999329    avg =      nan  |avg| =      nan    to       -0
14 insn x^5/12:
Domain from 2.64698e-23
error max =  0.13556    avg = 0.00125936    |avg| = 0.00354677  to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06   |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06  |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06    |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06   |avg| = 0.000113713 to       32
error max = 0.000565453 avg = -1.61276e-06  |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06   |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06   |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06  |avg| = 0.000112415 to 1.84467e+19

I suspect accuracy of the more accurate 5/12 routine is being limited by the rsqrt operation.

Score: 20

Ian Stephenson wrote this code, which he claims outperforms pow(). He describes the idea as follows:

Pow is basically implemented using logs: pow(a,b) = x^(log_x(a)*b). So we need a fast log and fast exponent - it doesn't matter what x is, so we use 2. The trick is that a floating point number is already in a log style format:

a = M * 2^E

Taking the log of both sides gives:

log2(a)=log2(M)+E

or more simply:

log2(a)~=E

In other words, if we take the floating point representation of a number, and extract the Exponent, we've got something that's a good starting point as its log. It turns out that when we do this by massaging the bit patterns, the Mantissa ends up giving a good approximation to the error, and it works pretty well.

This should be good enough for simple lighting calculations, but if you need something better, you can then extract the Mantissa, and use that to calculate a quadratic correction factor which is pretty accurate.
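
To make the bit-pattern idea concrete, here is a minimal C sketch of the exponent-as-logarithm trick (my illustration, not Ian Stephenson's actual code). It treats the raw bits of a float as a fixed-point approximation of log2; 0x3f800000 is the bit pattern of 1.0f:

#include <stdint.h>
#include <string.h>

// (bits(a) - bits(1.0f)) / 2^23 approximates log2(a); scale that by b
// and map back the same way to approximate a^b. Positive finite a only,
// and b*log2(a) must stay within the normal exponent range.
float fastpow_bits (float a, float b)
{
   int32_t i;
   memcpy(&i, &a, sizeof i);                       // reinterpret float as int
   float log2a = (i - 0x3f800000) * (1.0f / (1 << 23));
   i = (int32_t)(b * log2a * (1 << 23)) + 0x3f800000;
   float r;
   memcpy(&r, &i, sizeof r);                       // reinterpret int as float
   return r;
}

The mantissa left in the low bits is what supplies the first-order correction the answer mentions; the quadratic correction refines it further.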

Score: 17

First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).

#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
  double cbrtx = cbrt(x);
  return cbrtx*sqrt(sqrt(cbrtx));
}

Score: 15

This might not answer your question.

The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.

If this is the case, try using a lookup table. Something like:

__attribute__((aligned(64)))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64)))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };

void apply_lut(const unsigned short lut[256], unsigned char *src, ...
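
To show how such a table might be populated, here is a sketch using the exact sRGB transfer function (my addition; the table name and scaling are assumptions, with an 8-bit sRGB index in and a 16-bit linear value out):

#include <math.h>

static unsigned short srgb_to_linear_lut[256];

// Fill the table once at startup with the exact sRGB-to-linear curve,
// scaled to 16 bits so 8-bit results can be dithered later.
static void init_srgb_to_linear (void)
{
   for (int i = 0; i < 256; ++i) {
      double s = i / 255.0;
      double lin = (s <= 0.04045) ? s / 12.92
                                  : pow((s + 0.055) / 1.055, 2.4);
      srgb_to_linear_lut[i] = (unsigned short)(lin * 65535.0 + 0.5);
   }
}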

If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.

(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)

Score: 4

I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with Sollya, and have a worst case relative error of 0.0144%.

inline double poly7(double x, double a, double b, double c, double d,
                              double e, double f, double g, double h) {
    double ab, cd, ef, gh, abcd, efgh, x2, x4;
    x2 = x*x; x4 = x2*x2;
    ab = a*x + b; cd = c*x + d;
    ef = e*x + f; gh = g*x + h;
    abcd = ab*x2 + cd; efgh = ef*x2 + gh;
    return abcd*x4 + efgh;
}

inline double srgb_to_linear(double x) {
    if (x <= 0.04045) return x / 12.92;

    // Polynomial approximation of ((x+0.055)/1.055)^2.4.
    return poly7(x, 0.15237971711927983387,
                   -0.57235993072870072762,
                    0.92097986411523535821,
                   -0.90208229831912012386,
                    0.88348956209696805075,
                    0.48110797889132134175,
                    0.03563925285274562038,
                    0.00084585397227064120);
}

inline double linear_to_srgb(double x) {
    if (x <= 0.0031308) return x * 12.92;

    // Piecewise polynomial approximation (divided by x^3)
    // of 1.055 * x^(1/2.4) - 0.055.
    if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
                                      1224.97114922729451791383,
                                      -100.23413743425112443219,
                                         6.60361150127077944916,
                                         0.06114808961060447245,
                                        -0.00022244138470139442,
                                         0.00000041231840827815,
                                        -0.00000000035133685895) / (x*x*x);

    return poly7(x, -0.18730034115395793881,
                     0.64677431008037400417,
                    -0.99032868647877825286,
                     1.20939072663263713636,
                     0.33433459165487383613,
                    -0.01345095746411287783,
                     0.00044351684288719036,
                    -0.00000664263587520855) / (x*x*x);
}

And the Sollya input used to generate the polynomials:

suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));

s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;

p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));

p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));

Score: 3

Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a.) You'll have to do some analysis to decide how many terms you need for your desired accuracy.
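
As a sketch of what that looks like (my illustration, with a hypothetical term count): write x = 1+t with t in [0,1) and sum the series term by term, building each binomial coefficient incrementally.

// Truncated binomial series for (1+t)^a, |t| < 1. Convergence slows
// badly as t approaches 1, so 'terms' must grow accordingly.
double binom_pow1p (double t, double a, int terms)
{
   double sum = 1.0, term = 1.0;
   for (int k = 1; k < terms; ++k) {
      term *= t * (a - (k - 1)) / k;   // C(a,k) * t^k, built incrementally
      sum += term;
   }
   return sum;
}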

Score: 1

For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lerp (or perhaps use a higher-order interpolation) to fill in the in-between values if the table wasn't accurate enough (basically a huge log table).

Or, compute value squared times value to the 2/5ths: this can reuse the squared value from the first half of the expression and then take its 5th root. For the 5th root, you could Newton it (see the sketch below) or do some other fast approximator, though honestly once you get to this point, you're probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.
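
For reference, Newton's iteration for the 5th root looks like this (a sketch, my addition; it assumes a reasonable initial guess):

// Newton's method on f(r) = r^5 - v gives r <- (4*r + v/r^4) / 5.
// Each iteration roughly doubles the number of correct digits.
double fifth_root (double v, double r /* initial guess */)
{
   for (int i = 0; i < 4; ++i) {
      double r4 = r * r * r * r;
      r = (4.0 * r + v / r4) / 5.0;
   }
   return r;
}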

Score: 1

The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less; a sketch follows). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
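
A minimal sketch of the first-order version (my illustration): since d/dx x^n = n*x^n/x, the update needs only the previous result, not another pow call.

// Given y = pow(x, n), approximate pow(x + delta, n) for small delta
// via the tangent line: y + n * (y / x) * delta.
double pow_nudge (double x, double y, double n, double delta)
{
   return y + n * (y / x) * delta;
}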

Score: 1

So traditionally powf(x, p) = x^p is solved by rewriting x as x = 2^(log2(x)), making powf(x, p) = 2^(p*log2(x)), which transforms the problem into two approximations, exp2() and log2(). This has the advantage of working with larger powers p; however, the downside is that it is not the optimal solution for a constant power p over a specified input bound 0 ≤ x ≤ 1.

When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4, as can be seen below:

float pow12_5(float x){
    float mp;
    // Minimax Horner polynomials for x^(12/5). Note: choose the accuracy required, then implement with fma() [Fused Multiply Accumulates]
    // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
    mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4))));  // 2.6079002e-4
    // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4))));  // 8.61377e-5
    // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4))))));  // 3.524655e-5
    return(mp);
}

However, when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y = x^p = x^(p+m)/x^m where m = 1,2,3 is a positive integer, making the new power approximation p > 1, but this introduces division which is inherently slower.

There's however another option, which is to decompose the input x into its floating point exponent and mantissa form:

x = mx * 2^(ex), where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
  = mx^(5/12) * 2^(k/12) * 2^(ey)

The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires a 12-entry LUT for the 2^(k/12). The code is below:

#include <math.h>     // floorf
#include <stdint.h>   // uint32_t

float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0,  0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
    union{float f; uint32_t u;} v, e2;
    float poff, m, e, ei;
    int xe;

    v.f = x;
    xe = ((v.u >> 23) - 127);

    if(xe <= -127) return(0.0f);    // flush zero and denormals to zero

    // Calculate remainder k in 2^(k/12) to find LUT
    e = xe * (5.0f/12.0f);
    ei = floorf(e);
    poff = powk_12LUT[(int)(12.0f * (e - ei))];

    e2.u = ((int)ei + 127) << 23;   // Calculate the exponent
    v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero

    // Approximate mx^(5/12) on [1,2), with appropriate degree minimax
    // m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4));    // 7.6125e-4
    // m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8));  // 8.4522726e-5
    m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8))));   // 1.04091259e-5
    // m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8))));    // 1.367401e-6

    return(m * poff * e2.f);
}
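
A quick way to sanity-check the routine against the library (my addition, assuming the includes above):

#include <stdio.h>

int main (void)
{
   float x = 1234.567f;
   printf("pow5_12: %g  powf: %g\n", pow5_12(x), powf(x, 5.0f/12.0f));
   return 0;
}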
