C++ Optimization Trick

Maybe this is old hat, but when I was writing a software rasterizer several years ago I came up with a trick for getting optimized variations for different render states (textured, flat shaded, Gouraud shaded, wireframe, z-write, z-test, cull-ccw, etc) without hand-writing multiple permutations of the rasterizer as is recommended by LaMothe and others.

It requires an optimizing C++ compiler that supports template functions. The trick is to pack as much state as possible into a single POD value like a 32-bit integer, write your function so that all of the if() tests for render state look at the integer for branching decisions, and call a template function that takes this value as a template parameter in a constant context like a switch statement.

Here's the base function full of nasty branches:

result_t func(int flags, const data_t& data) {

 if (flags & OPT_A) {
   // ...
 }

 if (flags & OPT_B) {
   if (flags & OPT_C) {
     // ...
   }
   else {
     // ...
   }
 }
 else {
   // ...
  }
}

Write this wrapper template function:

template <int flags>
result_t func(const data_t& data) {
  return func(flags, data);
}

Then call the versions you need optimized with this wrapper function:

result_t call_func(int flags, const data_t& data) {
  switch (flags) {
    case 0: return func<0>(data);
    case OPT_A: return func<OPT_A>(data);
    case OPT_A | OPT_B: return func<OPT_A | OPT_B>(data);
    case OPT_A | OPT_C: return func<OPT_A | OPT_C>(data);
    case OPT_A | OPT_B | OPT_C: return func<OPT_A | OPT_B | OPT_C>(data);
    default: return func(flags, data);
  }
}

Or with macros can be used to improve readability:

#define OPTIMIZE_CASE(f,n,d) case (n): return f<n>(d)
result_t call_func(int flags, const data_t& data) {
  switch (flags) {
    OPTIMIZE_CASE(func, 0, data);
    OPTIMIZE_CASE(func, OPT_A, data);
    OPTIMIZE_CASE(func, OPT_A | OPT_B, data);
    OPTIMIZE_CASE(func, OPT_A | OPT_C, data);
    OPTIMIZE_CASE(func, OPT_A | OPT_B | OPT_C, data);
    default: return func(flags, data);
  }
}

The compiler will expand the template to produce separate versions of the function with the specified flags as a constant template parameter, and most of the branches will be optimized away.

Note that you don't need to include every permutation, just the variations you expect to be used. When I used this in a game prototype I had the default clause emit warnings so that I knew when I'd tripped over a render state that I had not yet marked for optimization.

Comments