Optimizing away small classes

Given the great feedback I got on my Degrees and Radians post, and the discussions that followed regarding efficiency of the solution, I thought it would be fun to try and benchmark a few small changes and optimizations.


Disclaimer
I don’t claim to be any kind of expert when it comes to optimizations, nor do I expect the following observations to hold true on every platform or compiler. These are just things I observed from messing around on my machine on a Saturday morning. YMMV.

My setup is as follows:
MacBook Pro (late 2008 model), OS X 10.8.3, 2.53 GHz Core 2 Duo, 4 GB DDR3

zim:degrad $ clang++ --version
Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn)
Target: x86_64-apple-darwin12.3.0
Thread model: posix
zim:degrad $ 
Used with the following options:
zim:degrad $ clang++ -std=c++11 -Wall -Wpedantic -O4 ./degrad.cpp


Pass-by-value vs pass-by-const-reference

The idea behind this being that, while it’s idiomatic to pass by const reference whenever possible, with small objects like these, it’s better to pass by value. Because Degrees and Radians just wrap floats and have no vtable pointers, they (in theory) are able to fit into a single register, and therefore the optimizer can defer memory writes, and reduce data indirection.

// Non-member operators cannot carry a trailing const qualifier.
constexpr bool operator ==(const Radians& lhs, const Radians& rhs) {
    return lhs.getValue() == rhs.getValue();
}
// vs
constexpr bool operator ==(Radians lhs, Radians rhs) {
    return lhs.getValue() == rhs.getValue();
}
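
The "fits into a single register" premise can be checked at compile time. The sketch below uses a stand-in Radians class (the real degrad.h is not reproduced in this post, so the class body here is an assumption) and asserts the properties that should make pass-by-value cheap: the wrapper is the same size as a float, it is trivially copyable, and it has no virtual functions.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for the real class; the actual degrad.h is not shown in this post.
class Radians {
public:
    constexpr explicit Radians(float v) : value(v) {}
    constexpr float getValue() const { return value; }
private:
    float value;
};

// No vtable pointer and no padding: the wrapper is exactly one float wide.
static_assert(sizeof(Radians) == sizeof(float), "wrapper adds no storage");
// Trivially copyable types can be copied bit for bit.
static_assert(std::is_trivially_copyable<Radians>::value, "cheap to copy");
// Polymorphic classes carry a vtable pointer; this one should not.
static_assert(!std::is_polymorphic<Radians>::value, "no virtual functions");

// Free-function comparison taking both operands by value.
constexpr bool operator==(Radians lhs, Radians rhs) {
    return lhs.getValue() == rhs.getValue();
}
```

If any of these assertions failed for a given class, that would be a hint that pass-by-const-reference is the safer default for it.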

To benchmark this I whipped up a simple program that makes 3 billion calls to functions taking angle values and ran it a few times with each variant. The results were surprising:

#include "degrad.h"
#include <stdio.h>

// Simple benchmarking program.
int main() {
	Degrees d = 90_deg;
	Radians r = Radians(PI);
	for (int i = 0; i < 1000000000; ++i) {
		d += r;
		r = -r;
	}
	// Print the result so the optimizer can't remove unused variables.
	printf("%f\n", d.getValue());
	printf("%f\n", r.getValue());
}
zim:degrad $ clang++ -std=c++11 -Wall -Wpedantic -O4 -DBY_REF ./degrad.cpp
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.200s
user	0m1.192s
sys	0m0.003s
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.202s
user	0m1.194s
sys	0m0.003s
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.207s
user	0m1.193s
sys	0m0.003s
zim:degrad $ clang++ -std=c++11 -Wall -Wpedantic -O4 -DBY_VAL ./degrad.cpp
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.199s
user	0m1.193s
sys	0m0.003s
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.199s
user	0m1.192s
sys	0m0.003s
zim:degrad $ time ./a.out 
90.000000
3.141593
real	0m1.200s
user	0m1.192s
sys	0m0.003s
zim:degrad $ 
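
The -DBY_REF and -DBY_VAL flags in the commands above select between the two parameter-passing styles. The header itself is not reproduced in this post, so the following is only a guess at how such a switch could be wired up: a hypothetical RadiansArg alias picks the parameter type, and each operator is written once against it.

```cpp
#include <cassert>

// Stand-in for the real class; the actual degrad.h is not shown in this post.
class Radians {
public:
    constexpr explicit Radians(float v) : value(v) {}
    constexpr float getValue() const { return value; }
private:
    float value;
};

// Hypothetical switch: -DBY_VAL passes by value,
// anything else (including -DBY_REF) passes by const reference.
#if defined(BY_VAL)
typedef Radians RadiansArg;
#else
typedef const Radians& RadiansArg;
#endif

// Each operator is written once; only the alias changes between builds.
constexpr bool operator==(RadiansArg lhs, RadiansArg rhs) {
    return lhs.getValue() == rhs.getValue();
}

constexpr Radians operator+(RadiansArg lhs, RadiansArg rhs) {
    return Radians(lhs.getValue() + rhs.getValue());
}
```

Compiling the same file once with each flag would then produce the two variants being compared, without duplicating any operator bodies.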

Here we see very consistent performance across both versions. This indicates that either there's no real benefit to pass-by-value for small objects, or the optimizer is rewriting our code a bit. Looking at the LLVM IR (it's a bit easier to read than the assembly), things become more obvious:

Pass by reference inner loop:

; <label>:1                                       ; preds = %1, %0
  %i.03 = phi i32 [ 0, %0 ], [ %7, %1 ]
  %2 = phi float [ 9.000000e+01, %0 ], [ %5, %1 ]
  %3 = phi float [ 0x400921FB60000000, %0 ], [ %6, %1 ]
  %4 = fmul float %3, 0x404CA5DC00000000
  %5 = fadd float %4, %2 
  %6 = fsub float -0.000000e+00, %3 
  %7 = add nsw i32 %i.03, 1
  %exitcond = icmp eq i32 %7, 1000000000
  br i1 %exitcond, label %8, label %1 

Pass by value inner loop:

; <label>:1                                       ; preds = %1, %0
  %i.06 = phi i32 [ 0, %0 ], [ %6, %1 ]
  %2 = phi float [ 9.000000e+01, %0 ], [ %4, %1 ]
  %tmp345 = phi float [ 0x400921FB60000000, %0 ], [ %5, %1 ]
  %3 = fmul float %tmp345, 0x404CA5DC00000000
  %4 = fadd float %2, %3 
  %5 = fsub float -0.000000e+00, %tmp345
  %6 = add nsw i32 %i.06, 1
  %exitcond = icmp eq i32 %6, 1000000000
  br i1 %exitcond, label %7, label %1 

Aha! Clang's optimizer is smart enough that it not only disregards whether we pass by const reference or by value, it also inlines all the operations and reduces them to raw float math. The two loops differ only in register names and in the operand order of the commutative fadd. Indeed, if we do the same computations with just raw floats we get effectively the same IR:

#include "degrad.h"
#include <stdio.h>

int main() {
	float d = 90.0;
	float r = PI;
	for (int i = 0; i < 1000000000; ++i) {
		d += r * RAD2DEG;
		r = -r;
	}
	printf("%f", d);
	printf("%f", r);
}

Float inner loop:

; <label>:1                                       ; preds = %1, %0
  %i.03 = phi i32 [ 0, %0 ], [ %5, %1 ]
  %d.02 = phi float [ 9.000000e+01, %0 ], [ %3, %1 ]
  %r.01 = phi float [ 0x400921FB60000000, %0 ], [ %4, %1 ]
  %2 = fmul float %r.01, 0x404CA5DC00000000
  %3 = fadd float %d.02, %2 
  %4 = fsub float -0.000000e+00, %r.01
  %5 = add nsw i32 %i.03, 1
  %exitcond = icmp eq i32 %5, 1000000000
  br i1 %exitcond, label %6, label %1 

Conclusions:
It doesn't matter. Because the classes are so simple, Clang just replaces them with raw floats which are passed by value. Given that it has no impact on efficiency, I'd say it's best to stick with the idiomatic pass by const reference for consistency's sake.
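
The same equivalence can be checked without reading IR. The sketch below runs the benchmark loop once with wrapper classes and once with raw floats so the results can be compared. The class bodies, the definitions of PI and RAD2DEG, and the helper names run_wrapped and run_raw are all assumptions, since degrad.h is not reproduced in this post.

```cpp
#include <cassert>

// Assumed definitions; the post uses PI and RAD2DEG without showing them.
constexpr float PI = 3.14159265358979f;
constexpr float RAD2DEG = 180.0f / PI;

class Radians {
public:
    constexpr explicit Radians(float v) : value(v) {}
    constexpr float getValue() const { return value; }
    constexpr Radians operator-() const { return Radians(-value); }
private:
    float value;
};

class Degrees {
public:
    constexpr explicit Degrees(float v) : value(v) {}
    // Implicit Radians-to-Degrees conversion, as the post describes.
    constexpr Degrees(const Radians& r) : value(r.getValue() * RAD2DEG) {}
    constexpr float getValue() const { return value; }
    Degrees& operator+=(const Degrees& rhs) { value += rhs.value; return *this; }
private:
    float value;
};

// The benchmark loop written with the wrapper classes.
float run_wrapped(int iterations) {
    Degrees d(90.0f);
    Radians r(PI);
    for (int i = 0; i < iterations; ++i) {
        d += r;   // converts r to Degrees, then adds
        r = -r;
    }
    return d.getValue();
}

// The same loop written with raw floats.
float run_raw(int iterations) {
    float d = 90.0f;
    float r = PI;
    for (int i = 0; i < iterations; ++i) {
        d += r * RAD2DEG;
        r = -r;
    }
    return d;
}
```

The two functions should agree to within floating-point rounding. On targets where the compiler fuses the multiply and add in one form but not the other, the last bit may differ, so a comparison with a small tolerance is safer than exact equality.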


In-class operators or out-of-class operators?

There was also some discussion on whether the non-unary operators should exist outside the class definition or not. Herb Sutter, a prominent C++ expert and author of many books on C++ style, says on his blog that you should define them outside the class definition. The rationale is that when an operator is defined in the class definition, its left-hand operand is immune to implicit conversions, while defining the operator outside the class allows implicit conversions on both operands. In this case we've defined operators in the class for both types as follows:

class Radians {
    // ...
    constexpr Radians operator +(const Radians& rhs) const {
        return Radians(value + rhs.value);
    }
    // ...
};

class Degrees {
    // ...
    constexpr Degrees operator +(const Degrees& rhs) const {
        return Degrees(value + rhs.value);
    }
    // ...
};

This sidesteps the implicit conversion problem Herb describes, because we want implicit conversions only between Radians and Degrees, and have defined operator+ in both classes. If we instead wanted to allow floats to implicitly convert to Degrees, we would need to define the operator outside the class so that Degrees d = 90.0f + 90_deg; compiles and does what we expect. In our case, though, we intentionally forbid float-to-angle implicit conversions, so the change makes no practical difference today. It's still probably best to define the operator outside the class, in case we want implicit conversions with other types later. It is also the more idiomatic form, which makes the code easier to understand.
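
To illustrate the asymmetry, here is a minimal sketch using two hypothetical stand-in classes, InClass and OutOfClass. Unlike the real Degrees class, both deliberately accept a float implicitly. Only the version with a non-member operator+ accepts a float on the left-hand side.

```cpp
#include <cassert>

// Hypothetical type with an implicit float constructor and a member operator+.
class InClass {
public:
    constexpr InClass(float v) : value(v) {}   // implicit on purpose
    constexpr float getValue() const { return value; }
    constexpr InClass operator+(const InClass& rhs) const {
        return InClass(value + rhs.value);
    }
private:
    float value;
};

// Same idea, but operator+ is a non-member function.
class OutOfClass {
public:
    constexpr OutOfClass(float v) : value(v) {}   // implicit on purpose
    constexpr float getValue() const { return value; }
private:
    float value;
};

constexpr OutOfClass operator+(const OutOfClass& lhs, const OutOfClass& rhs) {
    return OutOfClass(lhs.getValue() + rhs.getValue());
}

// A float on the right-hand side converts in both designs.
constexpr InClass a = InClass(90.0f) + 90.0f;
constexpr OutOfClass b = OutOfClass(90.0f) + 90.0f;

// A float on the left-hand side converts only with the non-member operator.
constexpr OutOfClass c = 90.0f + OutOfClass(90.0f);
// InClass d = 90.0f + InClass(90.0f);   // does not compile: the left operand is never converted
```

With explicit constructors, as in the classes from this post, neither form would accept a bare float, which is why the choice has no visible effect there.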


Conclusions

While it's probably true for some compilers that the manner in which the Degrees or Radians classes are passed matters, and that their mere usage causes a performance hit, this is not true for Clang. Clang's excellent optimizer can completely replace all occurrences of these classes with raw floats and float operations. The Degrees and Radians classes in this situation then become just compile-time constructs that generate the required conversion code where it's needed. This really shows this solution's greatest benefit: the programmer doesn't need to think about doing the right thing; it just happens automatically. Also, it's free.