Fighting the C++ compiler one translation unit at a time

In recent times working with C++, compilation time has become an increasingly large concern to me. Every second spent waiting on the compiler is a second wasted, and when developing in quick iterations those seconds end up eating multiple minutes out of every development hour. In a small project I'm working on that involves a bit of template abuse, fewer than 10K lines of code were beginning to take upwards of ten seconds for a full build - and that was after going through the code, forward declaring where possible and moving as many includes from headers into source files as I could. Even small changes were costing a few seconds for a partial recompile. It isn't a great situation to be in, and it was only going to get worse as the project grew. I started looking at how a C++ program gets from source to binary, and how that knowledge can be used to ease the process.

C++ sucks at compiling

The hard truth of the matter is that the compilation process for C++ is a broken mess. The moment you include any STL container (or similar), any library using templates, anything remotely templatey, you're immediately engaging in a fight against the compiler.

I'm going to examine some small programs and their compilation times. All tests (unless otherwise specified) are run with g++ version 4.9 on an Intel Xeon E5-2650 cloud instance, and the timings are eyeballed for what looks "about average". They aren't definitively precise, and newer compilers give somewhat less extreme results than shown here, so following along at home is recommended.

#include <unordered_map>
#include <vector>

struct Foo1 {
    std::unordered_map<int, std::vector<float>> data;
};

int main()
{
    Foo1 f1;
    return 0;
}
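
If you want to follow along, timing the compiler invocation with the shell's time builtin gives numbers in the same ballpark - this isn't a rigorous benchmark, just a quick way to eyeball things:

# time a combined compile and link of the snippet above
time g++ --std=c++11 main.cpp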

This snippet takes around 0.35 seconds on my machine. unordered_map is a pretty heavy structure, but a useful one. I'd call that cost worthwhile. Here's what's happening when we include a templated type in our application: when the compiler gets to a declaration of a type such as std::vector<float>, it instantiates that class template with the template argument inserted. It pastes the type float wherever the template parameter is used in that class, performing some type checking along the way. It does this for each distinct set of template arguments: declaring a std::vector<float> and then a std::vector<int> builds two versions of the vector, each with the given type inserted.
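As a quick aside (this snippet is separate from the Foo1 example, just an illustration), the two declarations below trigger two separate instantiations, and member functions are only instantiated as they're actually used:

#include <vector>

int main()
{
    std::vector<float> floats;  // instantiates std::vector<float>
    std::vector<int> ints;      // instantiates std::vector<int> - a second, separate class
    floats.push_back(1.0f);     // push_back is instantiated for the float version on first use
    ints.push_back(1);          // and separately for the int version
    return 0;
}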

Following on from the previous example, I want to move Foo1 to its own file. We'll refactor like so:

// foo1.h
#pragma once

#include <unordered_map>
#include <vector>

struct Foo1 {
    std::unordered_map<int, std::vector<float>> data;
};

// foo1.cpp
#include "foo1.h"

// main.cpp
#include "foo1.h"

int main()
{
    Foo1 f1;
    return 0;
}

Maybe eventually I'll write a constructor, a destructor or some member functions that make foo1.cpp worth compiling, but leaving it empty works for this example. Our build command for this application will need to compile both main.cpp and foo1.cpp, like so:

g++ --std=c++11 main.cpp foo1.cpp

And the compile time here is up to 0.75-0.8 seconds. A pretty sharp increase for what essentially boils down to moving a class to a different file. To see how this scales, we can copy and paste Foo1 a few times to make foo1-5.h/cpp, include them all in main.cpp and adjust the build accordingly, ending up with this:

// main.cpp
#include "foo1.h"
#include "foo2.h"
#include "foo3.h"
#include "foo4.h"
#include "foo5.h"

int main()
{
    Foo1 f1;
    Foo2 f2;
    Foo3 f3;
    Foo4 f4;
    Foo5 f5;
}
# build command
g++ --std=c++11 main.cpp foo1.cpp foo2.cpp foo3.cpp foo4.cpp foo5.cpp

Start compiling, and we end up with ... a 1.65 second compilation time. What went wrong? We've got a single declaration using the same types in five different places. Why does our compile time increase linearly with each file the declaration appears in? This is where we get to the whole "C++ sucks at compiling" side of things.

What is the compiler actually doing?

When we send a source file to the compiler, that source file becomes part of what is called a translation unit - all include and preprocessor directives involved in that source file (declared in the file itself, and through the expansion of headers) are evaluated. In our first example, where we just built main.cpp, the compiler receives main.cpp, sees the includes for unordered_map and vector and evaluates their content. It gets to our data declaration, sees that it needs to instantiate a specific version of each template type, and does so.

Three things are happening in the build we've been using:

  • Preprocessing, where things like header includes are expanded and macros are pasted.
  • Compilation, where the source code is boiled down to machine instructions and references to any external dependencies.
  • Linking, where those machine instructions are turned into the final binary output and any external dependencies are resolved.

Take the following commands:

# Combined compile and link
g++ --std=c++11 main.cpp

# Compile and link separately
g++ --std=c++11 -c main.cpp
g++ --std=c++11 main.o

Both sets of commands end up producing a binary file (by default named a.out with GCC). In the second set of commands, when we send a source file with the -c flag, we tell the compiler to compile only - don't invoke the linker - and to output an object file (*.o) containing the machine code and external references from main.cpp. The second command then creates the final binary from that object file.

The reason to mention this is that most projects compile to object files because it is useful when making changes across multiple files. When we look at our fooX.cpp example, the current build is compiling our five foos (plus main) separately, every time. If we were to introduce a build system to the project (such as Make) and output to objects, the build system would be able to tell whether any of the source files have changed since they were last compiled and decide which ones to recompile. This means we could make a change to foo3.cpp, and when invoking our build only take the cost of recompiling that one file instead of all six.
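
As a sketch of what that looks like for the foo example - a deliberately minimal Makefile that ignores header dependencies - it might be something along the lines of:

# minimal Makefile: only out-of-date objects are recompiled, then everything is relinked
CXX = g++
CXXFLAGS = --std=c++11
OBJECTS = main.o foo1.o foo2.o foo3.o foo4.o foo5.o

a.out: $(OBJECTS)
	$(CXX) $(OBJECTS) -o a.out

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

A real build would also want to track which headers each source file pulls in (g++ can generate that information with -MMD), but this is enough to show the idea.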

The problem

This brings us to the cost of templates in traditional C++ compilation. Let's be the compiler, running through main.cpp and our foos. We're in the translation unit for main.cpp. We hit the foo1.h include and expand it, to find our std::unordered_map<int, std::vector<float>>. We fill in those template types. After that we move on to expand foo2.h. We see the template types again, but we know we've already compiled those types before - great! We can expand all of our foos and only compile each template type once. Those definitions are put into main.o, ready for later use.

Then we move on to the translation unit for foo1.cpp. We expand foo1.h, hit our std::unordered_map<int, std::vector<float>> and ... we already know about this type, right?

Well ... no. This is where it all falls apart. Each translation unit is totally independent, meaning the translation unit for foo1.cpp retains no knowledge of what happened in the main.cpp translation unit. So we look at the template type in foo1.h, and recompile that template type.

We move on to the foo2.cpp translation unit. We recompile the same template types again. On to foo3.cpp, we recompile the same template types again. And for foo4.cpp, again. And for foo5.cpp, again.

We've got six object files now, one for each source file, and pass them over to the linker to produce our final binary. The linker is going to see six versions of std::vector and std::unordered_map. What's it going to do with all those redundant definitions? It throws them away. Most of the 1.65 seconds we spent waiting went into work the linker discards, keeping roughly what the original 0.35 second build produced. Our time, simply discarded.

The workaround

I'd like to present "the solution", but I feel like that's giving it too much credit. The system I've found works for me is a Single Translation Unit (STU) build, otherwise known as a Unity build. This sends a single source file to the compiler, and the entire program is compiled in a single translation unit. Our main.cpp includes change to this:

#include "foo1.cpp"
#include "foo2.cpp"
#include "foo3.cpp"
#include "foo4.cpp"
#include "foo5.cpp"

When we build, we only need to hand the compiler main.cpp. With no other changes to the source, compiling this brings us down to a 0.35 second compilation time - the same as when we had Foo1 declared in main.cpp. By compiling everything in a single translation unit, the compiler is guaranteed to compile each template type only once. It creates no redundant definitions that take precious time and resources to compile, only for the linker to discard them. The compiler does only what it needs to do.
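
The build command shrinks to match - assuming the same flags as before, the whole program builds with:

g++ --std=c++11 main.cpp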

The disadvantage of this is, of course, that everything is done in a single translation unit - every build is a full build. We lose the ability to rebuild only the parts of the source which have changed, as we could when compiling each source file to an object file. On the surface this sounds somewhat bad, but in my experience the cost of a total recompilation isn't as high as that of some partial recompilations. Based on the numbers we're currently seeing, a project declaring a large number of Foos in a single translation unit would still compile more quickly than a partial rebuild affecting more than one source file. There may be a project size where the advantages of an STU build begin to degrade, but I wouldn't expect to hit it for a long time. In a project where the same template types are used even moderately throughout, the STU build cuts out a lot of extra effort the compiler would otherwise put in.

There are numerous reasons beyond redundant template definitions to use an STU build - other redundancies are caught that make it worthwhile even when working in a C project, or a C++ project without templates. However, the STU build is something most vocally championed by C programmers, and I haven't seen much mention of the seemingly greater advantage it brings to the table when dealing with C++ involving templates.

A real world example

I recently tried out an STU build on a project I'm working on. It's not a big project - fewer than 10K lines of code across 35 source files - but the template use (and abuse) going on made compilation time increase pretty quickly, and it was clear that it was only going to get worse as time went on. Even after meticulously going through the code, including only necessary headers in source files and forward declaring where possible, compilation still took a long time - 18 seconds, actually. Insanity! I didn't try precompiled headers since they feel like a complication I don't want to deal with, but your mileage may vary. This test and all that follow are with an i5-2500K at around 4GHz, with clang 3.8.

Note: these times are genuinely real world, so they also include linking the external dependencies the project uses, which pushes the given times beyond what measuring compilation alone would show.

One thing I didn't touch on when mentioning build systems is that many of them allow for concurrent jobs - make takes a -j flag specifying how many parallel jobs to run, meaning multiple translation units can be compiled at the same time. You're still going to be burning CPU power to end up with redundant definitions, but it cuts down wall-clock time significantly. I've seen arguments that spawning jobs for multiple translation units concurrently negates the advantage of an STU build. So let's take a look:

  • Full build with make -j8: 4.8 seconds
  • STU build: 2.4 seconds

Even at a small project size, a parallel build (-j8 specifies eight concurrent jobs) still takes twice as long as an STU build. Furthermore, the traditional build will only generate more redundant code as the project grows, and that time gap would widen.

The big lingering question is how long an incremental build takes under the traditional system. I picked my render_manager_3d.h/cpp files to alter, since a renderer is something reasonably likely to be iterated on often. After a full build, partial recompilation takes the following times when modifying each file separately:

  • render_manager_3d.cpp: 1.2 seconds
  • render_manager_3d.h: 1.75 seconds

Single file recompilation still beats the STU build, as expected. And in contrast to the parallel make example, this gap will widen in favour of the partial recompilation, with the STU build taking longer as the project grows.

At this point, the effects of changes are arbitrary and project-dependent, making things hard to measure and the results less meaningful. When I change two renderer headers, the compilation is back up to 2.3 seconds, matching the full STU build, but this number grows or shrinks depending on the scope of the changes made. The conclusion I'm happy to make is that although an STU build is not a perfect solution, it provides consistent and relatively short compile times that often beat those seen with traditional compilation. I'm going to stick with STU builds unless reasonably convinced by some other factor later down the line.

What was the cost of converting my existing project with 35 source files to an STU build? A few minutes of work. Rather than go through the project converting includes so that source files are included instead of headers, as some choose to do, I simply made a file called stu.cpp, included every source file I want to build, and sent that to the compiler. Extremely low effort, and it caused me no other problems. It also means I'm not stuck with the STU build should I change my mind in future, and can switch back to a traditional build with no friction.
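
For a sense of what that file looks like, here's a sketch - render_manager_3d.cpp is the file mentioned earlier, while the other names are just placeholders standing in for the rest of the project's sources:

// stu.cpp - the only file handed to the compiler
#include "render_manager_3d.cpp"
#include "input_manager.cpp"
#include "main.cpp"
// ... one include per remaining source file in the project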

With that, I'd recommend giving it a try and seeing how it fares: it has made my build process less complicated, and given me a better understanding of what the compiler and linker are actually going through when I send source files their way.