Testing 3 Approaches for Optimizing the Performance of C++ Apps: LTO, PGO, and Unity Builds
One of the great things about C++ is that it allows you to achieve near-optimal performance for a wide variety of tasks. Indeed, it has remained prominent, even with so many newer languages available, in part because of the performance it can deliver.
Most of the time, however, the full potential of C++ compilers remains unused. By default, CMake and other build systems only set the -O compiler flags to adjust the compiler optimization level: -O0 for debug builds and -O3 (/O2 for MSVC) for release builds.
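For reference, here’s a minimal, hypothetical CMake project (not part of the clangd setup used later in this article) that prints the per-configuration flags your toolchain gets by default:

```cmake
# Minimal sketch (hypothetical project): print the default
# per-configuration C++ flags.  With GCC/Clang these are typically
# "-g" for Debug and "-O3 -DNDEBUG" for Release; with MSVC,
# "/Od ..." and "/O2 ..." respectively.
cmake_minimum_required(VERSION 3.16)
project(flags_demo CXX)

message(STATUS "Debug flags:   ${CMAKE_CXX_FLAGS_DEBUG}")
message(STATUS "Release flags: ${CMAKE_CXX_FLAGS_RELEASE}")
```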
In most cases, this is more than enough. When you want to achieve the best possible performance, however, you need to take additional steps. Of course, the first step for any developer is to optimize the code and algorithms used, but there’s always a point where further improvement through code adjustment is not possible. When you reach this point, you can try to use compiler optimizations to get an extra performance improvement of 10-15% (or even more) without having to manipulate any code.
Approaches and test preparations
Several kinds of flags can be used to apply further optimizations, and they are more or less the same for all the major compilers:
- LTO – link-time optimization, which receives extensive treatment in this description from LLVM.
- PGO – profile-guided optimization, which is a multi-step optimization for when you have a good understanding of the common use-cases for your app.
- Unity (Jumbo) builds – builds that automatically combine your source files into bigger translation units, though they usually don’t work without special tweaks.
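Note that CMake has portable, compiler-agnostic switches for two of these three approaches. A sketch, assuming CMake 3.16 or newer and a hypothetical project, might look like this (PGO has no built-in CMake abstraction and is driven by raw compiler flags):

```cmake
# Sketch, assuming CMake >= 3.16: built-in switches for LTO and
# Unity builds (PGO must be configured via compiler flags directly).
cmake_minimum_required(VERSION 3.16)
project(opt_demo CXX)

# LTO: ask CMake whether interprocedural optimization is supported,
# then enable it for all subsequently created targets.
include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg)
if(ipo_ok)
  set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
endif()

# Unity build: concatenate sources into bigger translation units.
set(CMAKE_UNITY_BUILD TRUE)
set(CMAKE_UNITY_BUILD_BATCH_SIZE 8)
```

In this article we pass the equivalent flags on the CMake command line instead, which keeps the project files untouched.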
Let’s take a look at all of them and try them out with all of the major compilers in CLion. We will use Windows because it has both GCC (in the form of MinGW or Cygwin) and Clang; on macOS and Linux you can use the same flags with these compilers.
Here’s the full list of the compilers we will be working with:
- The latest version of the Microsoft Visual C++ compiler (MSVC)
- MinGW GCC
- Clang-cl (MSVC toolchain)
- MinGW Clang
LTO also depends on the linker used.
PGO may depend on the standard library used (libc++ or libstdc++), but we will try to use the most common combinations of compilers and linkers.
Regardless of the compiler used, PGO builds always consist of 3 steps:
- Performing the first build, which generates profiling information when you run the test/executable.
- Running the test/executable and, depending on the compiler, merging the profiling results.
- Building again using the merged profiling results.
As for the Unity builds, we’ll return to them at the end of this article.
We won’t test the Intel compiler here. However, it is now based on LLVM code, which means you should be able to use the approach for Clang-cl on Windows or clang on other operating systems.
Sample project for the performance tests
As an example application for the optimization, we will use clangd, one of the tools upon which CLion’s code assistance engine is built. JetBrains develops its own LLVM fork, but all the steps shown here can be applied to upstream clangd as well.
We added 2 special unit tests for the purpose of this article – one for the performance measurements and another one for the profiling step of PGO. You may add something similar to your project or do your profiling and testing manually.
The performance test should do the most common things that the app being tested is used for. In a game, for example, this might require beating one of the monsters with the best loot, or in a trading app it might mean completing financial operations. In our case, here is what clangd does in typical CLion use cases:
- Take big source code files.
- Parse them.
- Wait for diagnostics, highlighting, and the code completion cache.
- Wait for inlay hints.
- Invoke code completion in different positions (let’s say 10 times for the purposes of our test).
For consistency, we will take the same big files from the LLVM source code for each test and perform all the listed actions on them. We’ll trigger this test 3 times in each case and calculate the average.
The test that is used to collect profiling information should perform exactly the same actions. If your app always works with the same data (for example, if everybody beats the same game) you can reuse the same test for profiling. In other cases – if you don’t know which data a user of your app might have – it’s good to use a different set of data for profiling.
With clangd, we don’t know in advance which source files are going to be opened by our users, so let’s run profiling on different sets of source files.
The comparison
- MSVC from Visual Studio 2022
- The default release build, which is the build that most people use
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT
The average time for the performance test was 73.6 seconds.
- With LTO enabled
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_SHARED_LINKER_FLAGS_RELEASE="/LTCG" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="/LTCG"
The average time was 55.3 seconds.
- With PGO and LTO enabled. The MSVC documentation recommends always using LTO when building with PGO.
CMake options for profiling: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_SHARED_LINKER_FLAGS_RELEASE="/LTCG" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="/LTCG /FASTGENPROFILE" -DCMAKE_C_FLAGS="/GL" -DCMAKE_CXX_FLAGS="/GL"
This build required us to copy pgort140.dll to the location of the executable used for profiling.
The command to merge the data: pgomgr /merge <*.pgc files> result.pgd
CMake options for the final build:
-G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_SHARED_LINKER_FLAGS_RELEASE="/LTCG" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="/LTCG /USEPROFILE:PGD=<path to .pgd file>" -DCMAKE_C_FLAGS="/GL" -DCMAKE_CXX_FLAGS="/GL"
The average time was 61.6 seconds.
So, the performance actually got worse after profiling. More source files in the profiling set could probably yield better results, but for the sake of our experiment we’re going to use the same set of profiling data for all compilers. Let’s see what happened with the other compilers.
- MinGW. The version used here is from MSYS2. MinGW was installed via pacman -S mingw-w64-x86_64-toolchain, and the compiler used was GCC 11.2.0.
- The default build
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86
The average time was 56.7 seconds.
- With LTO enabled
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_C_FLAGS="-flto" -DCMAKE_CXX_FLAGS="-flto"
The average time was 55.0 seconds.
- With only PGO enabled
CMake options for profiling: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_C_FLAGS="-fprofile-generate=<path>" -DCMAKE_CXX_FLAGS="-fprofile-generate=<path>"
GCC doesn’t require the profiling results to be merged; it simply reads all of the profile files from the folder.
CMake options for the final build: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_C_FLAGS="-fprofile-use=<path>" -DCMAKE_CXX_FLAGS="-fprofile-use=<path>"
The average time was 60.3 seconds.
- With PGO and LTO enabled
The profiling build was taken from the previous step.
CMake options for the final build: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_C_FLAGS="-flto -fprofile-use=<path>" -DCMAKE_CXX_FLAGS="-flto -fprofile-use=<path>"
The average time was 56.1 seconds.
This was almost the same as with only LTO enabled.
- Clang-cl. MSVC 2022 + clang-cl from LLVM-12 (official installer)
- The default build
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_LINKER=lld-link
The average time was 70.8 seconds.
- With LTO enabled
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_LINKER=lld-link -DCMAKE_C_FLAGS="-flto=thin" -DCMAKE_CXX_FLAGS="-flto=thin"
The average time was 63.2 seconds.
- With only PGO enabled
CMake options for profiling: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_LINKER=lld-link -DCMAKE_C_FLAGS="/clang:-fprofile-instr-generate=<path>/code-%p%m.profraw" -DCMAKE_CXX_FLAGS="/clang:-fprofile-instr-generate=<path>/code-%p%m.profraw"
The command to merge the data:
llvm-profdata merge -output=<profile_dir>\code.profdata <profile_dir>\*.profraw
CMake options for the final build:
-G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_LINKER=lld-link -DCMAKE_C_FLAGS="-Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date /clang:-fprofile-instr-use=<profile_dir>/code.profdata" -DCMAKE_CXX_FLAGS="-Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date /clang:-fprofile-instr-use=<profile_dir>/code.profdata"
The average time was 49.6 seconds.
- With both PGO and LTO enabled
The profiling build was again taken from the previous step.
CMake options for the final build: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_LINKER=lld-link -DCMAKE_C_FLAGS="-Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -flto=thin /clang:-fprofile-instr-use=<profile_dir>/code.profdata" -DCMAKE_CXX_FLAGS="-Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -flto=thin /clang:-fprofile-instr-use=<profile_dir>/code.profdata"
The average time was 46.1 seconds.
This time, the default build was quite slow, but the optimizations really boosted performance. This suggests that the amount of profiling data we’re using in this experiment is sufficient to yield good results with clang-cl.
- Clang with MinGW. The version used here was from MSYS2 and was installed via pacman -S mingw-w64-clang-x86_64-toolchain. The compiler was Clang 13.0.1, and libc++ was used as the standard library.
- The default build
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DHAVE_CXX_ATOMICS_WITHOUT_LIB=1
The option for the atomics is required for LLVM so that CMake runs without errors.
The average time was 50.1 seconds.
- With LTO enabled
CMake options: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DHAVE_CXX_ATOMICS_WITHOUT_LIB=1 -DCMAKE_C_FLAGS="-flto=thin" -DCMAKE_CXX_FLAGS="-flto=thin"
The average time was 49.1 seconds.
- With only PGO enabled
CMake options for profiling: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DHAVE_CXX_ATOMICS_WITHOUT_LIB=1 -DCMAKE_C_FLAGS="-fprofile-instr-generate=<path>/code-%p%m.profraw" -DCMAKE_CXX_FLAGS="-fprofile-instr-generate=<path>/code-%p%m.profraw"
The command to merge the data:
llvm-profdata merge -output=<profile_dir>\code.profdata <profile_dir>\*.profraw
CMake options for the final build:
-G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DHAVE_CXX_ATOMICS_WITHOUT_LIB=1 -DCMAKE_C_FLAGS="-fprofile-instr-use=<profile_dir>/code.profdata" -DCMAKE_CXX_FLAGS="-fprofile-instr-use=<profile_dir>/code.profdata"
The average time was 46.1 seconds.
- With both PGO and LTO enabled
Once again the profiling build was taken from the previous step.
CMake options for the final build: -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD=X86 -DHAVE_CXX_ATOMICS_WITHOUT_LIB=1 -DCMAKE_C_FLAGS="-flto=thin -fprofile-instr-use=<profile_dir>/code.profdata" -DCMAKE_CXX_FLAGS="-flto=thin -fprofile-instr-use=<profile_dir>/code.profdata"
The average time was 41.4 seconds.
These results were even better than with clang-cl! It seems that this build was the best so far.
Of course the results for your application may be different, and a different compiler may deliver the best performance.
The effect of PGO is hard to predict, so your results may vary depending on how well you picked your training data and how much of it you have.
LTO and PGO results (average test time in seconds)

|             | MSVC 2022 | GCC (MinGW) | Clang-cl | Clang (MinGW) |
|-------------|-----------|-------------|----------|---------------|
| Default     | 73.6      | 56.7        | 70.8     | 50.1          |
| LTO         | 55.3      | 55.0        | 63.2     | 49.1          |
| PGO         | –         | 60.3        | 49.6     | 46.1          |
| LTO and PGO | 61.6      | 56.1        | 46.1     | 41.4          |
Unity builds
There’s one more approach, the Unity (Jumbo) build, but it is significantly harder to utilize.
This approach is difficult primarily because it requires you to modify the source code and maintain extra build infrastructure. Let’s take the best build combination we’ve found so far (the MinGW Clang build, in our case) and compare the performance results to see whether we can further improve the performance of clangd.
A suggestion to give Unity builds a try was raised in the LLVM community, but the proposal didn’t get enough support and was never implemented. Without upstream support, a Unity build of LLVM simply doesn’t compile, so for our test we had to modify some CMake and source files. After making these changes, we enabled the Unity build in most of the LLVM modules.
Most targets received these CMake lines:
set_target_properties(${name} PROPERTIES
  UNITY_BUILD TRUE
  UNITY_BUILD_MODE BATCH
  UNITY_BUILD_BATCH_SIZE ${batch_size})
Targets that were the least compatible with the Unity build were marked with:
set_target_properties(${name} PROPERTIES UNITY_BUILD FALSE)
And some targets mostly built fine, with just a few problem files. Those files got:
set_source_files_properties(${src_files} PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON)
Some files also required explicit qualification of their types so that ambiguities could be resolved during the build.
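To see why such tweaks are needed, consider a contrived two-file example (not taken from the LLVM sources): each file compiles fine on its own, but once a Unity build concatenates them into a single translation unit, their internal helpers collide.

```shell
# Two hypothetical files, each with an internal helper of the same name.
cat > part1.cpp <<'EOF'
namespace { int helper() { return 1; } }
int f1() { return helper(); }
EOF
cat > part2.cpp <<'EOF'
namespace { int helper() { return 2; } }
int f2() { return helper(); }
EOF

# As separate translation units, both compile without issues.
g++ -c part1.cpp part2.cpp && echo "separate TUs: OK"

# A Unity build effectively performs this concatenation; now the two
# unnamed-namespace helpers live in one TU and clash.
cat part1.cpp part2.cpp > unity.cpp
g++ -c unity.cpp 2>/dev/null || echo "unity TU: redefinition error"
```

Renaming one of the helpers, or excluding one file via SKIP_UNITY_BUILD_INCLUSION, resolves the conflict.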
Unity build results
- Default clang build with a batch size of 4
The average time was 50.6 seconds.
- Default clang build with a batch size of 8
The average time was 49.1 seconds.
- clang build with both PGO and LTO, and a batch size of 8
The average time was 40.3 seconds.
The results seem to be a bit better than without the Unity build, but the difference is not significant.
What was a bit more noticeable, however, was the build time. On our TeamCity Windows agent, the full build took 45 minutes instead of the usual 55 minutes.
Conclusions
- LTO provides a performance boost for all the compilers. With small projects, this boost probably wouldn’t be noticeable, but for big ones this option definitely makes a difference.
- PGO is tricky and sometimes may even decrease the performance if you don’t have enough training data or if the data you do have isn’t very good. But if you carefully pick your data and compare different compilers, you can get a performance boost of more than 40% compared with the default build – without changing any code!
- Unity builds didn’t improve application performance much for us, but they did improve build times. Depending on your project, you may not notice any real change or you may see significant improvement for your build times.
Ultimately, the best option is not to stick to the first compiler you try but rather to experiment with a range of options. It’s never obvious which one will yield the best results. And it is very likely that at least one of the optimizations we’ve introduced here will lead to performance improvements for your project.
Please share your experience with build optimization in the comments! Let us know if we missed any approaches you use to optimize your apps’ performance or build times.