Yesterday, the team behind the GNU project announced Parallel GCC, a research project aiming to parallelize a real-world compiler. Parallel GCC is useful on machines with many cores, where GNU Make alone cannot provide enough parallelism, and can also serve as a case study for designing a parallel compiler from scratch.
GCC is an optimizing compiler that automatically optimizes code during compilation. GCC's optimization pipeline involves three stages:
- Inter Procedural Analysis (IPA): This builds a call graph and uses it to decide how optimizations should be performed.
- GIMPLE Intra Procedural Optimizations: This performs several hardware-independent optimizations within each function.
- RTL Intra Procedural Optimizations: This performs several hardware-dependent optimizations within each function.
Once IPA has collected information and decided how to optimize all functions, it sends each function to the GIMPLE optimizer, which in turn sends it to the RTL optimizer, after which the final code is generated. This process repeats for every function in the code.
Why a Parallel GCC?
The team designed the parallel architecture to increase parallelism and reduce overhead. After IPA finishes its analysis, a number of threads equal to the number of logical processors is spawned, to avoid scheduling overhead. One of those threads inserts all analyzed functions into a threadsafe producer-consumer queue, which all threads consume from. Once a thread has finished processing one function, it queries the queue for the next available function, until it finds an EMPTY token. When that happens, the thread finalizes, as there are no more functions to be processed.
This architecture is used to parallelize the per-function GIMPLE Intra Procedural Optimizations and can easily be extended to also support the RTL Intra Procedural Optimizations. It does not, however, cover the IPA passes or the per-language front-end analysis.
Code refactoring to achieve Parallel GCC
The team refactored several parts of the GCC middle-end code for the Parallel GCC project, and says there are still many places where code refactoring is necessary for the project to succeed.
“The original code required a single function to be optimized and outputted from GIMPLE to RTL without any possible change of what function is being compiled,” the researchers wrote in their official blog. Several structures in GCC were made per-thread or threadsafe: either replicated using the C11 thread-local notation, allocated on the thread's stack, or simply protected with locks.
“One of the most tedious parts of the job was detecting and making several global variables threadsafe, and they were the cause of most crashes in this project. Tools made for detecting data races, such as Helgrind and DRD, were useful in the beginning but then showed their limitations as the project advanced. Several race conditions had a small window and did not happen when the compiler ran inside these tools. Therefore there is a need for better tools to help find global variables or race conditions,” the blog mentions.
The team compiled the file gimple-match.c, the biggest file in the GCC project. This file has more than 100,000 lines of code, with around 1700 functions, and almost no loops inside these functions.
The computer used in this benchmark had an Intel(R) Core(TM) i5-8250U CPU with 8 GB of RAM. This CPU has 4 cores with Hyperthreading, resulting in 8 logical cores. The following are the results before and after Intra Procedural GIMPLE parallelization.
The figure shows the results before and after Intra Procedural GIMPLE parallelization. The elapsed time dropped from 7 seconds to around 4 seconds with 2 threads and to around 3 seconds with 4 threads, resulting in speedups of 1.72x and 2.52x, respectively. The figure also shows that using Hyperthreading did not impact the result. This result was used to estimate the improvement from RTL parallelization.
The above results show that, when compared with the total compilation time, there is a small improvement of around 10% when compiling the file.
In this figure, using the same approach as in the previous graph, a speedup of 1.61x can be estimated for GCC as a whole, based on the speedup information obtained in GIMPLE.
The team has suggested certain To-Dos for users wanting to implement parallel GCC:
- Find and fix all race conditions in GIMPLE. There are still random crashes when code is compiled using the parallel option.
- Make this GCC compile itself.
- Make this GCC pass all tests in the testsuite.
- Add multithreading support to the Garbage Collector.
- Parallelize the RTL part. This will improve the current results, as indicated in the Results chapter.
- Parallelize the IPA part. This can also improve times during LTO compilations.
- Refactor all occurrences of thread_local by allocating these variables as soon as threads are started, or at pass execution.
GCC project members say that this project is under development and still has several bugs.
A user on Hacker News writes, “I look forward to this. One that will be important for reproducible builds is having tests for non-determinism. Having nondeterministic code gen in a compiler is a source of frustration and despair and sucks to debug.”
To learn about Parallel GCC in detail, read the official document.