The topics I chose for my presentation are the following optimization options. I'll be explaining the options with code examples and not assembler examples since I couldn't compile my code with the right compiler flags.
-floop-interchange
GCC Manual Description:
Perform loop interchange transformations on loops. Interchanging two nested loops switches the inner and outer loops. For example, given a loop like:
DO J = 1, M
DO I = 1, N
A(J, I) = A(J, I) * C
ENDDO
ENDDO
loop interchange transforms the loop as if it were written:
DO I = 1, N
DO J = 1, M
A(J, I) = A(J, I) * C
ENDDO
ENDDO
which can be beneficial when N
is larger than the caches, because in Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially creating at each access a cache miss. This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppland --with-cloog to enable the Graphite loop transformation infrastructure.
So after reading the manual and the wiki page, I went right into writing code, which developed into the following:
int main() {
int array[1000][1000];
for(int i=0; i<1000; i++){
for(int j=0; j<1000; j++){
array[j][i] = i*j;
}
}
}
When I tried to compile this with the -floop-interchange option, I saw no differences in the assembler code. Later I figured out that the compiler we're using is not configured with the --with-ppland option. So it made sense that there was no assembler to show for it.
After this I decided to just show in code what would change when the option is enabled, which resulted into the following:
int main() {
int array[1000][1000];
for(int j=0; j<1000; j++){
for(int i=0; i<1000; i++){
array[j][i] = i*j;
}
}
}
The only difference is that the loops have switched, so first it goes through j instead of i.
This happens because in C language it reads the cache in row-major order but the first int that increases is the column. It goes from [0][0] to [1][0] (See table below). So when it needs a value it gets the entire row for one value and the next row for the next value. This would be more effective if it got all the values from one row at once, this is why the loops are switched. It then counts from [0][0] to [0][1], so it can get 4 values from reading the cache once instead of one value per cache read.
-ftree-tail-merge
GCC Manual Description:
Look for identical code sequences. When found, replace one with a jump to the other. This optimization is known as tail merging or cross jumping. This flag is enabled by default at -O2 and higher. The compilation time in this pass can be limited using max-tail-merge-comparisons parameter and max-tail-merge-iterations parameter.
After researching this topic a lot I ended up writing the following code:
int main(int& a, int& b) {
int c = 0;
if(a > 100){
a += 100;
b += 200;
c = a + b;
} else {
a += 100;
b += 200;
c = a + b;
} return c;
}
As you can see, there's duplicate code in the if and else. In this simple example the developer should see that this can be changed, but to show you what -ftree-tail-merge does I kept it simple.
Sadly I couldn't get this to compile the right way either, because I probably missed some necessary flags.
After researching some more I found out that -tree-tail-merge and -fcrossjumping have the same output when it comes to the code, but the -ftree-tail-merge gets activated in a different intermediate language step during compiling which is called GIMPLE.
To show you what the difference in assembly code is, I used the -fcrossjumping flag because it gives the same results. I colour aligned the code and assembler to let you know where everything happens and what the difference is between the assembler code.
gcc -O
main(int&, int&):
movl (%rdi), %eax
cmpl $100, %eax
jle .L2
addl $100, %eax
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
.L2:
addl $300, %eax
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
gcc -O -fcrossjumping
main(int&, int&):
movl (%rdi), %eax
cmpl $100, %eax
jle .L2
addl $100, %eax
jmp .L4
.L2:
addl $300, %eax
.L4:
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
As you can see, it skips over the b and c operations in the if and jumps straight to L4 where the b and c operations are done. In code it would look like the following:
int main(int& a, int& b)
{
int c = 0;
if (a > 100) {
a += 100;
} else {
a += 300;
}
b += 200;
c = a + b;
return c;
}
Resources:
https://en.wikipedia.org/wiki/Loop_interchange
http://www.ucw.cz/~hubicka/papers/proj/node10.html
https://katecpp.wordpress.com/2015/09/19/cross-jumping/
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html