-floop-interchange
GCC Manual Description:
DO J = 1, M
DO I = 1, N
A(J, I) = A(J, I) * C
ENDDO
ENDDO
loop interchange transforms the loop as if it were written: DO I = 1, N
DO J = 1, M
A(J, I) = A(J, I) * C
ENDDO
ENDDO
which can be beneficial when N
is larger than the caches, because in Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially creating at each access a cache miss. This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppland --with-cloog to enable the Graphite loop transformation infrastructure.So after reading the manual and the wiki page, I went right into writing code, which developed into the following:
int main() { int array[1000][1000]; for(int i=0; i<1000; i++){ for(int j=0; j<1000; j++){ array[j][i] = i*j; } } }
After this I decided to just show in code what would change when the option is enabled, which resulted into the following:
int main() { int array[1000][1000]; for(int j=0; j<1000; j++){ for(int i=0; i<1000; i++){ array[j][i] = i*j; } } }
The only difference is that the loops have switched, so first it goes through j instead of i.
This happens because in C language it reads the cache in row-major order but the first int that increases is the column. It goes from [0][0] to [1][0] (See table below). So when it needs a value it gets the entire row for one value and the next row for the next value. This would be more effective if it got all the values from one row at once, this is why the loops are switched. It then counts from [0][0] to [0][1], so it can get 4 values from reading the cache once instead of one value per cache read.
-ftree-tail-merge
GCC Manual Description:
After researching this topic a lot I ended up writing the following code:
int main(int& a, int& b) { int c = 0; if(a > 100){ a += 100; b += 200; c = a + b; } else { a += 100; b += 200; c = a + b; } return c; }
As you can see, there's duplicate code in the if and else. In this simple example the developer should see that this can be changed, but to show you what -ftree-tail-merge does I kept it simple.
Sadly I couldn't get this to compile the right way either, because I probably missed some necessary flags.
After researching some more I found out that -tree-tail-merge and -fcrossjumping have the same output when it comes to the code, but the -ftree-tail-merge gets activated in a different intermediate language step during compiling which is called GIMPLE.
To show you what the difference in assembly code is, I used the -fcrossjumping flag because it gives the same results. I colour aligned the code and assembler to let you know where everything happens and what the difference is between the assembler code.
gcc -O
main(int&, int&):
movl (%rdi), %eax
cmpl $100, %eax
jle .L2
addl $100, %eax
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
.L2:
addl $300, %eax
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
gcc -O -fcrossjumping
main(int&, int&):
movl (%rdi), %eax
cmpl $100, %eax
jle .L2
addl $100, %eax
jmp .L4
.L2:
addl $300, %eax
.L4:
movl %eax, (%rdi)
movl (%rsi), %eax
addl $200, %eax
movl %eax, (%rsi)
addl (%rdi), %eax
ret
As you can see, it skips over the b and c operations in the if and jumps straight to L4 where the b and c operations are done. In code it would look like the following:
int main(int& a, int& b) { int c = 0; if (a > 100) { a += 100; } else { a += 300; } b += 200; c = a + b; return c; }
Resources:
https://en.wikipedia.org/wiki/Loop_interchange
http://www.ucw.cz/~hubicka/papers/proj/node10.html
https://katecpp.wordpress.com/2015/09/19/cross-jumping/
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
No comments:
Post a Comment