Tuesday, March 29, 2016

MD5deep focus area

My first thought for the focus area was to implement a way for md5deep to install correctly on aarch64, since I ran into trouble with that architecture. But the latest config files I had to download already fix that issue, and they were dated 2016-03-24, so the fix is pretty recent.

Since I'm more familiar with code than with all the configuration and make files, I decided to take a closer look at the source code.

MD5deep benchmarking on betty

Now that I'm done testing and benchmarking on my own x86_64 Fedora system, I need to test md5deep on an AArch64 system. I transferred the files I used in the x86_64 benchmark to the betty server using scp and sftp. Once the test files were transferred, I transferred the tar.gz file of md5deep so I could try to install it on betty. Here's where I ran into some issues...

The normal way to install md5deep is to unpack it and run the following commands:
sh bootstrap.sh
./configure
make
sudo make install
Sadly it stopped at the configure step: the bundled config scripts could not determine which build to use because they did not know aarch64. Luckily, the error message pointed me to a website with the most recent config files, which do contain aarch64 (ftp://ftp.gnu.org/pub/gnu/config/). So after some more use of sftp, the configure step finally worked and I could run the make and make install commands.
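The failure comes down to the bundled config.guess script not recognizing the machine's triplet. A quick way to see what the updated script has to map is to check the uname values configure starts from (a minimal sketch; the full triplet like aarch64-unknown-linux-gnu is what an up-to-date config.guess builds from these):

```shell
# configure builds its triplet (e.g. aarch64-unknown-linux-gnu) from
# values like these; an old config.guess simply has no entry for them.
arch=$(uname -m)   # machine hardware name, e.g. x86_64 or aarch64
os=$(uname -s)     # kernel name, e.g. Linux
echo "machine: $arch, kernel: $os"
```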

On to the benchmarking. Just like on x86_64, I ran the md5deep program 100 times with 3 different file sizes. The results are as follows:
10.5 MB file
real: 0m0.037s
user: 0m0.028s
sys: 0m0.001s
105 MB file
real: 0m0.345s
user: 0m0.323s
sys: 0m0.027s
1.5 GB file
real: 0m4.792s
user: 0m4.551s
sys: 0m0.400s
If you only have one file md5deep can hash it pretty quickly. Even if you hash entire directory structures you wouldn't have to wait too long.

MD5deep benchmarking on x86_64

I currently only have md5deep installed on my own Fedora workstation, where I first tested that I could get it working. Now that it works, I can benchmark it on my laptop; once that's done I will do the same on betty.

My Fedora specifications are:
Memory: 2.9 GiB
Processor: Intel Core i7-3630QM CPU @ 2.40GHz
Machine architecture: x86_64

To benchmark this program I need a few files of different sizes. With the following command I can create a file filled with completely random data and a size of approximately 100 MB.
dd if=/dev/random of=test100m bs=1M count=100 iflag=fullblock
After running this command for different file sizes, I ended up with a 105 MB test file and a 10.5 MB one.
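As a side note on the sizes: dd's bs=1M is a binary mebibyte, which is why asking for "100" blocks comes out as about 105 MB in decimal units. A quick sketch, using count=1 to keep it fast and /dev/urandom, which never blocks:

```shell
# bs=1M is 1048576 bytes, so count=100 gives 104857600 bytes (~105 MB
# decimal). Demonstrated with a single 1 MiB block here.
dd if=/dev/urandom of=testfile_1m bs=1M count=1 2>/dev/null
size=$(wc -c < testfile_1m | tr -d ' ')
echo "size: $size bytes"   # 1048576 bytes = 1 MiB, i.e. ~1.05 MB
rm -f testfile_1m
```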
By running the following command you can run the md5deep program and see how long it takes to complete.
time md5deep test100m
To get a presentable benchmark you have to run this command many times and take the average. Instead of manually running it 100 times, I wrote the following command to do it for me:
time for i in {1..100}; do md5deep test100m; done
The time you get is the total for all 100 runs, so the result has to be divided by 100 to get the average time of a single run. I got the following results:
10.5 MB file
real: 0m0.053s
user: 0m0.021s
sys: 0m0.003s
105 MB file
real: 0m0.287s
user: 0m0.200s
sys: 0m0.012s

After seeing that these files still get hashed pretty quickly, I decided to download a bigger file to see how long that would take. I downloaded the Fedora Workstation ISO, which is 1.5 GB, to test with.
1.5 GB file
real: 0m3.458s
user: 0m2.793s
sys: 0m0.184s
It still doesn't take very long to hash a 1.5 GB file, so this program is pretty fast.
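The whole run-many-times-and-divide procedure can also be scripted end to end. A minimal sketch, using md5sum as a stand-in for md5deep so it runs anywhere, with a smaller file and run count to keep it quick:

```shell
# Average hashing time over several runs; swap in md5deep and
# runs=100 to reproduce the benchmark above.
dd if=/dev/urandom of=bench_file bs=1M count=1 2>/dev/null

runs=10
start=$(date +%s%N)          # nanoseconds since the epoch (GNU date)
i=0
while [ "$i" -lt "$runs" ]; do
    md5sum bench_file > /dev/null
    i=$((i + 1))
done
end=$(date +%s%N)

# Total elapsed time divided by the number of runs, in milliseconds.
avg_ms=$(( (end - start) / runs / 1000000 ))
echo "average per run: ${avg_ms} ms"
rm -f bench_file
```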



Monday, March 28, 2016

MD5Deep Installation

The project I chose to optimize is called md5deep. It is a program that computes hashes for any number of files. Md5deep doesn't only hash files with MD5; it can also use SHA variants, Tiger, and Whirlpool. It is similar to the md5sum program, but md5deep has some additional features, which I quote from http://md5deep.sourceforge.net/:
  • Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
  • Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match. Hash sets can be drawn from Encase, the National Software Reference Library, iLook Investigator, Hashkeeper, md5sum, BSD md5, and other generic hash generating programs. Users are welcome to add functionality to read other formats too!
  • Time estimation - md5deep can produce a time estimate when it's processing very large files.
  • Piecewise hashing - Hash input files in arbitrary sized blocks
  • File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc.
I'll be taking a look at the source code and attempting to find optimization opportunities. The possibilities are:
  • Improving a hashing algorithm
  • Adding code that targets AArch64 specifically, without influencing other systems
  • Editing the makefile so it produces a faster program
I downloaded the source code from https://github.com/jessek/hashdeep/releases and ran the following commands to install it correctly:
sh bootstrap.sh
./configure
make
sudo make install
To test that it worked I hashed a few files and saved the hashes in a file. By saving the hashes you can later find matching files using md5deep's comparison feature. The next step will be benchmarking.
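That save-then-compare workflow looks roughly like the sketch below. It uses coreutils' md5sum and its check mode so it runs without md5deep installed; md5deep's matching works on the same idea (its -m option takes a list of known hashes):

```shell
# Hash a file, save the hash, then verify files against the saved list.
printf 'some test data\n' > sample.txt
md5sum sample.txt > known_hashes.txt      # save the known hash
result=$(md5sum -c known_hashes.txt)      # compare files against the list
echo "$result"                            # prints "sample.txt: OK" on a match
rm -f sample.txt known_hashes.txt
```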

Tuesday, March 15, 2016

Presentation Review

Today I presented the topics discussed in my previous blog post: -ftree-tail-merge and -floop-interchange. I used a PowerPoint to get the information across; I will put the slides at the end of this post.

Overall I thought the presentation went alright. I was able to show some code examples and explain how the options work. Sadly I wasn't able to compile my code, for a couple of reasons. First of all, the -floop-interchange option can only be used when GCC is configured with the --with-ppl and --with-cloog options. Apparently the GCC compiler we're using was only configured with --with-cloog, which explains why I wasn't able to spot differences in my assembler code.

For -ftree-tail-merge I do not know what I did wrong while compiling; I just could not see its effect in the assembler code. A tip I got from a classmate was to add a parameter flag that sets how many lines have to be identical before merging. I have yet to try this.

I think most of the class understood what I was saying because there weren't many questions and I don't think the options I chose were very difficult.

One thing I couldn't figure out was the difference between -ftree-tail-merge and -fcrossjumping, since they basically produce the same results. Apparently the difference is that -ftree-tail-merge runs on the intermediate representation GIMPLE, while -fcrossjumping runs in a later compilation step. In the source code you wouldn't see any difference between these options.

Powerpoint slides

Monday, March 14, 2016

Presentation Topics

The topics I chose for my presentation are the following optimization options. I'll be explaining the options with code examples and not assembler examples since I couldn't compile my code with the right compiler flags.





-floop-interchange
GCC Manual Description:
Perform loop interchange transformations on loops. Interchanging two nested loops switches the inner and outer loops. For example, given a loop like:
          DO J = 1, M
            DO I = 1, N
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO
loop interchange transforms the loop as if it were written:
          DO I = 1, N
            DO J = 1, M
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO
which can be beneficial when N is larger than the caches, because in Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially creating at each access a cache miss. This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppl and --with-cloog to enable the Graphite loop transformation infrastructure.

So after reading the manual and the wiki page, I went right into writing code, which developed into the following:
int main() {
    int array[1000][1000];
    for(int i=0; i<1000; i++){
        for(int j=0; j<1000; j++){
            array[j][i] = i*j;
        }
    }
}

When I tried to compile this with the -floop-interchange option, I saw no differences in the assembler code. Later I figured out that the compiler we're using is not configured with the --with-ppl option, so it made sense that there was no assembler to show for it.

After this I decided to just show in code what would change when the option is enabled, which resulted into the following:
int main() {
    int array[1000][1000];
    for(int j=0; j<1000; j++){
        for(int i=0; i<1000; i++){
            array[j][i] = i*j;
        }
    }
}

The only difference is that the loops have been switched, so the inner loop now goes through i instead of j.
This happens because C stores arrays in row-major order: array[j][i] and array[j][i+1] sit next to each other in memory, while array[j][i] and array[j+1][i] are a whole row (1000 ints) apart. In the original loop the inner index is j, the row index, so consecutive writes jump from row to row and each access can cause a cache miss: a whole cache line is fetched just for one value. After interchanging, the inner index is i, the column index, so it counts from [0][0] to [0][1] (see the table below) and one cache line read serves several values instead of one value per read.







-ftree-tail-merge
GCC Manual Description:
Look for identical code sequences. When found, replace one with a jump to the other. This optimization is known as tail merging or cross jumping. This flag is enabled by default at -O2 and higher. The compilation time in this pass can be limited using max-tail-merge-comparisons parameter and max-tail-merge-iterations parameter.


After researching this topic a lot I ended up writing the following code:
int main(int& a, int& b) {
    int c = 0;
    if(a > 100){
        a += 100;
        b += 200;
        c = a + b;
    } else {
        a += 300;
        b += 200;
        c = a + b;
    }
    return c;
}

As you can see, the b and c operations are duplicated in the if and the else. In this simple example the developer should see that this can be factored out by hand, but I kept it simple to show what -ftree-tail-merge does.
Sadly I couldn't get this to compile the right way either, because I probably missed some necessary flags.
After researching some more I found out that -ftree-tail-merge and -fcrossjumping have the same effect on this code, but -ftree-tail-merge is activated during a different step of compilation, on the intermediate representation called GIMPLE.
To show you what the difference in assembly code is, I used the -fcrossjumping flag because it gives the same results. I colour-aligned the code and assembler to show where everything happens and what the differences between the two assembler listings are.

gcc -O
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2
        addl    $100, %eax
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret
.L2:
        addl    $300, %eax
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret

gcc -O -fcrossjumping
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2
        addl    $100, %eax
        jmp     .L4
.L2:
        addl    $300, %eax
.L4:
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret

As you can see, the duplicated b and c operations have been removed from both branches: each branch now only updates a and then jumps straight to .L4, where the shared b and c operations are done once. In code it would look like the following:
int main(int& a, int& b)
{
   int c = 0;
   if (a > 100) {
      a += 100;
   } else {
      a += 300;
   }

   b += 200;
   c = a + b;

   return c;
}

Resources:
https://en.wikipedia.org/wiki/Loop_interchange
http://www.ucw.cz/~hubicka/papers/proj/node10.html
https://katecpp.wordpress.com/2015/09/19/cross-jumping/
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html