Thursday, April 21, 2016

Trying to get my changes upstream

As noted before, my optimization was to change the compiler flag from -O2 to -O3. This increased the speed by 11% on x86_64. During the test phase of this optimization I changed the compiler flags in the Makefiles directly. To get this change accepted upstream, I'd have to find the place where the Makefiles are generated.

This would normally be the configure script. However, I could not find where the flags for the Makefiles were set, which I thought was very strange, because I am very sure that when you configure and make the program, it compiles with the -O2 flag.

I used grep to find files that mention -O2, and the only hit was an instructions file explaining how to add -O2 manually while configuring, not where it is set as a default.
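A search like this can be sketched as below. Because the pattern starts with a dash, it is passed with -e so that grep does not read it as an option. The helper name find_flag is made up for illustration.

```shell
# find_flag DIR PATTERN: list every file and line under DIR that contains PATTERN.
# Hypothetical helper name; -e keeps grep from treating "-O2" as an option.
find_flag() {
    grep -rn -e "$2" "$1"
}

# Example: find_flag . '-O2'
```

The same helper works for the CFLAGS search: find_flag . 'CFLAGS'.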

Then I used grep to find CFLAGS and where it is defined. What I discovered is that they use a pthread macro that helps find the right flags for compilation (https://manned.org/pthread-config/b1cf9868). I quoted this from the md5deep documentation:
#   This macro figures out how to build C programs using POSIX threads. It
#   sets the PTHREAD_LIBS output variable to the threads library and linker
#   flags, and the PTHREAD_CFLAGS output variable to any special C compiler
#   flags that are needed.
I did not know how to change the pthread configuration so that it would always use -O3. Their instructions said that the README file contains an email address for questions or remarks about the compiler options. The address was not in the file, however, so I could not contact them directly either.
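For what it is worth, configure scripts generated by Autoconf normally accept compiler flags on the command line. When CFLAGS or CXXFLAGS is not set, Autoconf falls back to -g -O2 for GCC, which would explain why no file in the tree mentions -O2 as a default. A minimal sketch, assuming md5deep's configure script is a standard Autoconf one (I have not verified this against the project); the helper name configure_with_o3 is hypothetical:

```shell
# configure_with_o3 SRCDIR: run SRCDIR/configure with -O3 instead of the default.
# Hypothetical helper; assumes an Autoconf-generated configure script that
# honours CFLAGS and CXXFLAGS given as arguments.
configure_with_o3() {
    (
        cd "$1" || exit 1
        ./configure CFLAGS="-g -O3" CXXFLAGS="-g -O3"
    )
}

# Example: configure_with_o3 . && make
```

This only changes the flags for one build. Making -O3 the project default would still mean changing whatever generates the configure script.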

On that note, I'm sad to share that I could not get the right flags into the configure step. This means I could not commit anything to the GitHub project, because I could not contact them to ask for help or an explanation.

Sunday, April 10, 2016

Compiling Optimization Betty

After benchmarking x86_64 with the -O3 optimization flag, it's time to test this on AArch64.
Since I can only work from the command line on the server, I needed a command that replaces -O2 with -O3 in every Makefile. The command I found the easiest was the following:
find -name Makefile | xargs sed -i 's/-O2/-O3/' 
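A slightly more careful variant of that command is sketched below. It prints the matching lines first so the change can be reviewed, keeps a .bak copy of every Makefile, and adds the g modifier so that more than one -O2 on the same line is replaced (without it, sed only changes the first match per line). The helper name swap_opt_level is made up for illustration.

```shell
# swap_opt_level ROOT: replace -O2 with -O3 in every Makefile under ROOT.
# Hypothetical helper name, not part of md5deep.
swap_opt_level() {
    # Preview: show each file and line that currently mentions -O2.
    find "$1" -name Makefile -exec grep -Hn -e '-O2' {} + || true
    # Rewrite in place, keeping the original next to it as Makefile.bak.
    find "$1" -name Makefile -exec sed -i.bak 's/-O2/-O3/g' {} +
}

# Example: swap_opt_level .
```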
The following benchmarks were done on AArch64, on the server Betty.

With -O2 flag
        10.5 MB file   105 MB file   1.5 GB file
real    0m0.037s       0m0.345s      0m4.792s
user    0m0.028s       0m0.323s      0m4.551s
sys     0m0.001s       0m0.027s      0m0.400s

With -O3 flag
        10.5 MB file   105 MB file   1.5 GB file
real    0m0.036s       0m0.343s      0m4.768s
user    0m0.028s       0m0.323s      0m4.499s
sys     0m0.001s       0m0.027s      0m0.426s

As you can see, -O3 made almost no difference on AArch64. I thought this was very strange and checked whether the executable had changed at all. It had: it got larger, as expected. Yet there is no real gain in speed. Comparing the real time for the 1.5 GB file again (4.792 s versus 4.768 s), -O3 was only about 0.5% faster, which is small enough to be measurement noise. So for AArch64 I recommend staying with -O2, because the run time barely changes and the executable is smaller.

For Betty I'll have to find another optimization possibility, although I don't know yet what that would be. Probably something written specifically for AArch64, but that would cost a lot more time.

Another way to go is to add the compiler flag -fopt-info-missed to find missed optimizations and see if I can do something about them. Source: https://gcc.gnu.org/onlinedocs/gccint/Dump-examples.html
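Trying that out on a single source file could look roughly like this. It is only a sketch: it assumes GCC is the compiler, and the helper name report_missed and the file names in the example are made up.

```shell
# report_missed SRC OUT: compile SRC (a .c file) at -O3 and write GCC's notes
# about optimizations it could not apply to the file OUT.
# Hypothetical helper name; requires gcc on the PATH.
report_missed() {
    gcc -O3 -fopt-info-missed="$2" -c "$1" -o "${1%.c}.o"
}

# Example: report_missed sha1.c missed.txt
```

Without the =file part, GCC prints the same notes to stderr, so the flag can also simply be appended to CFLAGS in the Makefiles.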


Compiling Optimization x86_64

As mentioned in the previous post I am going to take a look at the compiler flags and how I might be able to optimize them.

The current flags used for the C compiler:
-pthread -g -pg -O2 -MD -D_FORTIFY_SOURCE=2 -Wpointer-arith -Wmissing-declarations -Wmissing-prototypes -Wshadow -Wwrite-strings -Wcast-align -Waggregate-return -Wbad-function-cast -Wcast-qual -Wundef -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmissing-format-attribute -Wmultichar -Wc++-compat -Wmissing-noreturn -funit-at-a-time -Wall -Wstrict-prototypes -Weffc++
As you can see, there are a lot of flags for compiling this program with the C compiler. What caught my eye was that they're using the -O2 flag and not -O3, so I saw this as an optimization possibility and went straight in to change it. Once I had changed all the flags, I benchmarked the program again to see if it had helped at all.

To check if it did anything, I looked at the executable file. The -O3 optimization made the file larger: it went from 2.7 MB to 3.5 MB.

Note: Don't forget to take out the -pg flag before benchmarking. -pg makes the program slower, so you won't get a fair comparison with it enabled.

The following benchmarks are for x86_64

With -O2 flag
        10.5 MB file   105 MB file   1.5 GB file
real    0m0.053s       0m0.287s      0m3.458s
user    0m0.021s       0m0.200s      0m2.793s
sys     0m0.003s       0m0.012s      0m0.184s

With -O3 flag
        10.5 MB file   105 MB file   1.5 GB file
real    0m0.045s       0m0.258s      0m3.071s
user    0m0.020s       0m0.197s      0m2.756s
sys     0m0.002s       0m0.014s      0m0.177s

Comparing the real time for the 1.5 GB file (3.458 s versus 3.071 s), the -O3 flag makes the program 11.2% faster, which is a pretty nice optimization. So I recommend they replace the -O2 flag with -O3.
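The percentage is the reduction in real time relative to the -O2 run. A small sketch of the arithmetic, using the real times of the 1.5 GB file from the tables above (the helper name speedup_pct is made up):

```shell
# speedup_pct OLD NEW: percentage reduction in run time going from OLD to NEW
# seconds, rounded to one decimal. Hypothetical helper name.
speedup_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

speedup_pct 3.458 3.071   # prints 11.2
```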

Now it's time to benchmark the software on Betty and hope to see an improvement there as well.


G-Profiling

To enable profiling with gprof you need to add the -pg flag to the compiler options. These options can be found in the Makefile. Just look for the variables "CFLAGS", "CPPFLAGS", and "LDFLAGS". Add the -pg flag after the existing options and profiling should be enabled.

At first I couldn't figure out why my software wasn't producing the gmon.out file, which gprof needs as input. I had added -pg to all the flags in the top-level Makefile, but there turned out to be more Makefiles where the flags needed changing. There were 6 Makefiles in total, so I added -pg to the compiler flags in all of them.
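Editing six Makefiles by hand is easy to get wrong, so the same change can be scripted. The sketch below assumes each of the three variables is assigned on a single line of the form "NAME = value". The helper name add_pg is made up, and running it twice would append -pg twice.

```shell
# add_pg ROOT: append -pg to the CFLAGS, CPPFLAGS and LDFLAGS assignments in
# every Makefile under ROOT, keeping a .bak copy of each file.
# Hypothetical helper name; assumes single-line "NAME = value" assignments.
add_pg() {
    find "$1" -name Makefile -exec \
        sed -i.bak -E 's/^(CFLAGS|CPPFLAGS|LDFLAGS)[[:space:]]*=.*/& -pg/' {} +
}

# Example: add_pg . && make clean && make
```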

Luckily this did generate the gmon.out file so I could take a look at the profiling by running the following command:
gprof md5deep > gprof.txt
This generates a readable file which contains the profiling. The strange thing was that my gprof file said that every function it called took 0 seconds, even though it said that the function got called 8 times. It probably went through a lot of functions quickly and didn't register time.

The function that gets called 8 times is hash_final_sha1 which looks like this:
void hash_final_sha1(void *ctx, unsigned char *sum)
{
    sha1_finish((sha1_context *)ctx, sum);
}
Since it's a one-liner there isn't much I can optimize here. But it does call other functions which I can take a look at. I followed the chain of calls until I found a function that actually does work itself instead of just forwarding to another function. I ended up at the following function:
void sha1_update( sha1_context *ctx, const unsigned char *input, size_t ilen )
{
    size_t fill;
    unsigned long left;

    if( ilen <= 0 )
        return;

    left = ctx->total[0] & 0x3F;
    fill = 64 - left;

    ctx->total[0] += (unsigned long) ilen;
    ctx->total[0] &= 0xFFFFFFFF;

    if( ctx->total[0] < (unsigned long) ilen )
        ctx->total[1]++;

    if( left && ilen >= fill )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, fill );
        sha1_process( ctx, ctx->buffer );
        input += fill;
        ilen  -= fill;
        left = 0;
    }

    while( ilen >= 64 )
    {
        sha1_process( ctx, input );
        input += 64;
        ilen  -= 64;
    }

    if( ilen > 0 )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, ilen );
    }
}
Looking at it, I could not find a way to optimize this function. Since it took a while to find a suitable function and I couldn't find an optimization, I'm going to look into the compiler flags next: they use the -O2 flag, and it should be possible to change this to -O3.