Monday, March 28, 2016

MD5Deep Installation

The project I chose to optimize is called md5deep. This is a program to compute hashes for a number of files. Md5deep doesn't only hash files with md5 but can also use SHA variants, Tiger and Whirlpool. It is similar to the md5sum program but md5deep has some additional features, which I quoted from: http://md5deep.sourceforge.net/
  • Recursive operation - md5deep is able to recursive examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
  • Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match. Hashes sets can be drawn from Encase, the National Software Reference LibraryiLook InvestigatorHashkeepermd5sum, BSD md5, and other generic hash generating programs. Users are welcome to add functionality to read other formats too!
  • Time estimation - md5deep can produce a time estimate when it's processing very large files.
  • Piecewise hashing - Hash input files in arbitrary sized blocks
  • File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc.
I'll be taking a look at the source code and attempt to find optimization possibilities. The possibilities of optimizations are:
  • Improving a hashing algorithm
  • Add code so it works specifically on ARCH64, but not influencing the other systems
  • Edit the make file so it makes the program faster
I downloaded the source code from https://github.com/jessek/hashdeep/releases and ran the following commands to install it correctly:
sh bootstrap.sh
./configure
make
sudo make install
To test that it worked I hashed a few files and saved the hashes in a file. By saving the hashing you can find the matching files by using their comparison feature. The next step will be benchmarking.

Tuesday, March 15, 2016

Presentation Review

Today I presented my topics discussed in my previous blog: -ftree-tail-merge and -floop-interchange. I used a  powerpoint to get the information across, I will put the slides on the end of this post.

Overall I thought the presentation went alright. I was able to show some code examples and explain how the options work. Sadly I wasn't able to compile my code due to different reasons. First of all the -floop-interchange option can only be used when GCC is configured with the --with-ppland and --with-cloog option. Appearently the GCC compiler we're using was only configured with --with-cloog, so this explains why I wasn't able to spot differences in my assembler code.

For the -ftree-tail-merge I do not know what I did wrong while compiling, I just could not get results in the assembler code. A tip I got from a classmate was to add a limit flag where I set the amount of lines that have to be identical. I have yet to try this.

I think most of the class understood what I was saying because there weren't many questions and I don't think the options I chose were very difficult.

One thing I couldn't figure out was, what the difference was between -ftree-tail-merge and -fcrossjumping since they basically both get the same results. Appearently the difference is that the -ftree-tail-merge gets called in the intermediate language GIMPLE and the -fcrossjumping is called in a different compile step. In source code you wouldn't see any differences between these options.

Powerpoint slides

Monday, March 14, 2016

Presentation Topics

The topics I chose for my presentation are the following optimization options. I'll be explaining the options with code examples and not assembler examples since I couldn't compile my code with the right compiler flags.





-floop-interchange
GCC Manual Description:
Perform loop interchange transformations on loops. Interchanging two nested loops switches the inner and outer loops. For example, given a loop like:
          DO J = 1, M
            DO I = 1, N
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO
loop interchange transforms the loop as if it were written:
          DO I = 1, N
            DO J = 1, M
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO
which can be beneficial when N is larger than the caches, because in Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially creating at each access a cache miss. This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppland --with-cloog to enable the Graphite loop transformation infrastructure.

So after reading the manual and the wiki page, I went right into writing code, which developed into the following:
int main() {
    int array[1000][1000];
    for(int i=0; i<1000; i++){
        for(int j=0; j<1000; j++){
            array[j][i] = i*j;
        }
    }
}

When I tried to compile this with the -floop-interchange option, I saw no differences in the assembler code. Later I figured out that the compiler we're using is not configured with the --with-ppland option. So it made sense that there was no assembler to show for it.

After this I decided to just show in code what would change when the option is enabled, which resulted into the following:
int main() {
    int array[1000][1000];
    for(int j=0; j<1000; j++){
        for(int i=0; i<1000; i++){
            array[j][i] = i*j;
        }
    }
}

The only difference is that the loops have switched, so first it goes through j instead of i.
This happens because in C language it reads the cache in row-major order but the first int that increases is the column. It goes from [0][0] to [1][0] (See table below). So when it needs a value it gets the entire row for one value and the next row for the next value. This would be more effective if it got all the values from one row at once, this is why the loops are switched. It then counts from [0][0] to [0][1], so it can get 4 values from reading the cache once instead of one value per cache read.







-ftree-tail-merge
GCC Manual Description:
Look for identical code sequences. When found, replace one with a jump to the other. This optimization is known as tail merging or cross jumping. This flag is enabled by default at -O2 and higher. The compilation time in this pass can be limited using max-tail-merge-comparisons parameter and max-tail-merge-iterations parameter.


After researching this topic a lot I ended up writing the following code:
int main(int& a, int& b) {
    int c = 0;
    if(a > 100){
        a += 100;
        b += 200;
        c = a + b;
    } else {
        a += 100;
        b += 200;
        c = a + b;
    }    return c;
}

As you can see, there's duplicate code in the if and else. In this simple example the developer should see that this can be changed, but to show you what -ftree-tail-merge does I kept it simple.
Sadly I couldn't get this to compile the right way either, because I probably missed some necessary flags.
After researching some more I found out that -tree-tail-merge and -fcrossjumping have the same output when it comes to the code, but the -ftree-tail-merge gets activated in a different intermediate language step during compiling which is called GIMPLE.
To show you what the difference in assembly code is, I used the -fcrossjumping flag because it gives the same results. I colour aligned the code and assembler to let you know where everything happens and what the difference is between the assembler code.

gcc -O
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2
        addl    $100, %eax
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret
.L2:
        addl    $300, %eax
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret

gcc -O -fcrossjumping
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2
        addl    $100, %eax
        jmp     .L4
.L2:
        addl    $300, %eax
.L4:
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)
        addl    (%rdi), %eax
        ret

As you can see, it skips over the b and c operations in the if and jumps straight to L4 where the b and c operations are done. In code it would look like the following:
int main(int& a, int& b)
{
   int c = 0;
   if (a > 100) {
      a += 100;
   } else {
      a += 300;
   }

   b += 200;
   c = a + b;

   return c;
}

Resources:
https://en.wikipedia.org/wiki/Loop_interchange
http://www.ucw.cz/~hubicka/papers/proj/node10.html
https://katecpp.wordpress.com/2015/09/19/cross-jumping/
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html


Sunday, February 28, 2016

Lab 6 - Vectorization

The assignment for this lab is to fill two arrays with 1000 integers and add them together in a third array. The goal is to have the compiler recognize that it can use vectorization.
The code I wrote for this is:
#include <stdio.h>
#include <time.h> 
int main(void) {
srand(time(NULL));
    int arraySize = 1000;
    int arrayOne[arraySize], arrayTwo[arraySize], arrayThree[arraySize];
    for (int i = 0; i < arraySize; i++) {
        arrayOne[i] = rand() % 1000;
        arrayTwo[i] = rand() % 1000;
    }
 
    for (int j = 0; j < arraySize; j++) {
        arrayThree[j] = arrayOne[j] + arrayTwo[j];
    }
 
    long sum = 0;
    for (int k = 0; k < arraySize; k++) {
        sum += arrayThree[k];
    }
    printf("%d\n", sum);
    return 0;
}
To compile the code on archie we use the following compiler command:
gcc -x c vec.C -o vec -O3 -std=c99 -g
 I have not yet tested the code on archie, and only on my own pc. So that's on my TO-DO list.

Lab 5 - Algorithm Selection Lab

The assignment for this lab was to write two algorithms for adjusting volume of sound samples. We did this lab in groups. The first method is to scale a signed 16-bit integer by multiplying it by a volume scaling factor expressed as a float in the range of 0.000 - 1.000.
The first thing to do is create an array and fill it with many random values which act like the sound samples. After this we can write the scaling function. This is as simple as multiplying the sound sample and the scaling factor. We do this for all the sounds samples in the array.
int main()     //creating an array of samples     srand((int)time(NULL));     int numberOfElements = 500000000;     int16_t* sampleArray = (int16_t*)malloc(sizeof(int16_t) * numberOfElements);     fillSampleArray(sampleArray, numberOfElements); 
    int result = INT16_MIN;
    for (int i = 0; i < numberOfElements; i++) {         result = result ^  volumeScale(sampleArray[i], 0.750);
    }

    printf("%d", result);
    free(sampleArray);
    return 0;
}

void fillSampleArray(int16_t* arr, int numOfElems)
{
    for (int i = 0; i < numOfElems; i++)
    {
        arr[i] = ((rand() * 2) % (UINT16_MAX + 1)) + (INT16_MIN);
    }
}
int16_t volumeScale(int16_t sample, float scale)
{
    return (int16_t)sample*scale;
}

The second method is to do the same thing but by using a lookup table. This means creating an array of all the possible values, which would be 65536 possibilities. One lookup table is for one volume scale.
So first off you check if the entered volume scale is the previous volume, and if this is not the case, you make the lookup table for this volume scale. Then you change the previous volume variable to the current scale. This way the lookup table is created once for every volume scale. So instead of doing 500 million multiplications like the first method, now it only take 65+k multiplications. Whenever the volume is changed, another lookup table will be generated for that volume.

Now to compare the two methods on ARCH64 and x86_64.
On x86_64(Xerxes) the lookup table is 55% slower that the normal multiplications.
On Arch64(Aarchie) the lookup table is 22% faster than the normal multiplications.


Lab 7 - Inline Assembler Lab

The assignment was to find inline assembler code in an open source project. I chose the open source project sooperlooper. I downloaded the sourcecode from their site: http://essej.net/sooperlooper/download.html

"SooperLooper is a live looping sampler capable of immediate loop recording, overdubbing, multiplying, reversing and more. It allows for multiple simultaneous multi-channel loops limited only by your computer's available memory."

I extracted the tar file and proceeded to find the inline assembler. I used the following command:
     egrep -R "__asm__|asm\(*" *

This gave me all the usages of the terms __asm__ and asm(

The files where inline assembler was used were:
src/atomic.h
libs/pbd/pbd/atomic.h

Supported architecures
Time to take a closer look at the src/atomic.h file.
They support several kinds of architectures, I've listed them below as shown in the code.

__ASM_PPC_ATOMIC_H_   = PowerPC architecture
__i386__ || __x86_64__ / __ARCH_I386_ATOMIC__ = arch i386 architecture
__ARCH_SPARC_ATOMIC__ = arch sparc architecture
__ARCH_IA64_ATOMIC__ = arch ia64 architecture
__ALPHA_ATOMIC_H = alpha architecture
__ARCH_S390_ATOMIC__ = arch s390 architecture
__MIPS__ = mips architecture
__ARCH_M68K_ATOMIC__ = arch m68k architecture

They also have an option when it's none of the architectures listed above:
__NO_STRICT_ATOMIC

Assembler code
This code makes use of atomic operations to make looping more efficient than in normal C code.
The functions they made are:
  • atomic_add
  • atomic_sub
  • atomic_sub_and_test
  • atomic_inc
  • atomic_dec
  • atomic_dec_and_test
  • atomic_inc_and_test
  • atomic_add_negative
x86-specific
  • atomic_clear_mask
  • atomic_set_mask
To paste all the assembler code here would be a bit much, but I'll take the function atomic_add as an example. 
/**
 * atomic_add - add integer to atomic variable
 * @i: integer value to add
 * @v: pointer of type atomic_t
 *
 * Atomically adds @i to @v.  Note that the guaranteed useful range
 * of an atomic_t is only 24 bits.
 */
static __inline__ void atomic_add(int i, atomic_t *v)
{
    __asm__ __volatile__(
        SMP_LOCK "addl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));
}
They use __volatile__ here so the compiler is forced not to move the code around and keep the operations happening before this function before it, and whatever comes after this, stay after it.
So this atomic_add function basically just adds int i to the variable v.







Sunday, January 31, 2016

Lab 1 - Code Review

The purpose of this lab is to explore code review processes used in open source projects. We had to select two open source project that have different licenses.

Amarok
The first project I looked at was called Amarok. This is a iTunes like music player for Linux, Unix and Windows. https://amarok.kde.org/. This project is licensed under the GNU General Public License.

There are a few ways you can contribute to this project:
  • Create artwork for the application
  • Write code in C++ using the Qt and KDELibs toolkits
  • Write scripts that expand the application
To start contributing they recommend you take a look at the bugs and see if you can help with any bugs. Their bugtrackingsystem can be found at: https://bugs.kde.org/buglist.cgi?quicksearch=amarok

All the code that you write must be submitted to their reviewboard which can be found at https://git.reviewboard.kde.org/groups/amarok/?show-closed=1&page=1.
On this reviewboard all the members can comment on your contribution. However there are only a few members with commit rights. So when members comment that the code looks good and would be a nice addition, they'll most likely comment that someone with commit access should push this to the master branch. An example I found: https://git.reviewboard.kde.org/r/120930/#review69630

A more recent example of an actual submission: https://git.reviewboard.kde.org/r/2/
Here you can see that normal members comment on it and say to ship it, and a member with commit access commits the code. 

Chromium
The second project I looked at was Chromium, an open-source browser behind google chrome: http://www.chromium.org/Home. This project is licesned under the Creative Commons Attribution 2.5 License.

There are three major paths you can take to contribute to Chromium:
  • You have a new idea
  • You want to change exicting code
  • You want to work on an existing bug
When you have a new idea you want to add to Chromium you first have to post your idea to a discussion group. In this group they'll discuss if your idea if worth adding to the application. I couldn't find an example of this, because their discussion group is also for many other problems. It can be found here: https://groups.google.com/a/chromium.org/forum/#!forum/chromium-discuss

If you want to change existing code, you have to contact the person who originally wrote the code and ask if your idea would be good to add. 

Not all bugs in their system are assigned, so if it's not assigned youre allowed to pick it up and assign it to yourself. If the bug you want is already assigned you can contact the person it's assigned to and ask if you could help or take over. It's also possible to file your own bug and work on it, but these bugs cannot be whole new ideas. These bugs have to be nontrivial, like simple cleanups and style fixes. The listed bugs can be found here: https://code.google.com/p/chromium/issues/list

To submit your code you have to add it to the code review site: https://codereview.chromium.org/ 
Your code will be reviewd within 24-48 hours, if the patch is not too big. All members can comment on your code and give you feedback. Only some members have commit access. An exmaple I found: https://codereview.chromium.org/1650283002/

Overall Chromuim takes communication very seriously, if you're working on something you need to keep communicating to relevant people otherwise you might have a hard time getting your code reviewed.