SPO600 Blog: February 2016

Lab 6 - Vectorization

The assignment for this lab is to fill two arrays with 1000 integers and add them together in a third array. The goal is to have the compiler recognize that it can use vectorization.
The code I wrote for this is:

#include <stdio.h>
#include <time.h>

int main(void) {
srand(time(NULL));
int arraySize = 1000;
int arrayOne[arraySize], arrayTwo[arraySize], arrayThree[arraySize];
for (int i = 0; i < arraySize; i++) {
arrayOne[i] = rand() % 1000;
arrayTwo[i] = rand() % 1000;
}

for (int j = 0; j < arraySize; j++) {
arrayThree[j] = arrayOne[j] + arrayTwo[j];
}

long sum = 0;
for (int k = 0; k < arraySize; k++) {
sum += arrayThree[k];
}
printf("%d\n", sum);
return 0;
}

To compile the code on archie we use the following compiler command:

gcc -x c vec.C -o vec -O3 -std=c99 -g

I have not yet tested the code on archie, and only on my own pc. So that's on my TO-DO list.

Lab 5 - Algorithm Selection Lab

The assignment for this lab was to write two algorithms for adjusting volume of sound samples. We did this lab in groups. The first method is to scale a signed 16-bit integer by multiplying it by a volume scaling factor expressed as a float in the range of 0.000 - 1.000.
The first thing to do is create an array and fill it with many random values which act like the sound samples. After this we can write the scaling function. This is as simple as multiplying the sound sample and the scaling factor. We do this for all the sounds samples in the array.

int main() { //creating an array of samples srand((int)time(NULL)); int numberOfElements = 500000000; int16_t* sampleArray = (int16_t*)malloc(sizeof(int16_t) * numberOfElements); fillSampleArray(sampleArray, numberOfElements);
int result = INT16_MIN;

    for (int i = 0; i < numberOfElements; i++) {         result = result ^ volumeScale(sampleArray[i], 0.750);
    }

    printf("%d", result);
    free(sampleArray);
    return 0;
}

void fillSampleArray(int16_t* arr, int numOfElems)
{
    for (int i = 0; i < numOfElems; i++)
    {
        arr[i] = ((rand() * 2) % (UINT16_MAX + 1)) + (INT16_MIN);
    }
}

int16_t volumeScale(int16_t sample, float scale)
{
return (int16_t)sample*scale;
}

The second method is to do the same thing but by using a lookup table. This means creating an array of all the possible values, which would be 65536 possibilities. One lookup table is for one volume scale.
So first off you check if the entered volume scale is the previous volume, and if this is not the case, you make the lookup table for this volume scale. Then you change the previous volume variable to the current scale. This way the lookup table is created once for every volume scale. So instead of doing 500 million multiplications like the first method, now it only take 65+k multiplications. Whenever the volume is changed, another lookup table will be generated for that volume.

Now to compare the two methods on ARCH64 and x86_64.
On x86_64(Xerxes) the lookup table is 55% slower that the normal multiplications.
On Arch64(Aarchie) the lookup table is 22% faster than the normal multiplications.

Lab 7 - Inline Assembler Lab

The assignment was to find inline assembler code in an open source project. I chose the open source project sooperlooper. I downloaded the sourcecode from their site: http://essej.net/sooperlooper/download.html

"SooperLooper is a live looping sampler capable of immediate loop recording, overdubbing, multiplying, reversing and more. It allows for multiple simultaneous multi-channel loops limited only by your computer's available memory."

I extracted the tar file and proceeded to find the inline assembler. I used the following command:
egrep -R "__asm__|asm\(*" *

This gave me all the usages of the terms __asm__ and asm(

The files where inline assembler was used were:
src/atomic.h
libs/pbd/pbd/atomic.h

Supported architecures
Time to take a closer look at the src/atomic.h file.
They support several kinds of architectures, I've listed them below as shown in the code.

__ASM_PPC_ATOMIC_H_ = PowerPC architecture
__i386__ || __x86_64__ / __ARCH_I386_ATOMIC__ = arch i386 architecture
__ARCH_SPARC_ATOMIC__ = arch sparc architecture
__ARCH_IA64_ATOMIC__ = arch ia64 architecture
__ALPHA_ATOMIC_H = alpha architecture
__ARCH_S390_ATOMIC__ = arch s390 architecture
__MIPS__ = mips architecture
__ARCH_M68K_ATOMIC__ = arch m68k architecture

They also have an option when it's none of the architectures listed above:
__NO_STRICT_ATOMIC

Assembler code
This code makes use of atomic operations to make looping more efficient than in normal C code.
The functions they made are:

atomic_add
atomic_sub
atomic_sub_and_test
atomic_inc
atomic_dec
atomic_dec_and_test
atomic_inc_and_test
atomic_add_negative

x86-specific

atomic_clear_mask
atomic_set_mask

To paste all the assembler code here would be a bit much, but I'll take the function atomic_add as an example.

/**
* atomic_add - add integer to atomic variable
* @i: integer value to add
* @v: pointer of type atomic_t
*
* Atomically adds @i to @v. Note that the guaranteed useful range
* of an atomic_t is only 24 bits.
*/
static __inline__ void atomic_add(int i, atomic_t *v)
{
    __asm__ __volatile__(
        SMP_LOCK "addl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));
}

They use __volatile__ here so the compiler is forced not to move the code around and keep the operations happening before this function before it, and whatever comes after this, stay after it.
So this atomic_add function basically just adds int i to the variable v.

Sunday, February 28, 2016

Lab 6 - Vectorization

Lab 5 - Algorithm Selection Lab

Lab 7 - Inline Assembler Lab