Aligned Functions and libSIMDx86

Aligned functions are not functions that are themselves aligned; rather, they are functions that require every pointer argument to be aligned on a 16-byte boundary. If an unaligned pointer is passed, the results are unpredictable: the program may appear to work normally, produce incorrect results, or crash. Such problems are also very difficult to debug.

These functions exist because many SIMD instruction sets provide special instructions that execute faster but require aligned pointers. The performance increase can be fairly dramatic, especially with Intel's SSE/SSE2/SSE3 instruction sets, where gains of 110% have been recorded in practice. Later, when batch processing functions are added, the aligned versions will easily outperform the unaligned versions, not only because they use faster instructions, but also because it is often possible to use fewer registers with a load-execute style of code rather than load-store-execute. This also improves code density and helps aggressive instruction schedulers do their job.
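To make the distinction concrete, here is a minimal sketch using the standard SSE intrinsics from <xmmintrin.h>. It is an illustration only, not code taken from libSIMDx86; the function names Add4_Unaligned and Add4_Aligned are hypothetical, and an x86 compiler with SSE enabled is assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical example: dst = a + b for four floats.
   Uses unaligned loads/stores (MOVUPS), so any float pointer is accepted. */
void Add4_Unaligned(float* dst, const float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));
}

/* Hypothetical example: same operation, but dst, a and b MUST all be
   16-byte aligned. Uses aligned loads/stores (MOVAPS); passing an
   unaligned pointer typically faults at run time. Because alignment is
   guaranteed, a compiler may fold one of the loads into the memory
   operand of ADDPS (the load-execute form), saving a register. */
void Add4_Aligned(float* dst, const float* a, const float* b)
{
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(dst, _mm_add_ps(va, vb));
}
```

Both functions compute the same result; the only difference is the alignment contract placed on the caller.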

The basic way to convert regular code to use an aligned function (if one exists) is simply to insert “Aligned” into the function name, right after the underscore:


SIMDx86Vector_Normalize() -> SIMDx86Vector_AlignedNormalize()


An address is aligned on a 16-byte boundary if its lowest four bits are all zero:


/* x86, 32 bit architectures */
int Is16ByteAligned(void* address)
{
    unsigned int addr = (unsigned int)address;

    /* Aligned when the low four bits are all clear */
    return ((addr & 0x0F) == 0);
}


/* x86, 64 bit architectures (assumes unsigned long is pointer-sized) */
int Is16ByteAligned(void* address)
{
    unsigned long addr = (unsigned long)address;

    /* Aligned when the low four bits are all clear */
    return ((addr & 0x0F) == 0);
}
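The two variants differ only in the integer type used to hold the address. As a sketch, a single version can be written with uintptr_t from <stdint.h>, which is defined as an unsigned integer type able to hold a pointer. The name IsAligned16 is hypothetical and not part of libSIMDx86.

```c
#include <stdint.h>  /* uintptr_t */

/* Hypothetical helper: returns 1 if address is a multiple of 16, else 0.
   uintptr_t is wide enough to hold a pointer, so the same source is
   intended to work on both 32-bit and 64-bit targets. */
int IsAligned16(const void* address)
{
    return (((uintptr_t)address & 0x0F) == 0);
}
```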




Although it may be tempting to write code that looks like:


if(Is16ByteAligned(MyVector))
    SIMDx86Vector_AlignedNormalize(MyVector);
else
    SIMDx86Vector_Normalize(MyVector);


...do not give in. In many cases the overhead of the comparison and branch outweighs the benefit gained. Once batch functions are introduced, however, the check becomes worthwhile, because the cost of a single branch is amortized over a large set of data. Imagine transforming 1,000,000 vertices: a single test is enough to choose between the aligned and the unaligned function. Because the time taken to test the address is tiny compared with the potential gain from using the aligned function, the check should be made.
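The amortization argument can be sketched as follows. Since the batch functions do not exist yet, everything here is hypothetical: the Vec4 type, the two scalar stand-in routines (a real build would put SIMD code there), and ScaleBatch itself, which uses a simple scale as a placeholder operation. The point is only that the alignment test runs once per batch rather than once per element.

```c
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uintptr_t */

/* Hypothetical 16-byte vector type, used only for this illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* Stand-in for an unaligned single-vector routine (scalar placeholder). */
static void Scale_Unaligned(Vec4* v, float s)
{
    v->x *= s; v->y *= s; v->z *= s; v->w *= s;
}

/* Stand-in for an aligned single-vector routine (scalar placeholder). */
static void Scale_Aligned(Vec4* v, float s)
{
    v->x *= s; v->y *= s; v->z *= s; v->w *= s;
}

/* Scales count vectors. The alignment check is made exactly once; since
   sizeof(Vec4) is 16, an aligned first element implies every element is
   aligned. Returns 1 if the aligned path was chosen, 0 otherwise. */
int ScaleBatch(Vec4* v, size_t count, float s)
{
    int aligned = (((uintptr_t)v & 0x0F) == 0);
    void (*fn)(Vec4*, float) = aligned ? Scale_Aligned : Scale_Unaligned;
    size_t i;

    for (i = 0; i < count; i++)
        fn(&v[i], s);

    return aligned;
}
```

With a million elements, the single comparison and branch at the top is negligible next to the work done in the loop.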

While global and stack data can usually be aligned simply by telling the compiler to do so, there have been many questions about pointers returned from malloc() and new. First off, these functions are NOT required to return a pointer aligned on a 16-byte boundary. To enforce such alignment, a custom function needs to be written.
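One common way to write such a function is to over-allocate with malloc(), round the address up to the next 16-byte boundary, and store the original pointer just below the aligned block so it can be freed later. The sketch below assumes that approach; the names AlignedMalloc16 and AlignedFree16 are hypothetical and are not libSIMDx86 functions. Where available, facilities such as C11 aligned_alloc() or POSIX posix_memalign() may serve the same purpose.

```c
#include <stdint.h>  /* uintptr_t, SIZE_MAX */
#include <stdlib.h>  /* malloc, free */

/* Hypothetical helper: returns a 16-byte aligned block of at least size
   bytes, or NULL on failure. Must be released with AlignedFree16(). */
void* AlignedMalloc16(size_t size)
{
    void* raw;
    uintptr_t aligned;

    /* Extra space: up to 15 bytes of padding plus one saved pointer. */
    if (size > SIZE_MAX - 15 - sizeof(void*))
        return NULL;

    raw = malloc(size + 15 + sizeof(void*));
    if (raw == NULL)
        return NULL;

    /* Leave room for the saved pointer, then round down to a multiple
       of 16; the result is always at least raw + sizeof(void*). */
    aligned = ((uintptr_t)raw + sizeof(void*) + 15) & ~(uintptr_t)0x0F;

    /* Remember the address malloc() returned, just below the block. */
    ((void**)aligned)[-1] = raw;

    return (void*)aligned;
}

/* Hypothetical helper: frees a block obtained from AlignedMalloc16(). */
void AlignedFree16(void* ptr)
{
    if (ptr != NULL)
        free(((void**)ptr)[-1]);
}
```

A pointer returned by AlignedMalloc16() must never be passed directly to free(), because it is generally not the address that malloc() originally returned.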