Road Map for libSIMDx86

Conversion To Separate Assembly Language Files

One of the biggest problems with libSIMDx86 is that the code is written with inline assembly language rather than out-of-line assembly. This means that libSIMDx86 really only works with the GNU C compiler, which understands that syntax. Switching to NASM or YASM may be required.

Code Overlays

Rather than statically linking to a separate library for each processor, it would be a huge win to allow code overlays. Basically, for each function there is a large buffer that holds the code actually executed when the function is called. A call would initialize the library, either selecting a specific instruction set or detecting the best one, depending on the parameter. The initialization function would then copy the optimal code into the buffer, so that when the function is called, the copied code is executed.

This requires some operating-system-specific work. Most code segments on x86 operating systems are marked read-execute, meaning that they may be read and executed, but an attempt to write to them crashes the program. An operating-system-specific interface would be needed to mark the buffer read-write-execute temporarily so that it can be written to. In Win32 environments this would be a call to VirtualProtect(), and on (most) UNIX environments (including Linux) this would be a call to mprotect(). After the code is overlaid, the segment would be restored to whatever protection it had before.

Aside from choosing the instruction set, any runtime decision can be made this way. For example, Athlon 64 processors prefer a movlps/movhps pair to load 16 bytes of unaligned data and suffer a penalty for using movups, while Intel processors can use movups but suffer a performance penalty for the movlps/movhps pair. By choosing the "Intel optimized" path on Intel processors and the "AMD optimized" path on AMD processors, this library could be even better.

Version Specific Goals/Plans

Version v0.4

This release should mostly be a performance boost for existing 0.3.x code. Improvements in the 3DNow! vector-by-matrix, quaternion, and vector cross product routines would be good. Starting a Vector4 type would also be a good idea.

Version v0.5

More code profiling, and more aligned functions as well. Batch functionality would be a win: for example, transforming a large number of vectors by the same matrix. Perhaps some sine and cosine estimates too? Intel has released SIMD source code that could be adapted to do some other common math functions quickly.

Version v0.6-0.9

??? What API needs to be added? More acceptance from the general programming community would be nice as well.

Version 1.0

Heavily optimized, with all functions implemented for all relevant code paths, including batch and aligned variants.