The optimized SIMD library for x86 processors.
What is libSIMDx86?
libSIMDx86, also known as just SIMDx86, is an optimized math library meant for developers of 3D game engines, 3D visualizations, 3D software rasterizers, and other simulations that use SIMD (Single Instruction, Multiple Data) instructions. Later it will support pixel operations for images and digital signal processing.
How does it work?
The simple answer is with SIMD instructions. Ever since the Pentium MMX, x86 microprocessors have had special instructions that perform operations that cannot be expressed as a single operation in C/C++ or in many other programming languages. Often these instructions can be used to unroll a loop or to do many operations per clock cycle, such as operating on four values at once (vector processing) instead of just one (scalar processing). The end result is code that runs anywhere from 0% to 2000% faster, depending on the circumstance and processor.
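To make the scalar/vector distinction concrete, here is a minimal sketch in C. It uses the compiler's SSE intrinsics from `<xmmintrin.h>` rather than the hand-written assembly that libSIMDx86 itself uses, and the function names `add_scalar` and `add_vector` are made up for this illustration; they are not part of the library. When the compiler does not target SSE, the vector version simply falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, ... */
#endif

/* Scalar processing: one addition per loop iteration. */
void add_scalar(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vector processing: four additions per instruction when SSE is available. */
void add_vector(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 sums in one instruction */
    }
#endif
    /* Leftover elements, or everything when SSE is not compiled in. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Both functions produce the same sums; the vector version just gets through the bulk of the array four elements at a time.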
But I already wrote my library in C/C++, why use this?
Code written in C/C++ must be translated into assembly language by your compiler. Since the compiler often misses the larger picture and focuses on a per-line or per-element basis, it often outputs scalar instructions (i.e. instructions that work on a single value at a time) even when, to a human, the parallelism is apparent. Very few compilers output vector (SIMD) instructions, because doing so requires an extremely sophisticated compiler. libSIMDx86, however, is about 95% written in assembly language using vector (SIMD) instructions and does not need to be translated into a lower-level language. Also, since humans can think about a problem, the library contains some interesting solutions that are not available to non-thinking compilers.
Why create this library?
Let's face it, compilers are getting much better. Even before these special SIMD instructions, people used to write the most critical parts of their code in assembly language in order to get as much performance out of their processor as possible. Now, it seems that using basic assembly language gets you code that is merely on par with the compiler's. Compilers such as the GNU C Compiler have mastered the i386 instruction set and all of its quirks. However, SIMD output has always been a difficult task for compilers, since the constructs of SIMD-capable code are usually quite difficult to translate, and the author's intent is even harder to see. Efforts have been made, but even the simplest SIMD constructions still baffle compilers. Until that day far in the future when compilers output SIMD code correctly and efficiently, this library will allow programmers to take advantage of SIMD operations that have been lying dormant in their processors since the Pentium MMX (1997). The results can range from no improvement in performance to mild or dramatic ones.
Another problem is that high levels of abstraction often cause the programmer to forget the hardware the program is running on. Learning assembly language is considered obsolete for all but the most esoteric purposes, such as writing an OS kernel or hacking programs. However, knowledge of the machine's architecture and instruction set allows people to take advantage of the true power of their processor and squeeze every ounce of performance out of it. In short, if you do not want to learn SIMD assembly language but want to take advantage of a powerfully optimized library without sweating, this is for you. A thing to note is that, out of all of SourceForge, there are no relatively complete libraries available that use SIMD instructions.
What about Intel and AMD? Don't they make a similar library?
Indeed they do, and I wouldn't be half surprised if they outperformed libSIMDx86 right now. However, one has to understand that each company not only has a team of extremely diligent programmers, but has also written a library that is optimal for its own implementation of the x86 architecture. That means, given two processors, one from Intel and one from AMD, that perform perfectly equally in everything else, the Intel math library will work better on the Intel processor and the AMD library will work better on the AMD processor! Why is this? AMD and Intel know the best way of doing things for their architectures. Sometimes the optimization is at the hardware level, other times at the software level. Hardware-level optimization, i.e. programming for a single processor such as the Athlon64, is one advantage that the Intel/AMD math libraries will have over this one. However, this library will attempt to blend performance for many processors and even support older processors, not just the latest and greatest Pentium 4 or Athlon64 (at the time of this writing). Another thing to note is that their libraries are not free, and not open source by any means.
What SIMD instruction sets does libSIMDx86 support?
libSIMDx86 supports Intel's MMX, SSE, SSE2, and SSE3, and AMD's 3DNow!, 3DNow!+, and MMX+. Additionally, standard x87 (i.e. non-SIMD) versions of the functions have been provided as a fallback and a control. There is no support for Cyrix's extensions to MMX: their processors are next to impossible to find, and documentation on the instructions is mostly non-existent. If someone can send me documents and a Cyrix processor/motherboard, I will definitely add support for it.
I'll try it. How do I use the functions/where can I find examples?
Examples are included with the source tree, and documentation for the library can be found by clicking the link at the top, or here. If you have any questions on the library's use, you may also email the author.
How can I build my special version of libSIMDx86?
See the top link on building, or click here.
How can I build libSIMDx86 for non-x86 processors?
Ah... right... well, there had to be at least one who would ask, right?
Portability is a big issue for cross-platform developers, especially those targeting a different architecture altogether.
Despite the name, it is possible to run libSIMDx86 on non-x86 processors: compile without defining a SIMD instruction set. As of version 0.3.1, the non-SIMD version is written entirely in portable C, so it could run on a PowerPC64 or UltraSPARC machine, for example. Obviously, it won't be faster than a normal C library of similar functions, but at least it can compile and link without big #ifdef ... #else ... #endif blocks all over the place.
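As a rough illustration of how a compile-time switch can keep the #ifdef inside the library rather than in your code, consider the sketch below. The macro `EXAMPLE_USE_SSE` and the function `example_dot4` are hypothetical names invented for this example; libSIMDx86's real build options and function names are described in its own documentation. The SSE path here uses compiler intrinsics, not the library's assembly.

```c
/* Hypothetical build switch: compile with -DEXAMPLE_USE_SSE to pick the SSE path.
   Without it (or on a non-x86 target) the portable C version is built instead. */
#if defined(EXAMPLE_USE_SSE) && defined(__SSE__)
#include <xmmintrin.h>

/* SSE version: multiply all four lanes at once, then sum the lanes. */
float example_dot4(const float a[4], const float b[4])
{
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
#else
/* Portable C version: compiles on any architecture. */
float example_dot4(const float a[4], const float b[4])
{
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}
#endif
```

Calling code sees a single `example_dot4` either way, so it needs no conditional compilation of its own.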
So if it can run under non-x86 architectures, do you intend to port it to use other architectures' SIMD instructions?
Well, yes and no. The name libSIMDx86 itself somewhat defies the idea of “porting” to other architectures. However, if I had some documentation and hardware, I probably could implement some functions. For example, if I had a PowerPC with VMX, or an UltraSPARC II with VIS, I could do some accelerated functions. I would like to make this library the ultimate solution for SIMD processing, a sort of one-stop shop, although I might rename the project if it gets that far.
Can I build libSIMDx86 as a DLL or SO?
I suppose it is possible, but it is definitely not supported officially. Perhaps you could edit the build script under UNIX/Linux and add -fPIC for position-independent code. Note that this will probably degrade performance. Because the library is written mostly in assembly language, I don't see why statically linking it into a project would increase the size of the binary that much. Some people hate having multiple binaries though, and that is perfectly understandable too. However, if it is implemented correctly with what are known as code overlays, you will effectively have a statically linked DLL/SO file. See the Roadmap section for more information.
Can you make libSIMDx86 select a function set at runtime rather than compile time?
Yes, actually I can. The ultimate goal of libSIMDx86 is to have overlaid code. That is, rather than using expensive function pointers like a DLL or SO file, libSIMDx86 is statically linked into your program. At runtime, an instruction set (e.g. MMX, SSE) is selected, and the library overwrites its code with the optimized version. The result: all the flexibility of a DLL or SO file, with all the speed of a statically linked library. Doing this requires the operating system's help, and thus makes libSIMDx86 not portable to all operating systems, or rather, it requires an explicit port rather than a mere recompilation.
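For comparison, the function-pointer style of runtime selection that overlays are meant to avoid might look like the sketch below. This is not the overlay mechanism and not libSIMDx86 code: `scale_fn`, `scale_generic`, `scale_sse`, and `select_scale` are hypothetical names, and the CPU check relies on the `__builtin_cpu_supports` builtin that GCC provides for x86 targets. On other compilers or architectures the sketch always returns the generic version.

```c
#include <stddef.h>

#if defined(__GNUC__) && defined(__SSE__) && (defined(__i386__) || defined(__x86_64__))
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE_PATH 1  /* hypothetical macro, local to this sketch */
#endif

/* Every call goes through this pointer type: that indirection is the cost. */
typedef void (*scale_fn)(float *v, size_t n, float s);

/* Plain C fallback, usable on any processor. */
static void scale_generic(float *v, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= s;
}

#ifdef EXAMPLE_HAVE_SSE_PATH
/* SSE version: scales four floats per instruction. */
static void scale_sse(float *v, size_t n, float s)
{
    size_t i = 0;
    __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(v + i, _mm_mul_ps(_mm_loadu_ps(v + i), vs));
    for (; i < n; i++)
        v[i] *= s;
}
#endif

/* Pick an implementation once, at run time, based on what the CPU reports. */
scale_fn select_scale(void)
{
#ifdef EXAMPLE_HAVE_SSE_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return scale_sse;
#endif
    return scale_generic;
}
```

A program would call `select_scale()` once at startup, keep the returned pointer, and call through it afterwards. An overlay scheme aims for the same flexibility without the indirect call.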
I can't compile this under Microsoft Visual C++ 6, 7, or 8! What is wrong?
libSIMDx86 only compiles successfully under GCC (that I know of). Sorry, the problem is the inline assembly language syntax: the code uses GCC's AT&T/GNU syntax, which MSVC++ does not accept.
But I need to use libSIMDx86 under my MSVC++ project!
Well, you have a few options.
* You could wait until the code overlay version comes out.
* You could compile it as a DLL using GCC (which is not supported by this project, but possible), then use runtime linking.
* You could convert the AT&T/GNU syntax to Intel/MSVC syntax and then recompile it. If you do this, please send me the resulting work; I wouldn't mind having an MSVC++ version.
* You could make the project into separate .asm files.
* You could use only the C version of the project (but then you would have no SIMD acceleration)
* You could convert your project to use GCC instead of MSVC++ (it is not as hard as you think!)
Couldn't you just convert the project to separate .asm files to assemble?
Yup, I could.
Aren't you going to do it?
Eventually (see the road map), but there is a lot of work involved. The next release will include this.
There is a function I need but isn't implemented, can you do it for me?
It is possible: just send an email to the project admin(s), or post a feature request on the website (below).
Something is broken here! How do I submit a bug/patch?
Eeek! Since libSIMDx86 is written mostly in assembly language, it is very possible that a bug crept in. If you think you have found a bug and can isolate it to a function, then by all means submit a patch (if you can) or a bug report at the website at SourceForge.net:
A surefire way of confirming a bug in a function is to build libSIMDx86 in x87 mode, so that it uses correct but non-SIMD instructions, and then compare the outputs of the x87 and SIMD versions of the function given the same input.
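A small helper for that comparison could look like the following sketch. The name `first_mismatch` is hypothetical and not part of libSIMDx86. It assumes you have already captured the outputs of the x87 build and the SIMD build into two float arrays, and it compares them with a tolerance, on the assumption that the two builds may legitimately differ in the last few bits because they round intermediate results differently.

```c
#include <stddef.h>

/* Returns the index of the first element where the outputs differ by more
   than tol, or -1 if every element agrees within tol. A NaN in either
   array counts as a mismatch. */
long first_mismatch(const float *ref, const float *test, size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - test[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))  /* written this way so that NaN is also caught */
            return (long)i;
    }
    return -1;
}
```

Feeding the same inputs to both builds and reporting the returned index in your bug report makes it much easier to isolate which function, and which component of its result, is wrong.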