SSE instruction optimization challenges in C/C++

Source: Internet | Published: sql having count | Editor: 程序博客网 | Date: 2024/04/29 20:38

Many suggest that the core of a ray tracer should be implemented using SSE/SSE2, and several SIMD-based ray tracers have even been published. People say the performance gains are amazing…

Very attractive, isn’t it… But SSE/SSE2 puts a critical constraint on the data its instructions can handle: the aligned load/store instructions, and the memory operands of most SSE arithmetic instructions, need 16-byte aligned data, or there’ll be runtime errors. Unaligned loads and stores do exist, but they have traditionally been slower. It is a typical trade-off between performance and convenience. (Another, more common, trade-off is between performance and memory footprint.)
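To make the constraint concrete, here is a minimal sketch, assuming an x86 target and the intrinsics from `<xmmintrin.h>`. The helper name `add4` and the `SKETCH_HAS_SSE` macro are hypothetical and only there for illustration. `_mm_load_ps` / `_mm_store_ps` require a 16-byte aligned address, while `_mm_loadu_ps` / `_mm_storeu_ps` accept any address; the sketch picks between them at run time and falls back to scalar code when SSE is not available.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_load_ps, ...
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0 // assumption: no SSE here, use the scalar path
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
inline void add4(const float* a, const float* b, float* out) {
    assert(a && b && out);
#if SKETCH_HAS_SSE
    // OR the three addresses together: the low 4 bits are all zero
    // only if every pointer sits on a 16-byte boundary.
    const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(a) |
                                reinterpret_cast<std::uintptr_t>(b) |
                                reinterpret_cast<std::uintptr_t>(out);
    if (bits % 16 == 0) {
        // Aligned path: would fault at run time on a misaligned pointer.
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        // Unaligned path: valid for any address.
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];  // scalar fallback
#endif
}
```

Calling `add4(a, b, out)` on `alignas(16)` arrays takes the aligned path; calling `add4(a + 1, b + 1, out + 1)` on the same arrays takes the unaligned one.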

Because of this awkward constraint, data meant for SSE/SSE2 has to be placed carefully (with Visual C++):
    -- heap vars: _aligned_malloc (released with _aligned_free)
    -- global vars: __declspec(align(16))
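The two placements above can be sketched as follows. This is only an illustration under stated assumptions: the names `alloc16`, `free16`, `on_boundary16`, `g_origin` and the `SKETCH_ALIGN16` macro are hypothetical, and the non-MSVC branch assumes a POSIX system with `posix_memalign` and uses the standard C++11 `alignas` in place of `__declspec(align(16))`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>       // posix_memalign, free (POSIX branch)
#if defined(_MSC_VER)
#include <malloc.h>       // _aligned_malloc, _aligned_free
#define SKETCH_ALIGN16 __declspec(align(16))
#else
#define SKETCH_ALIGN16 alignas(16)   // standard C++11 spelling
#endif

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool on_boundary16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: heap block aligned to 16 bytes, or nullptr on failure.
inline void* alloc16(std::size_t bytes) {
#if defined(_MSC_VER)
    void* p = _aligned_malloc(bytes, 16);
#else
    void* p = nullptr;
    if (posix_memalign(&p, 16, bytes) != 0) p = nullptr;
#endif
    assert(p == nullptr || on_boundary16(p));
    return p;
}

// Hypothetical helper: releases a block obtained from alloc16.
inline void free16(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);     // must pair with _aligned_malloc
#else
    free(p);              // posix_memalign blocks are released with free
#endif
}

// Global variable placed on a 16-byte boundary.
SKETCH_ALIGN16 float g_origin[4] = {0.0f, 0.0f, 0.0f, 1.0f};
```

The same decorator also works on local variables, for example `SKETCH_ALIGN16 float tmp[4];` inside a function.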

An Intel guy successfully implemented an SSE-based ray tracer: http://software.intel.com/en-us/articles/architecture-of-a-real-time-ray-tracer/ I didn’t see any alignment decorators in it, so I guess he used the Intel compiler… Another paper on this:

http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2009.73

 

Another constraint: std::vector cannot contain __declspec(align(16)) data.
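One common way around this kind of limitation is to give std::vector an allocator that hands out 16-byte aligned storage. The sketch below is not from the original post: `AlignedAllocator` and `Vec4` are hypothetical names, the non-MSVC branch assumes POSIX `posix_memalign`, and whether it helps with a particular older Visual C++ version is something I have not verified. Note that since C++17, standard containers are expected to handle over-aligned types with the default allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>            // std::bad_alloc
#include <vector>
#include <stdlib.h>       // posix_memalign, free (POSIX branch)
#if defined(_MSC_VER)
#include <malloc.h>       // _aligned_malloc, _aligned_free
#endif

// Hypothetical minimal C++11 allocator returning Align-byte aligned storage.
template <class T, std::size_t Align = 16>
struct AlignedAllocator {
    using value_type = T;
    // rebind must be spelled out because Align is a non-type parameter.
    template <class U> struct rebind { using other = AlignedAllocator<U, Align>; };

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        assert(n > 0);
        void* p = nullptr;
#if defined(_MSC_VER)
        p = _aligned_malloc(n * sizeof(T), Align);
#else
        if (posix_memalign(&p, Align, n * sizeof(T)) != 0) p = nullptr;
#endif
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
#if defined(_MSC_VER)
        _aligned_free(p);
#else
        free(p);
#endif
    }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Hypothetical SSE-friendly element type: four floats on a 16-byte boundary.
struct Vec4 { alignas(16) float v[4]; };
```

Usage would look like `std::vector<Vec4, AlignedAllocator<Vec4>> rays(1024);`, after which `rays.data()` can be passed to aligned loads.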

 

At the same time, some complain that hand-written SSE code is no faster than code optimized by VC. PBRT doesn’t use SSE in its source either, but it does pass an SSE option to the compiler in its makefile. Maybe modern compilers are smart enough to generate SSE/SSE2 code on their own, who knows… Anyway, it is still worth further investigation.
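For that investigation, a small pair of functions like the following could serve as a starting point. It is only a sketch: `scale_add_scalar`, `scale_add_sse` and `SKETCH_HAS_SSE` are hypothetical names, and no timings are claimed here. The plain loop is the kind of code an optimizing compiler can often vectorize by itself when SSE code generation is enabled (for example `-msse2` with GCC or `/arch:SSE2` with Visual C++), while the second function does the same work with explicit intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   // SSE intrinsics
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0 // assumption: no SSE here, only the scalar tail runs
#endif

// Plain C++ version: y[i] += s * x[i]. Left to the compiler to vectorize.
inline void scale_add_scalar(float s, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += s * x[i];
}

// Hand-written version: four floats per iteration, then a scalar tail.
inline void scale_add_sse(float s, const float* x, float* y, std::size_t n) {
    assert(x && y);
    std::size_t i = 0;
#if SKETCH_HAS_SSE
    const __m128 vs = _mm_set1_ps(s);            // broadcast s to all 4 lanes
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);   // unaligned loads: no
        const __m128 vy = _mm_loadu_ps(y + i);   // alignment assumption
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(vs, vx)));
    }
#endif
    for (; i < n; ++i) y[i] += s * x[i];         // leftover elements
}
```

Timing both functions on large arrays, built with and without the SSE compiler option, would show how much the hand-written version actually buys on a given compiler.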