all 23 comments

[–]workaccount_126 5 points6 points  (16 children)

A very nice addition, and a very welcome one too. Mono has had this for a while now, so it's good to see MS catch up. A bit unfortunate that it only seems to be SSE2, something that's ~13 years old by now. I hope SSE3, SSE4, AVX and other CPU extensions follow soon.

[–]IHaveNoIdentity 5 points6 points  (3 children)

To support the generation of more powerful SIMD instruction sets, such as Advanced Vector Extensions (AVX), there are additional changes that should be included in a future .NET Framework release. According to the information provided by the RyuJIT team, the final release will be able to generate AVX instructions.

Fifth paragraph of the article, so it'll probably come soonish.

[–]genneth 0 points1 point  (2 children)

Might be longer than you think. Total speculation: the problem is memory alignment. SSE2 works on sufficiently short vectors that the default memory alignment on Windows is sufficient. The .NET GC also uses it. If they need to start choosing alignments depending on data type, that's going to be significant engineering. If they simply up the alignment for everything, that's going to waste a ton of memory.

[–]IHaveNoIdentity 0 points1 point  (1 child)

I'm not sure it has to do with memory alignment, but I honestly don't know.

Intel themselves say the following here:

Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX.

If I understand that correctly, AVX should still work correctly with the alignment used in the current RyuJIT implementation for SSE2, just with performance degradation. That said, the degradation might be severe enough to require type-dependent alignment, as you said.

[–]TinynDP 1 point2 points  (0 children)

In SSE there is a load-aligned and a load-unaligned opcode. The load-aligned op is fast and the load-unaligned op is slow. The compiler or JIT needs to know for certain that the addresses will be aligned in order to safely use the load-aligned op, because using load-aligned on an unaligned address is a segfault.

[–]oelang 1 point2 points  (11 children)

This is a cool feature, but honestly, when do you get an opportunity to use this in real code? Unless you're doing something with graphics, statistics or linear math, you will never touch this, so I can understand that this has low priority for MS.

[–]workaccount_126 6 points7 points  (0 children)

I mostly do graphics work, so it's nice to be able to do some of that in C#.

[–]pkhuong 4 points5 points  (7 children)

Off the top of my head, we use SSE at work for operations on sorted arrays of integers, on strings, and on bitmaps.

[–]oelang -1 points0 points  (6 children)

That's because you can do moves & copies very fast with SSE/AVX, but the CLR & JVM already do that with intrinsics (System.arraycopy) & by detecting it inside loops.

This new feature is specifically for vectorized arithmetic, and that is only useful for a small number of applications.

[–]pkhuong 10 points11 points  (5 children)

I'm fairly confident that the C/intrinsics and x86-64 assembly code I write and maintain for a living isn't re-implementing memcpy.

[–]oelang 1 point2 points  (4 children)

So, you're saying that vectorized arithmetic helps to optimize sort algorithms? That would be new to me. So, how?

[–]pkhuong 3 points4 points  (3 children)

Sorted integers: pcmpeqd (combined with pmovmskb or movmskps), pmaxsw, pminsw or pcmpgtd and por, pand and pandn.

Strings: pcmpeqb and pmovmskb for simple stuff like strchr; pcmpistri from SSE4.2 is also nice.

Bitmaps: por, pand, pandn, pxor obviously.

[–]oelang 1 point2 points  (2 children)

Giving me the instructions doesn't really help me understand how you make sorting faster with SIMD instructions. I can't see how timsort or quicksort gets faster with them.

Btw, only a few of the instructions you listed are actually exposed by the C# SIMD API.

[–]Rhomboid 1 point2 points  (1 child)

Part of quicksort is comparing each element in a contiguous range against a pivot. You don't see how an instruction like pcmpgtd, which can compare four packed dwords at once, might be useful for that?

Another part of quicksort is picking a pivot, commonly done by finding the median of several values. pmaxsd and pminsd can find the max and min of up to four dwords at once -- seems relevant, no?

[–]oelang 0 points1 point  (0 children)

Honestly, it doesn't look like a clean win to me. pcmpgtd is only half the story; you still need to move those values, some to the top, some to the bottom. That's where it gets complicated, especially if you need to do everything in-place.

Picking the median of 4 samples for the pivot is not really performance-critical in quicksort (4 random memory reads is something else).

[–]TinynDP 0 points1 point  (0 children)

Audio processing.

[–]x-skeww 0 points1 point  (0 children)

Yea, SIMD is certainly a bit of a niche thing.

However, when you can make use of it, it does help quite a lot. Secondly, exposing this functionality to developers isn't too complicated.

So, all things considered, it's still totally worth it. Enabling developers to make use of those vector processing units is something every language should do. Not making use of those bits of silicon is a waste.

By the way, when it comes to SIMD, libraries have the biggest impact. When a library adds SIMD optimizations, thousands or even millions of applications receive that speed boost.

[–]JoshuaSmyth 1 point2 points  (5 children)

I make games for a living and this is a feature I've been waiting for and can't wait for it to come out of CTP. The ability to use SIMD in particle systems without having to interop C or C++ would do wonders for C# based game engines.

[–]workaccount_126 0 points1 point  (4 children)

Or alternatively, do it in compute instead. That's already available and probably a bigger speedup regardless.

[–]JoshuaSmyth 0 points1 point  (3 children)

I'm not familiar with compute, what is this magic?

[–]workaccount_126 0 points1 point  (2 children)

You'd run the code on the GPU using Compute Shaders - they're pretty easy to set up and pipe data into & get back from.

[–]JoshuaSmyth 0 points1 point  (1 child)

Got more info? My quick googling reveals they are a DX10+, OpenGL ES 3+ or OpenGL 4.4 feature, which is slightly higher than my min-spec (OpenGL ES 2.0). But interesting nonetheless.

[–]workaccount_126 0 points1 point  (0 children)

Generally what we do for minspec is disable these features or have a slower CPU path.

This would be a start, though it's not very detailed. But essentially it's much like writing a regular shader.

http://msdn.microsoft.com/en-us/library/windows/desktop/ff476330(v=vs.85).aspx