[FFmpeg-devel] Inline ASM vs. Intrinsics

Fri May 11 12:13:54 CEST 2007

Hi

On Fri, May 11, 2007 at 09:25:38AM +0200, Guillaume POIRIER wrote:
[...]
> > > My question is if they are not used because of performance or if they
> > > are a big NoNo because of some other reason.
> > >
> > > I know that by using inline asm one has most control over what is going
> > > on. However with intrinsics the code is sometimes shorter and easier to
> > > read,
> 
> That's true for Altivec intrinsics, but x86 intrinsics are really
> horrible IMHO. They encode the type of data in the intrinsic name rather
> than in the vector types.
> That means that with Altivec, you have vec_add() and vec_adds() to
> respectively do vector add, and vector saturated add, and on x86,
> you'd have _mm_add_epi8(), _mm_add_epi16(), _mm_add_epi32(),
> _mm_add_epi64(), _mm_adds_epi8(), _mm_adds_epi16(), _mm_adds_epu8(),
> _mm_adds_epu16().
> I think that this certainly isn't more readable, and that it's rather
> ugly to have a "typeless" extension to a C language, which is a
> strongly typed language.
> 
> Of course, when you have a SIMD ISA that evolves with each new CPU
> model, you have a harder time keeping things as clean as the Altivec
> intrinsics.
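
For illustration, a minimal sketch of the naming point using SSE2 intrinsics
from <emmintrin.h>. The helper names add_wrap_u8 and add_sat_u8 are made up
for this example, and the scalar fallback is an assumption added only so the
code also builds on non-x86 targets:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* hypothetical helper: add 16 unsigned bytes, wrapping on overflow */
void add_wrap_u8(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* the element width lives in the name: epi8 = packed 8-bit integers */
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(va, vb));
#else
    for (int i = 0; i < 16; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
#endif
}

/* hypothetical helper: add 16 unsigned bytes, clamping at 255 */
void add_sat_u8(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* saturation and signedness live in the name too: adds + epu8 */
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
#else
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
#endif
}
```

Both helpers take the same __m128i operands; only the intrinsic name says
whether they are treated as bytes, signed or unsigned, wrapping or saturating.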

the whole intrinsic thing is really nothing else than a different syntax
for asm. gcc could reorder instructions and it could allocate registers
optimally for the target CPU, but in practice it fails at both, and
hand-optimized code will generally beat what gcc generates on all CPUs.
also there's the issue you mention that different CPUs support different
instruction sets (3DNow! vs. SSE2, SSE3, ...), so in the end you have to
write the code multiple times anyway if you want it to be perfect, even
with intrinsics ...
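
a rough sketch of the "different syntax" point, assuming gcc-style extended
asm with the default AT&T operand order on an SSE2 target. add16_intrin,
add16_asm and add16 are hypothetical names, and the scalar branch is only
there so the example builds elsewhere. it shows the two spellings of one
instruction; it does not measure which one gcc handles better:

```c
#include <stdint.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* intrinsic spelling: the compiler picks registers and may reorder */
static __m128i add16_intrin(__m128i a, __m128i b)
{
    return _mm_add_epi16(a, b);
}

/* inline asm spelling: the instruction is named by hand; the "x"
 * constraints still leave the choice of xmm registers to the compiler */
static __m128i add16_asm(__m128i a, __m128i b)
{
    __asm__("paddw %1, %0" : "+x"(a) : "x"(b));
    return a;
}
#endif

/* hypothetical wrapper: adds 8 x int16 with wraparound */
void add16(int16_t dst[8], const int16_t a[8], const int16_t b[8], int use_asm)
{
#if defined(__SSE2__) && defined(__GNUC__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i r  = use_asm ? add16_asm(va, vb) : add16_intrin(va, vb);
    _mm_storeu_si128((__m128i *)dst, r);
#else
    /* scalar fallback for targets without SSE2 */
    (void)use_asm;
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)((uint16_t)a[i] + (uint16_t)b[i]);
#endif
}
```

either way the SIMD path only exists for SSE2; an MMX or 3DNow! machine
would need its own version, written in whichever syntax.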

what gcc should rather do is analyze C code and compile it to SIMD:
100% portable, no silly language extensions, and gcc can generate the
ideal code
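
as a sketch of what that input could look like: a saturating byte add in
plain C99, nothing vendor specific. add_sat_bytes is a hypothetical name.
whether a given gcc actually turns this loop into SIMD depends on the
version and on flags such as -O3 or -ftree-vectorize, so the output of
'gcc -S' is the only real answer:

```c
#include <stddef.h>
#include <stdint.h>

/* plain C, no extensions: saturating add of two byte arrays.
 * restrict promises the buffers do not overlap, which a vectorizer
 * needs to know before it can process several bytes per step. */
void add_sat_bytes(uint8_t *restrict dst, const uint8_t *restrict a,
                   const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}
```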

> 
> 
> > > although recent compilers are rather good in code generation. The
> > > intrinsics have one big advantage: you can use the very same code for
> > > x86-32 and x86-64 and on the latter the 8 extra registers are used
> > > automatically.
> >
> > I agree, and since gcc knows more about what intrinsics do than inline
> > assembly, gcc may optimize better with different march/mtune settings.
> >
> > However, gcc's register allocation algorithm sometimes does stupid
> > things, spilling registers when unnecessary. So, even if you write
> > everything in intrinsics, you should use 'gcc -S' to double check.
> 
> I've experimented a bit with ICC-9.1 (not with GCC though), and
> analysed the quality of the code generation. I'm pleased to say that
> it generates really good code in general, but in some cases, it does
> some stupid things that a human who has a tiny bit of ASM expertise
> would never write.
> 
> But in general, ICC did a really good job at generating code out of intrinsics.
> 
> I don't know about GCC, but I read a paper some months ago where the
> bleeding edge versions of GCC were able to beat ICC on synthetic
> benchmarks. I expect that on code that has a rather large data set,
> GCC will screw up its register allocation, where ICC should do better.

one problem with intrinsics also is that if the compiler screws up you
have to rewrite the code in asm; there's no working way to give it hints
about which variables should be in a register when. it's the same with C code

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I do not agree with what you have to say, but I'll defend to the death your
right to say it. -- Voltaire