[FFmpeg-devel] [flamefest-start] A little something on MMX/SSE intrinsics
Thu Feb 28 22:15:35 CET 2008
Michael Niedermayer wrote:
> I feel like I am talking to brick walls. The point is that intrinsics
> are flawed because they are unpredictable. gcc could generate efficient
> code from them, but it can just as well (and does in current versions on x86)
> generate completely dismal code. This does not go away if gcc becomes better
> at generating code.
gcc isn't predictable even when managing asm blocks, as we have experienced
with the register-constrained architectures... (yes, x86 again)
> We write asm/intrinsics because gcc did NOT compile the C code to something
> efficient in at least some cases. Asm is optimized once and will then always
> be efficient for the cpu class for which it has been optimized. That is, it's
> a write-once-and-forget thing. Intrinsics OTOH are at the mercy of the
> current compiler version and require constant maintenance to ensure that they
> don't get miscompiled to something inefficient.
I cannot agree more; in fact, having a set of asm routines for G3, G5,
CELL and pa-semi would be great. The same could be said of asm for P4 and
amd64, since they are _quite_ different in the end.
Sparing some pain and using intrinsics to get quite similar results for
the whole PPC/PPC64 or x86/x86_64 families wouldn't be bad as a starting point.
> You can disagree that 5% speed difference does matter, in which case one
> might get away with intrinsics.
No, even 3% can be important, BUT getting a large gain for every cpu
in a family with minimal effort is nicer than getting the theoretical
maximum for a small subset and leaving the rest in a painful situation.
> But the key advantage asm() has IMO is that the compiler can NOT second guess
> what the programmer wanted, it can NOT reorder the instructions behind the
> programmers back and it can NOT silently put unneeded load+stores between
The main issue with intrinsics is that they are more often than not ugly and
do not deliver what they are supposed to, but that is just an
implementation detail that could be ironed out with a little cooperation
between users and developers.
You can get silent load+store or, even better, have the whole outer loop
pessimized due to bogus constraints in the asm block...
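
To make the trade-off concrete, here is a minimal sketch (not FFmpeg code) of the same saturating 16-bit add written as plain C, as SSE2 intrinsics and as a GCC-style asm block. The function names are made up for illustration, and the fallback for non-SSE2 builds is an assumption added so the sketch compiles everywhere. Note the "memory" clobber on the asm block: it is the simple, safe way to tell gcc about the loads and stores, but it also forces the compiler to assume all memory changed, which is the kind of constraint that can pessimize a surrounding loop.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C reference: saturating add of 8 signed 16-bit values. */
static void add_sat_c(int16_t *dst, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 8; i++) {
        int v = a[i] + b[i];
        dst[i] = (int16_t)(v > INT16_MAX ? INT16_MAX
                         : v < INT16_MIN ? INT16_MIN : v);
    }
}

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* Intrinsics: the compiler chooses registers and instruction order. */
static void add_sat_intrin(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epi16(va, vb));
}

/* Inline asm: instructions and their order are fixed by the programmer.
 * The "memory" clobber covers the loads and the store, at the cost of
 * making the compiler assume every memory location may have changed. */
static void add_sat_asm(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __asm__ volatile(
        "movdqu (%1), %%xmm0    \n\t"
        "movdqu (%2), %%xmm1    \n\t"
        "paddsw %%xmm1, %%xmm0  \n\t"
        "movdqu %%xmm0, (%0)    \n\t"
        :
        : "r"(dst), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
}
#else
/* Assumed fallback so the sketch still builds without SSE2. */
static void add_sat_intrin(int16_t *dst, const int16_t *a, const int16_t *b)
{
    add_sat_c(dst, a, b);
}
static void add_sat_asm(int16_t *dst, const int16_t *a, const int16_t *b)
{
    add_sat_c(dst, a, b);
}
#endif

/* Check that all three variants agree, including the saturating lanes. */
static void add_sat_selfcheck(void)
{
    const int16_t a[8] = { 1, 2, 3, 32767, -32768, 100, -100, 0 };
    const int16_t b[8] = { 1, 2, 3, 1, -1, -100, 100, 0 };
    int16_t r_c[8], r_i[8], r_a[8];

    add_sat_c(r_c, a, b);
    add_sat_intrin(r_i, a, b);
    add_sat_asm(r_a, a, b);
    for (int i = 0; i < 8; i++) {
        assert(r_c[i] == r_i[i]);
        assert(r_c[i] == r_a[i]);
    }
}
```

Comparing the output of `gcc -O2 -S` for the intrinsics and asm variants across compiler versions is one way to see how much the generated code for the former moves around.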
> It's a fundamental difference, not something which will go away as gcc becomes
> better at compiling intrinsics (if that ever will happen ...).
It's a fundamental difference, both approaches have good and bad sides
depending on what you want to consider more important.
> Also, just in case anyone is curious about ICC performance with intrinsics,
> Intel's application note about the SSE2 IDCT (AP-945 Using SSE2 to implement an
> Inverse Discrete Cosine Transform) contains a plain asm and an intrinsics
> version, both with benchmarks:
> SSE2 ASM 0.255 microseconds
> SSE2 intrinsics 0.277 microseconds
> there's an 8% speed loss
> As far as i can see the only people supporting intrinsics either
> A. can't code asm
> B. never properly compared asm and intrinsics
C. have a different take on the issue even if they recognize it the
same way you do, probably because they experience different limitations
due to different arches or different quirks in the compiler/asm mix.
> If I am wrong, please show me an example with altivec asm which you hand
> tuned (instructions optimally selected and ordered by hand based on read and
> understood datasheets for the target cpu and the final instruction ordering
> selected by benchmark trial and error) and benchmark results against the
> equivalent intrinsic code.
I can pick libmotovec and libfreevec and have fun building both with and
without -mcpu=cell on the PS3. Ah, no, libmotovec cannot
build on ppc64...
> It seems our disagreement is not about intrinsics vs. asm being better but
> about the minimum quality and performance of the code. 5% speedloss is not
> acceptable! Even much smaller speedlosses need some justification.
No, you missed my point, and I agree completely with yours.
> Yes asm is harder to write, but for that you get 5% more speed.
asm gives you the most, but only if you spend enough time tuning for the
specific cpu and ignore the others in the same family.
> And code quality standards in ffmpeg are high, writing 5% slower code is
> plain unacceptable.
I could say that the x86 asm routines that happen to work by
hack on x86_64 are in that range; still better than plain C, isn't it?
Having both and benchmarking which is better per arch would be the best. =)
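
As a rough illustration of "benchmark which is better per arch", a sketch like the following could time two candidate routines on the target machine. Everything here is hypothetical: `kernel_fn`, `bench_seconds_per_call` and `count_kernel` are names invented for the example, and `clock()` is a coarse timer, so real measurements would need many iterations and repeated runs. The `speed_loss_percent` helper just expresses a slowdown the way the AP-945 figures quoted above are read: 0.255 versus 0.277 microseconds comes out at roughly 8%.

```c
#include <time.h>

/* Hypothetical signature for a routine under test. */
typedef void (*kernel_fn)(void *ctx);

/* Average wall-clock-ish cost of one call, in seconds (coarse: uses clock()). */
static double bench_seconds_per_call(kernel_fn fn, void *ctx, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fn(ctx);
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)iters;
}

/* Speed lost by the slower variant relative to the faster one, in percent:
 * 100 * (1 - fast_time / slow_time). */
static double speed_loss_percent(double fast_time, double slow_time)
{
    return 100.0 * (1.0 - fast_time / slow_time);
}

/* Returns 0 if candidate f0 measured at least as fast as f1, else 1. */
static int pick_faster(kernel_fn f0, kernel_fn f1, void *ctx, long iters)
{
    double t0 = bench_seconds_per_call(f0, ctx, iters);
    double t1 = bench_seconds_per_call(f1, ctx, iters);
    return t0 <= t1 ? 0 : 1;
}

/* Stand-in kernel for the example: counts how often it was called. */
static void count_kernel(void *ctx)
{
    long *counter = (long *)ctx;
    (*counter)++;
}
```

In practice the two candidates would be the asm and the intrinsics (or plain C) versions of the same routine, and the choice would be made once per cpu class rather than on every run.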
So, what about having a SoC on this kind of stuff?
Gentoo Council Member