[FFmpeg-devel] [rfc] qualification task: SSE2 IDCT

Alexander Strange astrange
Sun Mar 30 09:37:28 CEST 2008

I didn't have much time this week to do anything but school, but I've  
written a working SSE2 adaptation of simple_idct. It's not done yet,
since it's still too slow for me to accept it, but I've run out of  
obvious low-level optimizations with this approach and don't want to  
just disappear.

Current times from dct-test:
IDCT SIMPLE-C: 3610.0 kdct/s
IDCT SIMPLE-MMX: 12738.6 kdct/s
IDCT SIMPLE-SSE2: 9086.8 kdct/s

IDCT XVID-MMX: 6837.2 kdct/s
IDCT XVID-MMX2: 7819.4 kdct/s
IDCT XVID-SKAL-SSE2: 11803.0 kdct/s

(making minor changes to dct-test decreases all the times by 30% - I  
guess there's some code alignment problem there, but it doesn't affect  

The current problems that I see are:
* SSE pack/unpack are really slow on Core 2: 3-4 cycles vs. 1 for
MMX. I didn't realize this until I was done, so maybe I can get rid of  
some of them. Loading the same DCT into the MMX registers and using  
them for zero short-circuiting was actually faster than using SSE...
This might be better on A64 and Penryn.

* All the negations are folded into pmaddwd, so there aren't any psubd  
uses in the main part. I'm not sure if psubd/paddd can be parallelized  
any more than two adds.

* It doesn't use transposed input. How does SIMPLE_IDCT_PERM work? I  
can't really see how it would save any shuffling in the row transform,  
but then I haven't tried it. It seems like any other input order would  
make checking for all-zero row ACs slower, which is the most important  

* The column part might suck; it runs out of registers, so can't  
really be rescheduled, and I don't like the use of movq. Using  
transposed input and the row transform twice would avoid it, but there  
would have to be another transpose in the middle, using the slow  
punpcklwd. The one in simple_idct_mmx looks clean, but I haven't  
checked out how it works yet.

* This is really easy to port to AltiVec and might be faster than the
current idct-altivec, which is different from both simple and xvid idct.
I'll try to get around to writing simple_idct_altivec sometime.

* skal should license his idct under LGPL so I can port it from nasm  
without having to #ifdef it!
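
To illustrate the zero short-circuiting mentioned in the first point, here is a minimal scalar sketch in plain C, not the SSE2 code from the attachment. It assumes natural (untransposed) coefficient order, so the seven AC terms of a row sit next to each other and can be tested together. The function names are hypothetical, and DC_SHIFT = 3 is an assumption based on the 8-bit simple_idct row shortcut.

```c
#include <stdint.h>

/* Assumed scaling for the DC-only row case (8-bit simple_idct). */
#define DC_SHIFT 3

/* Returns 1 when coefficients 1..7 of the row are all zero.
 * In natural order these are contiguous, which is what lets a SIMD
 * version test them with a couple of wide loads and an OR. */
static int row_acs_are_zero(const int16_t row[8])
{
    unsigned acc = 0;
    for (int i = 1; i < 8; i++)
        acc |= (uint16_t)row[i];
    return acc == 0;
}

/* Hypothetical helper: if the row is DC-only, fill it with the scaled
 * DC value and return 1. Otherwise leave the row untouched and return 0
 * so the caller can run the full row transform. */
static int idct_row_shortcut(int16_t row[8])
{
    if (!row_acs_are_zero(row))
        return 0;

    int16_t dc = (int16_t)(row[0] * (1 << DC_SHIFT));
    for (int i = 0; i < 8; i++)
        row[i] = dc;
    return 1;
}
```

With a permuted or transposed layout the AC terms of one row would no longer be adjacent, so this test would need a gather or extra masking.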
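
The point about folding negations into pmaddwd can be shown with a scalar model of one 32-bit lane. This is only a sketch: pmaddwd_lane, mul_then_sub and folded are hypothetical names, and W1 and W3 are assumed to be the cosine constants from simple_idct. Storing -W3 in the constant table lets the multiply-add produce the difference directly, so no separate psubd is needed.

```c
#include <stdint.h>

/* Cosine constants, assumed to match the 8-bit simple_idct tables. */
#define W1 22725
#define W3 19266

/* Scalar model of one 32-bit lane of pmaddwd: two signed 16x16
 * products summed into a 32-bit result. */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t c0, int16_t c1)
{
    return (int32_t)a0 * c0 + (int32_t)a1 * c1;
}

/* Reference form: two multiplies followed by an explicit subtraction
 * (the step that would otherwise need psubd). */
static int32_t mul_then_sub(int16_t x1, int16_t x3)
{
    int32_t p = (int32_t)x1 * W1;
    int32_t q = (int32_t)x3 * W3;
    return p - q;
}

/* Folded form: the minus sign lives in the constant, so a single
 * multiply-add yields x1*W1 - x3*W3. */
static int32_t folded(int16_t x1, int16_t x3)
{
    return pmaddwd_lane(x1, x3, W1, -W3);
}
```

Both forms give the same result for any 16-bit inputs, since with these constants the sum of the two products always fits in 32 bits.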

dct-test patch depends on the last ones I posted.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dcttest-sse2.diff
Type: application/octet-stream
Size: 527 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add-sse2idct.diff
Type: application/octet-stream
Size: 2092 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_sse2.c
Type: application/octet-stream
Size: 10554 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment-0002.obj>
