[FFmpeg-devel] [rfc] qualification task: SSE2 IDCT

Michael Niedermayer michaelni
Sun Mar 30 12:49:23 CEST 2008

On Sun, Mar 30, 2008 at 03:37:28AM -0400, Alexander Strange wrote:
> I didn't have much time this week to do anything but school, but I've 
> written a working SSE2 adaption of simple_idct. It's not done yet, since 
> it's still too slow for me to accept it, but I've run out of obvious 
> low-level optimizations with this approach and don't want to just 
> disappear.

Hmm, I thought you were working on an AP922/945 SSE2 IDCT (which would have been
easier because there's a paper from Intel which lists SSE2 code ...)
Of course I am happy with a simple_idct based one as well, though this is
non-trivial and I doubt a little that it can reach svn in time.

> Current times from dct-test:
> IDCT SIMPLE-C: 3610.0 kdct/s
> IDCT SIMPLE-MMX: 12738.6 kdct/s
> IDCT SIMPLE-SSE2: 9086.8 kdct/s
> IDCT XVID-MMX: 6837.2 kdct/s
> IDCT XVID-MMX2: 7819.4 kdct/s
> IDCT XVID-SKAL-SSE2: 11803.0 kdct/s

The first thing I must say, and I am not sure if you are happy to hear this ...
dct-test is useless for speed testing. You have to use actual videos because
the distribution of zero and non-zero elements differs.
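
To illustrate why the zero distribution matters, here is a rough sketch in C.
It is not taken from dct-test; fill_lowfreq_block(), rows_with_zero_ac() and
next_rand() are hypothetical helpers, and "only the top-left k x k corner is
non-zero" is a crude stand-in for real video, not a measured distribution.

```c
#include <stdint.h>
#include <string.h>

/* Tiny LCG so the sketch needs no external state (hypothetical helper). */
static unsigned next_rand(unsigned *seed)
{
    *seed = *seed * 1103515245u + 12345u;
    return (*seed >> 16) & 0x7fff;
}

/* Zero the block, then put non-zero values only in the top-left k x k
 * corner. This is an assumed, very rough model of a quantized block. */
static void fill_lowfreq_block(int16_t block[64], int k, unsigned *seed)
{
    memset(block, 0, 64 * sizeof(block[0]));
    for (int y = 0; y < k; y++)
        for (int x = 0; x < k; x++)
            block[y * 8 + x] = (int16_t)(next_rand(seed) % 255 + 1);
}

/* Count the rows whose seven AC coefficients are all zero, i.e. the rows
 * for which a row IDCT with a zero check could take its short path. */
static int rows_with_zero_ac(const int16_t block[64])
{
    int count = 0;
    for (int y = 0; y < 8; y++) {
        int ac = 0;
        for (int x = 1; x < 8; x++)
            ac |= block[y * 8 + x];
        count += (ac == 0);
    }
    return count;
}
```

With k = 2 most rows can take the short path, with k = 8 none can, so an IDCT
with zero checks will show very different speeds depending on which kind of
block the benchmark feeds it.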

> (making minor changes to dct-test decreases all the times by 30% - I guess 
> there's some code alignment problem there, but it doesn't affect accuracy)

> The current problems that I see are:
> * SSE pack/unpack are really slow on Core 2 - 3/4 cycles vs. 1 for MMX. I 
> didn't realize this until I was done, so maybe I can get rid of some of 
> them. Loading the same DCT into the MMX registers and using them for zero 
> short-circuiting was actually faster than using SSE...
> This might be better on A64 and Penryn.

> * All the negations are folded into pmaddwd, so there aren't any psubd uses 
> in the main part. I'm not sure if psubd/paddd can be parallelized any more 
> than two adds.

psubd costs the same as paddd, AFAIK.

> * It doesn't use transposed input. How does SIMPLE_IDCT_PERM work? I can't 
> really see how it would save any shuffling in the row transform, but then I 
> haven't tried it. It seems like any other input order would make checking 
> for all-zero row ACs slower, which is the most important bit.

IIRC, the MMX code works with 2 rows at a time and it avoids all shuffles
and unpacks.
For SSE2 this is not possible I think, but you can still significantly
reduce their number.

> * The column part might suck; it runs out of registers, so can't really be 
> rescheduled, and I don't like the use of movq. Using transposed input and 
> the row transform twice would avoid it, but there would have to be another 
> transpose in the middle, using the slow punpcklwd. The one in 
> simple_idct_mmx looks clean, but I haven't checked out how it works yet.

The simple MMX IDCT has a dozen or so handwritten column IDCTs which are
selected based on which parts were zero. You should do something similar
for SSE2 instead of doing redundant branches. But first you need to think
about how to order the elements to minimize shuffles and moves.
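
A minimal sketch in C of what "selected based on which parts were zero" could
look like. The three variants and the names nonzero_row_mask() and
pick_column_variant() are made up for illustration; the real MMX code has
more cases and picks them differently.

```c
#include <stdint.h>

/* Hypothetical set of specialized column IDCTs; only the selection logic
 * is sketched here, not the kernels themselves. */
enum col_variant {
    COL_DC_ONLY,   /* only row 0 holds non-zero coefficients */
    COL_ROWS_0_3,  /* rows 4..7 are entirely zero */
    COL_FULL       /* general case */
};

/* Bit r of the result is set if row r has any non-zero coefficient. */
static unsigned nonzero_row_mask(const int16_t block[64])
{
    unsigned mask = 0;
    for (int r = 0; r < 8; r++) {
        int any = 0;
        for (int c = 0; c < 8; c++)
            any |= block[r * 8 + c];
        if (any)
            mask |= 1u << r;
    }
    return mask;
}

/* Decide once per block which column routine to run, rather than
 * branching on zeros again inside every column. */
static enum col_variant pick_column_variant(unsigned mask)
{
    if ((mask & ~1u) == 0)
        return COL_DC_ONLY;
    if ((mask & 0xF0u) == 0)
        return COL_ROWS_0_3;
    return COL_FULL;
}
```

In a real implementation the mask would fall out of the zero checks the row
pass does anyway, so the selection costs almost nothing extra.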

> * This is really easy to altivec 

No, not if done properly.

> and might be faster than the current 
> idct-altivec,

yes that could be

> static inline void simple_idct_sse2_row(int16_t *dct)
> {
>     int16_t *row = dct;
>     asm volatile (
>     "movq "MANGLE(m1000)", %%mm2\n"
>     "pxor %%xmm1, %%xmm1\n"
>     "lea 128(%0), %%"REG_c"\n"

>     ".align 4,0x90\n"

Does this really do any good? Also, alignment should be achievable with fewer
instructions than 1-byte nops.

>     "0:\n" 
>     "movdqa (%0), %%xmm0\n"
>     "movq (%0), %%mm1\n" // mask out the DC and check if it's zero
>     "movq %%mm2, %%mm0\n" // on core 2 this is actually faster in MMX than SSE
>     "pandn %%mm1, %%mm0\n" 
>     "por 8(%0), %%mm0\n"
>     "packssdw %%mm0, %%mm0\n"
>     "movd %%mm0, %%eax\n"
>     "testl %%eax, %%eax\n"
>     "jnz 1f\n"

>     "pshuflw $0, %%xmm0, %%xmm0\n" // skip the whole IDCT if all AC values are zero

>     "pshufd $0, %%xmm0, %%xmm0\n"


The asm code is not really reviewed because this first needs some high-level
thought on how to design it properly, not just taking the elements and
using whatever shuffles and unpacks are needed to somehow get a correct result.
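
For reference, the zero-AC shortcut at the top of the quoted row loop amounts
to roughly the following in scalar C. This is only a sketch: row_shortcut() is
a hypothetical name, and dc_shift stands in for whatever scaling the row pass
applies, which the quoted excerpt does not show.

```c
#include <stdint.h>

/* If all seven AC coefficients of the row are zero, replace the row by the
 * scaled DC value and return 1; otherwise leave it alone and return 0 so
 * the caller runs the full row IDCT.
 * dc_shift is an assumed parameter. No overflow handling is done, so this
 * is only meaningful for DC values that still fit in 16 bits after scaling. */
static int row_shortcut(int16_t row[8], int dc_shift)
{
    int ac = 0;
    for (int i = 1; i < 8; i++)
        ac |= row[i];
    if (ac)
        return 0;

    /* Multiply instead of shifting to avoid left-shifting a negative value. */
    int16_t dc = (int16_t)(row[0] * (1 << dc_shift));
    for (int i = 0; i < 8; i++)
        row[i] = dc;
    return 1;
}
```

The SSE2 version does the same thing with one mask, one OR and a broadcast of
the low word, which is why the check is cheap when it hits.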

Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch