[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Dmitry Antipov dmantipov
Wed May 21 16:23:52 CEST 2008

Siarhei Siamashka wrote:

> This is strange. If we assume that back-to-back WLDRD instructions
> introduce 1 cycle stall and WLDRD result latency is 3 cycles (like
> WMMX2 optimization manual describes), "pix_sum_iwmmxt2_pipelined" 
> should have no stalls except for a few unavoidable ones in the 
> very function epilogue.
> While your version should have a lot more additional stalls 
> because of back-to-back loads (22 cycles). And all the Intel
> WMMX manuals clearly state that CPU can't sustain the rate of
> loading 64-bit data on each cycle, so your code is not optimal.

For WMMX coprocessor:
  1) The manual says that the WMMX external memory bus is 32-bit,
     so it takes at least 2 clock cycles to load a 64-bit value.
  2) There are two 64-bit buffers for load operations, so

     wldrd wr0, [%1]
     wldrd wr1, [%1, #8]

     do not cause a stall (see the 'old' WMMX manual). But 3 or
     more back-to-back loads do.

For WMMX2 coprocessor:
  1) The external memory bus is 64-bit, so one clock cycle should be
     enough to load a 64-bit value. So, if there are no stalls,
     it should be able to load 64 bits on every cycle.
  2) The number of buffers is unknown, but D.3.2.3 of the 'new' WMMX2 manual says
     "supports up to 8 outstanding loads" (which probably means there are 8 such
     buffers). As I understand it, this means that

     wldrd wr0, [%1]
     wldrd wr1, [%1, #8]
     wldrd wr7, [%1, #56]

     do not cause a stall (9 or more do). But, according to the same section,
     a back-to-back load/store:

     wldrd [whatever]
     wstrd [whatever]

     causes a stall.
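
To make the two buffer rules above concrete, here is a toy model in Python.
It is only a sketch of my reading of the manuals, not measured hardware
behaviour: the function name, the table and the exact stall rule are all
hypothetical, and the model says nothing about load/store pairs on WMMX
because nothing above covers that case.

```python
# Toy model of the load-buffer rules described above.
# ASSUMPTION: buffer counts are as read from the manuals in this mail
# (2 for WMMX, "up to 8 outstanding loads" for WMMX2).
LOAD_BUFFERS = {"WMMX": 2, "WMMX2": 8}


def has_stall(core, ops):
    """Return True if the sequence of mnemonics is predicted to stall.

    ops is a list such as ["wldrd", "wldrd", "wstrd"], all back-to-back.
    """
    run = 0      # length of the current run of back-to-back loads
    prev = None  # previous mnemonic
    for op in ops:
        if op == "wldrd":
            run += 1
            if run > LOAD_BUFFERS[core]:
                return True  # more back-to-back loads than buffers
        else:
            # ASSUMPTION: only WMMX2 is modelled as stalling on a
            # store issued right after a load (same manual section).
            if core == "WMMX2" and op == "wstrd" and prev == "wldrd":
                return True
            run = 0
        prev = op
    return False
```

With this model, two loads in a row are fine on both cores, three stall on
WMMX only, nine stall on WMMX2, and wldrd followed by wstrd stalls on WMMX2.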

If all of the above is right, I don't see any unresolved mysteries here. It becomes clear
that pix_sum_iwmmxt2_pipelined() may be faster on a WMMX core, where 4 back-to-back loads are
critical (and even 2 are critical if we take the 32-bit memory bus into account). But it doesn't
matter on WMMX2, since it can a) survive up to 8 back-to-back loads without stalls and
b) load a 64-bit value on every clock cycle.
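
The bus-width half of this argument is plain arithmetic, sketched below in
Python. The helper name is hypothetical, and the model assumes one bus
transfer per clock cycle and ignores result latency and every other source
of stalls, so it gives a lower bound rather than a real cycle count.

```python
def min_transfer_cycles(n_loads, bus_bits, value_bits=64):
    """Lower bound on cycles needed to move n_loads values over the bus.

    ASSUMPTION: one bus transfer of bus_bits per clock cycle, no stalls.
    """
    transfers_per_load = -(-value_bits // bus_bits)  # ceiling division
    return n_loads * transfers_per_load


# Two 64-bit loads over a 32-bit bus (WMMX) vs. a 64-bit bus (WMMX2).
wmmx_cycles = min_transfer_cycles(2, bus_bits=32)
wmmx2_cycles = min_transfer_cycles(2, bus_bits=64)
```

Under these assumptions the two loads need at least 4 cycles on WMMX and
2 on WMMX2, which is why back-to-back loads hurt much more on the old core.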

So, you're assuming that

wldrd wr0, [%1]
wldrd wr1, [%1, #8]

_always_ has a 1-cycle stall. It looks like this is true for WMMX, but not for WMMX2.

I'm definitely interested in obtaining an 'old' WMMX core to check this, BTW.

