[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions
Sat Oct 13 13:28:22 CEST 2007
Michael Niedermayer wrote:
>> Agreed. However, you trade memory loads/unpacks for potentially worse
>> code parallelism/pairing and size (there are 4 loops unrolled here). I
>> wonder if that'll be a win. I leave that to a later patch.
> you have unrolled the loops in the horizontal direction that also increased
> the code size and instruction pairing is specific to the good old pentium
> it has no relevance today
Figures will put this discussion to rest anyway. For
vc1_put_ver_16b_shift2_mmx, with pmullw used instead of shift+add:
2979 dezicycles in ver, 524174 runs, 114 skips
(compared to ~3300 initially)
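
For readers following along, here is a minimal scalar sketch of the two
forms being compared. It assumes the shift2 case uses the (-1, 9, 9, -1)
taps and leaves out rounding and the final shift; tap_mul and
tap_shift_add are hypothetical names, not functions from the patch. The
multiply form is what pmullw would do per 16-bit lane, the other is the
shift+add sequence it replaces.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of one vertical filter output.
 * src points at row 0; rows -1, 0, +1, +2 are read.
 * Assumed taps: (-1, 9, 9, -1); rounding and final shift omitted. */

/* Multiply form: one multiply by 9 (what pmullw does per lane). */
static int tap_mul(const uint8_t *src, ptrdiff_t stride)
{
    int a = src[-stride], b = src[0], c = src[stride], d = src[2 * stride];
    return 9 * (b + c) - a - d;
}

/* Shift+add form: 9*x computed as (x << 3) + x. */
static int tap_shift_add(const uint8_t *src, ptrdiff_t stride)
{
    int a = src[-stride], b = src[0], c = src[stride], d = src[2 * stride];
    int s = b + c;              /* non-negative, so the shift is well defined */
    return (s << 3) + s - a - d;
}
```

Both forms give the same result; only the instruction mix differs, which
is what the figures above measure.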
Now if, contrary to what your suggestion hinted at, we unroll the
2633 dezicycles in ver, 524208 runs, 80 skips
Is a 2x code size increase worth the ~10% speed-up?
All of this can be tested by checking the "#if 0" block in
vc1_put_ver_16b_shift2_mmx code or, globally, VERT_PIPELINE macro.
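
To make the trade-off concrete, here is a rough scalar sketch of what I
take "pipelining" to mean here: keep the source rows already loaded and
fetch only one new row per output, with the loop unrolled by four so the
rows rotate through the same variables without copies. This is an
assumption about the technique, not the code from the patch; ver_rolled
and ver_pipelined are hypothetical names, the taps are assumed to be
(-1, 9, 9, -1), and rounding and the final shift are left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Rolled form: every output row reloads all four source rows. */
static void ver_rolled(int16_t *dst, const uint8_t *src,
                       ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++) {
        int a = src[(y - 1) * stride], b = src[y * stride];
        int c = src[(y + 1) * stride], d = src[(y + 2) * stride];
        dst[y] = (int16_t)(9 * (b + c) - a - d);
    }
}

/* Pipelined form: one new row is loaded per output. Unrolling by four
 * lets a..d take turns as the newest row. h must be a multiple of 4. */
static void ver_pipelined(int16_t *dst, const uint8_t *src,
                          ptrdiff_t stride, int h)
{
    int a = src[-stride], b = src[0], c = src[stride], d;
    const uint8_t *p = src + 2 * stride;   /* next row to load */

    for (int y = 0; y < h; y += 4) {
        d = p[0];          dst[y]     = (int16_t)(9 * (b + c) - a - d);
        a = p[stride];     dst[y + 1] = (int16_t)(9 * (c + d) - b - a);
        b = p[2 * stride]; dst[y + 2] = (int16_t)(9 * (d + a) - c - b);
        c = p[3 * stride]; dst[y + 3] = (int16_t)(9 * (a + b) - d - c);
        p += 4 * stride;
    }
}
```

The second function has four copies of the filter body, which is where
the code size increase comes from; in exchange each output needs one
load instead of four.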
I also used your suggestion for the stride==offset case in
stride==offset and pipeline (unrolled because simpler to code):
2162 dezicycles in norm_pipe, 262091 runs, 53 skips
2528 dezicycles in norm, 524200 runs, 88 skips
This ~20% speed-up also results in a 2x size increase for the
function. Not unrolling would, I guess, yield ~10% and 1.5x code size.
The attached patch allows testing, verifying and reporting those figures.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 31938 bytes
Desc: not available