[FFmpeg-devel] [PATCH] MMX implementation of VC-1 inverse transforms
Fri Jan 18 11:05:04 CET 2008
On Wed, 16 Jan 2008, Ivan Kalvachev wrote:
> On Jan 14, 2008 10:33 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
>> A transposed scantable and a column/transpose/column
>> transform is faster than a row/column transform for iDCT and iHCT; I
>> have no reason to doubt that this applies to VC1's transform as well.
> Is there some theoretical explanation of this statement?
Because with very few exceptions, x86 SIMD instructions operate
element-wise on a pair of registers, not on pairs of values within one
register. Furthermore, any DCT more complex than a brute-force matrix
multiply won't perform the same operation on all coefficients at every
step. So in a row transform, even after you shuffle things around so that
you can operate on the right pairs of coefficients (which takes actual
shuffle instructions, whereas a column transform just uses different
register names), some of the arithmetic will be wasted.
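
To make the argument concrete, here is a minimal sketch in plain C. It is
not code from ffmpeg or x264. The vec4 type, the helper names and the
4-point Hadamard-style butterfly are all assumptions chosen for
illustration: vec4 stands in for one SIMD register, and the butterfly
stands in for a real iDCT. The column pass only combines whole registers,
the transpose is the single place that needs shuffles, and loading the
block transposed plays the role of the transposed scantable.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for one SIMD register with four lanes. */
typedef struct { int lane[4]; } vec4;

/* Element-wise add/sub on a pair of "registers" (think paddw/psubw). */
static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
static vec4 vsub(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* Column pass: r[i] holds row i of the block. The butterfly combines
 * whole registers, so every lane does useful work and no shuffle is
 * needed; picking operands is just picking register names. */
static void column_pass(vec4 r[4]) {
    vec4 s0 = vadd(r[0], r[2]), d0 = vsub(r[0], r[2]);
    vec4 s1 = vadd(r[1], r[3]), d1 = vsub(r[1], r[3]);
    r[0] = vadd(s0, s1);
    r[1] = vadd(d0, d1);
    r[2] = vsub(d0, d1);
    r[3] = vsub(s0, s1);
}

/* Transpose: the only step that would need real shuffle instructions. */
static void transpose(vec4 r[4]) {
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++) {
            int t = r[i].lane[j];
            r[i].lane[j] = r[j].lane[i];
            r[j].lane[i] = t;
        }
}

/* Scalar version of the same butterfly, for the reference transform. */
static void butterfly4(int v[4]) {
    int s0 = v[0] + v[2], d0 = v[0] - v[2];
    int s1 = v[1] + v[3], d1 = v[1] - v[3];
    v[0] = s0 + s1; v[1] = d0 + d1; v[2] = d0 - d1; v[3] = s0 - s1;
}

/* Reference row/column transform: butterfly along rows, then columns. */
static void row_column_ref(int b[4][4]) {
    for (int y = 0; y < 4; y++) butterfly4(b[y]);
    for (int x = 0; x < 4; x++) {
        int col[4];
        for (int y = 0; y < 4; y++) col[y] = b[y][x];
        butterfly4(col);
        for (int y = 0; y < 4; y++) b[y][x] = col[y];
    }
}

/* Returns 1 if column/transpose/column on the transposed input matches
 * the row/column reference for a sample block. */
int schedules_agree(void) {
    int block[4][4] = { {5, -3, 2, 0}, {7, 1, 0, 0},
                        {-4, 0, 0, 0}, {1, 0, 0, 0} };
    int ref[4][4];
    vec4 r[4];

    memcpy(ref, block, sizeof ref);
    row_column_ref(ref);

    /* Load transposed, as a transposed scantable would deliver it. */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) r[y].lane[x] = block[x][y];

    column_pass(r);
    transpose(r);
    column_pass(r);

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (r[y].lane[x] != ref[y][x]) return 0;
    return 1;
}

/* Small self-check; call it from a test driver. */
void self_test(void) { assert(schedules_agree()); }
```

A real row pass would have to run butterfly4 on values that sit inside a
single register, which is where the extra shuffles and idle lanes come
from.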
> I'm sure you have actually tested both cases and I really want to peek
> at the h264 code that works without transpose, if you still keep it
There never was a row/column h264 idct in ffmpeg, but you can look at
x264_add8x8_idct8_mmx, which was changed from row/column to
column/transpose/column in x264 r463.
> Loren, can you make simple_mmx even faster? (you would write it
> quicker than I could possibly write h264 inverse transform without
I'll post a patch once it's cleaned and generalized. As of now it's x86_64
ssse3 only, and twice as fast as simple_mmx. I'll have to see how much of
that speed depends on pmulhrsw and the extra xmmregs.
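
For readers unfamiliar with pmulhrsw, here is a scalar C model of what the
instruction computes per 16-bit lane: a signed multiply that keeps the
product scaled down by 2^15, rounded to nearest. This is only a sketch of
the instruction's arithmetic, not part of the patch mentioned above; the
function names are made up for illustration, and it assumes the compiler
implements >> on negative values as an arithmetic shift (gcc and clang
do).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one pmulhrsw lane:
 * result = (((a * b) >> 14) + 1) >> 1, i.e. a Q15 multiply with
 * round-to-nearest. Assumes arithmetic right shift of negatives. */
int16_t pmulhrsw_lane(int16_t a, int16_t b) {
    int32_t t = ((int32_t)a * (int32_t)b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* Small self-check: 16384 is 0.5 in Q15, so 0.5 * 3 = 1.5 rounds to 2
 * and 0.5 * 5 = 2.5 rounds to 3. */
void pmulhrsw_self_test(void) {
    assert(pmulhrsw_lane(16384, 3) == 2);
    assert(pmulhrsw_lane(16384, 5) == 3);
}
```

Getting the rounding inside the multiply is presumably what saves
instructions when scaling coefficients by fixed-point constants, but how
much of the speedup it accounts for is exactly the open question above.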