[FFmpeg-devel] [PATCH] MMX implementation of VC-1 inverse transforms

Kostya kostya.shishkov
Wed Jan 16 06:18:11 CET 2008


On Wed, Jan 16, 2008 at 04:05:54AM +0200, Ivan Kalvachev wrote:
> On Jan 14, 2008 10:33 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
> > On Mon, 14 Jan 2008, Ivan Kalvachev wrote:
> >
> > > - Why did you choose to transpose at all? Just to save time and effort?
> > > It is usual to have separate versions of the SIMD code depending on
> > > whether they work on rows or columns. The row and column stages differ,
> > > and you pass the differences as parameters.
> >
> > Who says it's usual?
> 
> It is usual because all IDCT functions in i386/dsp do it.
> These include libmpeg2, xvid, and simple_mmx. The only IDCT that mentions
> transpose is vp3, but it also has separate col/row code.
> 
> Keep in mind that I do not deny that mmx idct transforms use different
> permutations to get the coefficients in the order they like, an order that
> makes the first pass easier and eventually lands the intermediate
> results in an order that doesn't require an additional transformation on or
> after the second (col) pass.
> 
> > A transposed scantable and a column/transpose/column
> > transform is faster than a row/column transform for iDCT and iHCT; I have
> > no reason to doubt that applies to VC1's transform as well.
> 
> Is there some theoretical explanation of this statement?
> 
> I'm sure you have actually tested both cases and I really want to peek
> at the h264 code that works without transpose, if you still keep it
> around.
> 
> On the other side, I'm not sure what you mean by row/column transform.
> The usual operation of the above-mentioned IDCTs is scantable/row/column.

It means a special version of the transform for rows, instead of transposing
and applying the column version to the rows as well (which become columns
after transposing).
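
A minimal sketch of that equivalence, written in Python for brevity rather
than MMX. The 4-point basis values, the shifts (3 and 7) and all function
names are assumptions made for illustration; none of them is taken from the
patch:

```python
# Illustrative 4-point basis (an assumption for this sketch).
BASIS = [
    [17, 17, 17, 17],
    [22, 10, -10, -22],
    [17, -17, -17, 17],
    [10, -22, 22, -10],
]


def transform_1d(vec, shift):
    """Apply the illustrative 4-point transform to one vector, with rounding."""
    rnd = 1 << (shift - 1)
    return [
        (sum(BASIS[k][n] * vec[k] for k in range(4)) + rnd) >> shift
        for n in range(4)
    ]


def transpose(block):
    """Swap rows and columns of a 4x4 block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def row_pass(block, shift):
    """Dedicated row version: transform every row in place."""
    return [transform_1d(row, shift) for row in block]


def col_pass(block, shift):
    """Column version: transform every column."""
    return transpose([transform_1d(col, shift) for col in transpose(block)])


def row_then_column(block):
    """Special row code followed by column code."""
    return col_pass(row_pass(block, 3), 7)


def column_transpose_column(block_t):
    """Column code only; expects the input already transposed,
    e.g. by a transposed scantable."""
    return col_pass(transpose(col_pass(block_t, 3)), 7)


if __name__ == "__main__":
    block = [[100, -20, 5, 0], [-30, 8, 0, 0], [4, 0, 0, 0], [0, 0, 0, 0]]
    same = row_then_column(block) == column_transpose_column(transpose(block))
    print("both orderings agree:", same)
```

Only the shift differs between the two stages here, so a single column
routine taking the shift as a parameter covers both passes; the cost moves
into the transpose in the middle.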

[...]
> 
> On Jan 14, 2008 11:01 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Mon, Jan 14, 2008 at 09:12:40PM +0100, Christophe GISQUET wrote:
> > [...]
> > > > - Am I wrong, or do you do all the math in 16-bit signed saturation mode?
> > > > According to the vc1 draft, in the first stage the input is in the range
> > > > [-2048;2047] and the multiply constants are in the range [-16;16]; this
> > > > gives a range of [-32768;32768] per multiply, and you can have 8 of them.
> > > > Or multiply constants in the range [-22;22], which gives a range of
> > > > [-45056;45056] per multiply, and you can have 4 of them.
> >
> > you are missing a detail here
> > 45056 >> 3 would be > 4096, thus possibly violating the limit for the 2nd
> > stage input. Still, with the 512 limit of the output and the >>7 before it,
> > it does not look like the naive implementation will work with 16bit
> 
> It's not my fault that M$ cannot do math ;)
> The draft actually says that the intermediate result has to be
> saturated to that range. So it is possible that the C variant also
> doesn't work according to the specs. I wonder what the reference
> source does?

Matrix multiplications, of course.
That was the first thing I optimized in order to get a faster reference decoder.
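
The range figures quoted above can be checked with plain integer arithmetic.
A small sketch, assuming the input range [-2048;2047] and the constants 16
and 22 from the quoted text; the helper names are made up for this example:

```python
INT16_MIN, INT16_MAX = -32768, 32767
STAGE1_INPUT = (-2048, 2047)  # first-stage input range quoted from the draft
STAGE2_INPUT = (-4096, 4095)  # second-stage input range quoted in the thread


def product_range(const, lo, hi):
    """Smallest and largest value of c*x for c in {-const, const}, x in {lo, hi}."""
    candidates = [c * x for c in (-const, const) for x in (lo, hi)]
    return min(candidates), max(candidates)


def fits(value_range, limits):
    """True if the whole value_range lies inside limits (both inclusive)."""
    return limits[0] <= value_range[0] and value_range[1] <= limits[1]


if __name__ == "__main__":
    for const in (16, 22):
        rng = product_range(const, *STAGE1_INPUT)
        shifted = (rng[0] >> 3, rng[1] >> 3)
        print("constant", const)
        print("  single product range:", rng)
        print("  fits signed 16 bit:  ", fits(rng, (INT16_MIN, INT16_MAX)))
        print("  after >> 3:          ", shifted)
        print("  fits 2nd stage input:", fits(shifted, STAGE2_INPUT))
```

This only bounds a single product; a sum of several such products needs
correspondingly more headroom.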

> > > > In the second phase the input range is doubled to [-4096,4095]
> > > >
> > > > Are you sure your transforms produce the same result as their _c equivalents?
> > >
> > > I did test bit exactness (with win32 dll output), albeit on only a
> > > few sequences. Everything was perfect.
> > >
> > > The reference I found said it could be done with 16-bit maths. Maybe it
> > > needs a bias to correct, but as the output is usually in the range
> > > [-128;127], it's pretty symmetrical. However, it would indeed be better
> > > if a proof could be given.
> >
> > there's a difference between "can be done" and "it works with the naive
> > implementation"
> >
> > as random example:
> >
> > naive:
> > (22*X+17*Y) >> 3 will not work with 16bit when X and/or Y = 2048
> >
> > alternative:
> > ((X + ((X + (Y>>1))>>1))>>1) + 2*(X+Y) should work fine
> >
> >
> > there are of course many intermediate variants
> > the key point to keep in mind is that (2*x + y)>>1 == x + (y>>1)
> 
> I wonder if there is some collection of nasty mmx tricks, like the above one.

That is general arithmetic ;)
And I encountered a lot of nasty SIMD tricks in truemotion 2 code
(which is written in plain C).
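
The rearrangement quoted above can be checked numerically. The sketch below
models 16-bit registers with wrapping arithmetic, which is an assumption of
this example (as are all the function names), and relies on Python's >>
rounding toward minus infinity like an arithmetic shift:

```python
def wrap_int16(v):
    """Reduce v to the signed 16-bit range, as a wrapping 16-bit operation would."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000


def exact(x, y):
    """(22*X + 17*Y) >> 3 with unlimited precision."""
    return (22 * x + 17 * y) >> 3


def naive_int16(x, y):
    """Same formula, but every intermediate is truncated to 16 bits."""
    prod_x = wrap_int16(22 * x)
    prod_y = wrap_int16(17 * y)
    return wrap_int16(prod_x + prod_y) >> 3


def rearranged_int16(x, y):
    """((X + ((X + (Y>>1))>>1))>>1) + 2*(X+Y), every intermediate in 16 bits."""
    t = wrap_int16(x + (y >> 1))
    t = wrap_int16(x + (t >> 1))
    t = t >> 1
    twice_sum = wrap_int16(2 * wrap_int16(x + y))
    return wrap_int16(t + twice_sum)


def shift_identity_holds(x, y):
    """The key point from the thread: (2*x + y)>>1 == x + (y>>1)."""
    return (2 * x + y) >> 1 == x + (y >> 1)


if __name__ == "__main__":
    samples = range(-2048, 2048, 63)  # includes both -2048 and 2047
    print("identity holds:",
          all(shift_identity_holds(x, y) for x in samples for y in samples))
    print("rearranged matches exact:",
          all(rearranged_int16(x, y) == exact(x, y)
              for x in samples for y in samples))
    print("naive matches exact at 2047, 2047:",
          naive_int16(2047, 2047) == exact(2047, 2047))
```

The rearranged form never holds more than about 2.75*X + 2.125*Y in a
register, while the naive form needs 22*X on its own before the shift.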



