[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)

Michael Niedermayer michaelni
Sat Sep 15 01:16:06 CEST 2007


On Sat, Sep 15, 2007 at 01:06:17AM +0300, Siarhei Siamashka wrote:
> On 14 September 2007, Michael Niedermayer wrote:
> 
> [...]
> 
> > > +        smlabb v4, a3, v6, v1          /* v4 = v1 - W2*row[2] */
> > > +        smlabb v3, a4, v6, v1          /* v3 = v1 - W6*row[2] */
> > > +        smlatb v2, a4, v6, v1          /* v2 = v1 + W6*row[2] */
> > > +        smlatb v1, a3, v6, v1          /* v1 = v1 + W2*row[2] */
> >
> > [---]
> >
> > > +        smlabb v4, a4, v8, v4          /* v4 -= W6*row[6] */
> > > +        smlatb v3, a3, v8, v3          /* v3 += W2*row[6] */
> > > +        smlabb v2, a3, v8, v2          /* v2 -= W2*row[6] */
> > > +        smlatb v1, a4, v8, v1          /* v1 += W6*row[6] */
> > > +        ldrd   a3, w1357idct_rows_armv5te /* a3 = W1 | (W3 << 16) */
> > > +                                       /* a4 = W5 | (W7 << 16) */
> >
> > [---]
> >
> > > +        smlatb v4, a2, v7, v4          /* v4 += W4*row[4] */
> > > +        smlabb v3, a2, v7, v3          /* v3 -= W4*row[4] */
> > > +        smlabb v2, a2, v7, v2          /* v2 -= W4*row[4] */
> > > +        smlatb v1, a2, v7, v1          /* v1 += W4*row[4] */
> >
> > i think this can be implemented in fewer instructions, something based on:
> >
> > v2 = v1 - W4*row[4]
> > v1 = v1 + W4*row[4]
> >
> > v3 = v2 - W6*row[2]
> > v4 = v1 - W2*row[2]
> >
> > v3 += W2*row[6]
> > v4 -= W6*row[6]
> >
> > v2 = 2*v2 - v3
> > v1 = 2*v1 - v4
> 
> Took a close look at it. That really should do the job (each statement mapping
> to one instruction), so we can save a whole 4 cycles thanks to it. Though I'm
> a bit worried about possible overflows because of the *2 multiplication in the
> last two statements, so this code would not be completely identical to the C
> implementation of simple_idct in some extreme cases of input data. Should we
> assume some sane restrictions on input data for regression testing?

as long as the operations are normal ANSI-C two's complement style (=not some
weird useless saturation stuff) the code is identical to yours, no matter how
large the input values are

you could say that any overflows would always be canceled by other overflows

you can look in your favorite math book (or wikipedia) about rings and modular
arithmetic ...
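
The cancellation argument can be checked with a small C sketch. This is only an
illustration under stated assumptions: the weights W2, W4, W6 are placeholder
values, the names even_part, mul, even_direct, even_reduced and
even_forms_agree are hypothetical, and uint32_t is used to model wraparound
modulo 2^32 because signed overflow is not defined behaviour in C. The direct
form mirrors the twelve multiply-accumulates of the quoted patch and the
reduced form mirrors the eight statements suggested above.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder fixed-point weights (assumed for illustration only). */
enum { W2 = 21407, W4 = 16383, W6 = 8867 };

typedef struct { uint32_t v1, v2, v3, v4; } even_part;

/* Signed 16x16 product, reduced modulo 2^32 by the conversion to uint32_t. */
static uint32_t mul(int w, int16_t x) { return (uint32_t)(w * (int)x); }

/* Direct form: every output accumulates all three products. */
static even_part even_direct(uint32_t v0, int16_t r2, int16_t r4, int16_t r6)
{
    even_part e;
    e.v1 = v0 + mul(W2, r2) + mul(W6, r6) + mul(W4, r4);
    e.v2 = v0 + mul(W6, r2) - mul(W2, r6) - mul(W4, r4);
    e.v3 = v0 - mul(W6, r2) + mul(W2, r6) - mul(W4, r4);
    e.v4 = v0 - mul(W2, r2) - mul(W6, r6) + mul(W4, r4);
    return e;
}

/* Reduced form: v3 and v4 are built first, v2 and v1 follow by reflection. */
static even_part even_reduced(uint32_t v0, int16_t r2, int16_t r4, int16_t r6)
{
    even_part e;
    e.v2 = v0 - mul(W4, r4);
    e.v1 = v0 + mul(W4, r4);
    e.v3 = e.v2 - mul(W6, r2);
    e.v4 = e.v1 - mul(W2, r2);
    e.v3 += mul(W2, r6);
    e.v4 -= mul(W6, r6);
    e.v2 = 2u * e.v2 - e.v3;   /* the doubling may wrap, and so does v3 */
    e.v1 = 2u * e.v1 - e.v4;
    return e;
}

/* Returns 1 when both forms give bit-identical results. */
static int even_forms_agree(uint32_t v0, int16_t r2, int16_t r4, int16_t r6)
{
    even_part a = even_direct(v0, r2, r4, r6);
    even_part b = even_reduced(v0, r2, r4, r6);
    return a.v1 == b.v1 && a.v2 == b.v2 && a.v3 == b.v3 && a.v4 == b.v4;
}

/* Convenience wrapper that aborts on a mismatch. */
static void assert_forms_agree(uint32_t v0, int16_t r2, int16_t r4, int16_t r6)
{
    assert(even_forms_agree(v0, r2, r4, r6));
}
```

Calling assert_forms_agree(0x7fffffffu, 32767, -32768, 32767) exercises a case
where the intermediate 2*v1 wraps; the two forms should still agree, since both
compute the same polynomial in the ring of integers modulo 2^32.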

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are best at talking, realize last or never when they are wrong.