[FFmpeg-devel] [PATCH] ARM: NEON optimised simple_idct

Mon Aug 25 20:47:16 CEST 2008

Michael Niedermayer <michaelni at gmx.at> writes:

> On Mon, Aug 25, 2008 at 03:53:29PM +0100, Måns Rullgård wrote:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>> 
>> > On Mon, Aug 25, 2008 at 04:06:33AM +0100, Mans Rullgard wrote:
>> >> ---
>> >>  libavcodec/Makefile                  |    2 +
>> >>  libavcodec/armv4l/dsputil_arm.c      |   15 ++
>> >>  libavcodec/armv4l/simple_idct_neon.S |  383 ++++++++++++++++++++++++++++++++++
>> >>  libavcodec/avcodec.h                 |    1 +
>> >>  libavcodec/utils.c                   |    1 +
>> >>  5 files changed, 402 insertions(+), 0 deletions(-)
>> >>  create mode 100644 libavcodec/armv4l/simple_idct_neon.S
>> >> 
>> >
>> > is this idct binary identical in output to the C/MMX simple idct?
>> 
>> Yes.
>> 
>> >> +#ifdef HAVE_NEON
>> >> +        } else if (idct_algo==FF_IDCT_SIMPLENEON){
>> >> +            c->idct_put= ff_simple_idct_put_neon;
>> >> +            c->idct_add= ff_simple_idct_add_neon;
>> >> +            c->idct    = ff_simple_idct_neon;
>> >> +            c->idct_permutation_type = FF_NO_IDCT_PERM;
>> >> +#endif
>> >
>> > I do not know NEON at all, but I've never seen a SIMD instruction set for
>> > which the identity permutation would have been optimal.
>> >
>> > Also, I suspect that the MMX simple idct is a better basis from which to
>> > write other SIMD variants of the simple idct than the C one.
>> 
>> I can't read mmx code.  Could you explain briefly what optimisations
>> are possible with permuted input?  NEON has more and wider registers
>> than mmx, so it is reasonable to expect the optimal code to be quite
>> different.
>
> Sure, but still I think our MMX code (not only the simple idct) contains
> a few tricks that should be applicable to many SIMD instruction sets.
>
> Let's see what I remember about the simple idct:
> 1. It doesn't need any transposes, because it interleaves the elements in
>    a tricky way. This trick depends on the pmaddwd instruction:
>    pmaddwd(int32_t out[], int16_t in0[], int16_t in1[]){
>         /* for each 32-bit output element i */
>         out[i]= in0[2*i+0]*in1[2*i+0]
>                +in0[2*i+1]*in1[2*i+1]
>    }
>    If such an instruction isn't available then that trick isn't usable as is.

There is no such instruction.  There's normal multiply-accumulate and
pairwise add (with optional accumulate).
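
For readers following along, here is a minimal scalar sketch in C of the
multiply-add described above (PMADDWD in Intel's manuals), next to the same
result built from the two primitives I mentioned: a widening multiply followed
by a pairwise add.  The function names are made up for illustration, and the
mapping to actual NEON instructions (something like vmull.s16 followed by
vpadd.i32) is an assumption, not code from the patch.

```c
#include <stdint.h>

/* Scalar model of the quoted multiply-add: each 32-bit output element is
 * the sum of two adjacent signed 16x16 -> 32-bit products.  The sum is
 * formed in unsigned arithmetic so the one overflowing input pair
 * (-32768 * -32768 twice) wraps instead of being undefined behaviour. */
void pmaddwd_model(int32_t out[2], const int16_t in0[4], const int16_t in1[4])
{
    for (int i = 0; i < 2; i++) {
        int32_t p0 = (int32_t)in0[2 * i + 0] * in1[2 * i + 0];
        int32_t p1 = (int32_t)in0[2 * i + 1] * in1[2 * i + 1];
        out[i] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
}

/* Step 1 of the two-instruction alternative: widening multiply,
 * four 16-bit pairs in, four 32-bit products out. */
void widening_mul(int32_t prod[4], const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 4; i++)
        prod[i] = (int32_t)a[i] * b[i];
}

/* Step 2: pairwise add of adjacent 32-bit elements. */
void pairwise_add(int32_t out[2], const int32_t in[4])
{
    for (int i = 0; i < 2; i++)
        out[i] = (int32_t)((uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1]);
}
```

The two paths give the same numbers; the difference is that the second one
costs two operations and an intermediate register where MMX needs one
instruction, which is why the interleaving trick does not carry over as is.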

> Still, it's likely better to use a transposed permutation instead of
> the identity one, as this means one transpose less in a SIMD IDCT.

That idea struck me as well.  I'll try it out.
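
As a rough sketch of what that means (hypothetical helper names, not code
from the patch): a transposed permutation is just an index table that swaps
the row and column of each coefficient in the 8x8 block.  If the decoder
stores coefficients through such a table while unpacking them, the block
reaches the IDCT already transposed and the IDCT can drop one of its own
transposes.

```c
#include <stdint.h>

/* Build a table mapping index (row * 8 + col) to (col * 8 + row). */
void build_transpose_perm(uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        perm[i] = (uint8_t)(((i & 7) << 3) | (i >> 3));
}

/* Store coefficients through the permutation, the way a decoder would
 * while unpacking them, so that dst holds the transposed block. */
void scatter_block(int16_t dst[64], const int16_t src[64],
                   const uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        dst[perm[i]] = src[i];
}
```

With the identity permutation the table would simply be perm[i] = i, which
is what the patch currently selects via FF_NO_IDCT_PERM.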

> 2. Depending on the pattern of nonzero / all-zero rows, one of 8
>    optimized column transforms is used.  This may be a bad idea, though,
>    for a CPU with a small code cache ...
>
> Also, maybe it would make sense to look at i386/idct_sse2_xvid.c,
> which uses SSE2 (128-bit registers). It uses only 16-bit operations
> for the column transform, so it may be faster when the tricks of the simple
> idct aren't applicable.

Do you expect any sane person to be able to read that?  That's also
not bitexact, right?
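
Going back to point 2 for a moment: the first half of that idea, finding
out which rows of the block are entirely zero, can be sketched in C as
below.  The function name is hypothetical, and how the MMX code maps the
resulting pattern onto its 8 column transforms is not described in this
thread, so the sketch stops at computing the mask.

```c
#include <stdint.h>

/* Return an 8-bit mask with bit r set when row r of the 8x8 block
 * contains at least one nonzero coefficient. */
unsigned nonzero_row_mask(const int16_t block[64])
{
    unsigned mask = 0;

    for (int row = 0; row < 8; row++) {
        int acc = 0;
        /* OR all coefficients of the row; the result is zero only if
         * every coefficient is zero. */
        for (int col = 0; col < 8; col++)
            acc |= block[row * 8 + col];
        if (acc)
            mask |= 1u << row;
    }
    return mask;
}
```

A dispatcher would then pick a column transform specialised for the rows
that are known to be zero, at the cost of carrying several variants in the
code cache.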

> also
>
>     Intel 64 and IA-32 Architectures Software Developer's Manual,
>     Volume 2A (and B): Instruction Set Reference
>
> contains very readable and unambiguous explanations of what all the
> MMX and SSE* instructions do, if you ever want to decipher MMX or SSE code

I have those documents, and reading Chinese is easier.

-- 
Måns Rullgård
mans at mansr.com