[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Måns Rullgård mans
Mon Jun 30 17:24:17 CEST 2008

Siarhei Siamashka wrote:
> On Saturday 28 June 2008, M?ns Rullg?rd wrote:
>> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>> > On Saturday 28 June 2008, M?ns Rullg?rd wrote:
>> >> Michael Niedermayer <michaelni at gmx.at> writes:
>> >> > Do we have someone who has a arm cpu and can look into the above
>> >> > issue?
>> >>
>> >> I know exactly why it's different.  In simple_idct.c, the column
>> >> transform contains these lines:
>> >>
>> >>         /* XXX: I did that only to give same values as previous code */
>> >>         a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
>> >>
>> >> It's simpler to code that as a0 = W4 * col[0] + (1 << (COL_SHIFT-1)).
>> >> Thinking about it, it only takes one more instruction on NEON, and
>> >> I've fixed that in my tree.  With a little luck, the extra instruction
>> >> can be dual-issued with something else.
>> >
>> > This part does not have any extra overhead in my finetuned version
>> > of ARMv5TE IDCT:
>> >
>> >   ldr    v1, xxx         /* v1 = (((1<<(COL_SHIFT-1))/W4)*W4) */
>> >   [some unrelated instructions to hide load latency]
>> >   smlatt v2, a2, v4, v1  /* A0t = W4 * (col_t[0] +
>> > ((1<<(COL_SHIFT-1))/W4)) */
>> >
>> > There is no reason why ARMv6 or NEON should have overhead too. So getting
>> > bit-identical results to C simple_idct is possible without sacrificing
>> > performance.
>> ((1<<(COL_SHIFT-1))/W4)*W4 doesn't fit in 16 bits, so that method
>> can't easily be used when everything else is in 16-bit vectors.
> But (1 << (COL_SHIFT-1)) does not fit it either. I don't see any significant
> difference between:
> a0 = W4 * col[0] + (1 << (COL_SHIFT-1));
> and
> a0 = W4 * col[0] + ((1<<(COL_SHIFT-1))/W4)*W4;
> Sure, the first one can be encoded as a constant immediate operand in ARM
> instruction and the second one can't, but that's not a big deal (it can be
> loaded from memory)

Loading from memory is slower than using an immediate operand, and being
wider than 16 bits it can't share a register with other values.

> and is unlikely to be a problem for your NEON code.

I've used one more instruction to get the right value.  I doubt it has
any measurable impact on performance, and I can't be bothered to try
thinking of an alternate solution.

> In any case, you can't get away using only 16-bit values, multiplication
> results should be still accumulated somewhere. The difference between
> these two implementations is just an initial accumulator value. But of
> course, I may be missing something. It would be nice to have a look at your
> NEON code.

NEON multiplies use same-size operands, so a 16-bit vector can't be
multiplied with a 32-bit value.  The accumulator can, however, be

My NEON code is available from git://git.mansr.com/ffmpeg.mru in
branch arm-neon.

>> Is your armv5te idct a total rewrite of what's in svn, or can the
>> changes be broken into sensible steps? If the latter, please send a
>> patch series.
> Do you remember our chat in IRC after you added your implementation of
> ARMv5TE IDCT to SVN? I said that it was a nice improvement over all the
> IDCT available at that point for ARM, but still far from optimal. You
> said that further improvements would be welcome. So here we are.
> I ended up mostly keeping all the code structure (I actually tried
> to modify code to process 1 column per iteration, but processing 2
> columns still was faster). But all the code had to be changed for
> better instructions scheduling and registers allocation.
> It can be artificially split into several patches, replacing all parts
> of your IDCT one at a time, but are these steps really sensible?

If the end result is a complete rewrite, there's little sense in that.
Had there been an overall resemblance, a patch series might have made

> This further optimization of ARMv5TE IDCT is a result of work approved
> and encouraged by you, so you should have some responsibility too ;)
> I wish we had a friendly community of ARM developers here, who would help
> each other. After all, am I the only one here who is interested in getting
> much faster ARMv5TE IDCT for FFmpeg?

I have faster ARM devices to play with now ;-)

> Could you or anybody else having compatible ARM device just do some
> benchmarking to confirm my results (I posted benchmarks here multiple
> times already). It would be a really good help. Because I feel that
> some people here still doubt that it provides a major performance
> improvement.
> In order to do this, you need to:
> 1. Download this source file and drop it over current 'simple_idct_armv5te.S':
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup
> 2. Change MAX_NEG_CROP in 'simple_idct_armv5te.S' to 1024
> 3. Compile FFmpeg and do some benchmarks
> Once/if the performance improvement is confirmed, a help with integration
> would be really needed. That's not a joke, I really fail to see any problems
> with the "balign/ASMALIGN/stack alignment" stuff, so I can't fix them. A good
> example of a solution (a working patch) is very much welcome.
> Does NEON optimized code require more than 32-bit alignment for data and
> stack? If yes, I could just wait a bit and use your NEON code as an example
> how do this stuff in FFmpeg-approved way.

NEON supports both aligned (up to 32-byte) an unaligned memory access.
Aligned is obviously faster.  My NEON IDCT doesn't use the stack at all,
so stack alignment is not an issue there.

>> >> > Ideally would be the authors who claimed the code to be identical to
>> >> > the C code ...
>> >>
>> >> I wrote the ARMv6 version, but I never made any such claim.  In fact,
>> >> I believe I mentioned at the time that there was a slight difference.
>> >>
>> >> > If we have noone then we will likely have to disable these IDCTs. I do
>> >> > not want to create files that turn green and pink unless they are
>> >> > played on an ARM cpu ...
>> >>
>> >> I don't think the ARM CPUs where these apply will be used mostly for
>> >> playback, not encoding, and on those machines every cycle counts.
>> >
>> > Yes, that was one of the reasons why I did not strongly insist on
>> > disabling j_rev_dct_ARM that time (people could get a severe performance
>> > regressions and complain about it) :)
>> >
>> > In any case, ARMv6 idct still needs heavy optimizations, it is not very
>> > fast (on its target devices with ARM11 CPUs of course).
>> Well, it's considerably faster than the C IDCT, but I'm not denying it
>> could be improved.  Are you talking about sparse data handling, or
>> something else?
> It has quite a number of performance problems:
> 1. it's not using 64-bit load instructions (and dual loads are almost as
> useful as dual multiplies for IDCT)

Odd, I can't remember why I didn't use those.

> 2. heavy functions call/ret overhead (ordinary loops are a lot faster)

Why are function calls so slow?

> 3. sparse data handling

Yes, that is a bit lacking.

> It can be clearly seen if you try to benchmark it against optimized
> ARMv5TE IDCT on ARM11, and ARMv6 IDCT fails to provide any significant
> performance improvement (actually performance is roughly the same and
> I can't say which one is faster). Also ARMv5TE optimized IDCT can be
> easily modified to use ARMv6 saturation instructions instead of table
> lookups. That should save at least 64 cycles per _add/_put function
> and I suspect that this hacked ARMv5TE IDCT would easily leap ahead
> and outperform your ARMv6 IDCT.
> We have the following hardware features list:
> "dual loads", "dual multiplies", "saturation"
> Your IDCT uses only "dual multiplies" and "saturation"
> ARMv5TE IDCT uses only "dual loads", but can be trivially modified to use
> "saturation" too

How fast are the v5TE instructions on v6 compared to v6 SIMD instructions?
On Cortex-A8 they are slower than on ARM9, presumably because NEON is
the favoured way of doing things there.

> Optimal IDCT implementation should use them all, we don't have spare CPU
> cycles on embedded devices.


>> Did you have any patches?
> I just know how to fix it :) But we are not finished with ARMv5TE yet...
>> Remind me again what ARM devices you have.  I recently got my hands on
>> a Cortex-A8 (TI OMAP3530), and it's more fun to play around with than
>> the Nokia tablets.
> Right now I only have ARM926EJ-S core in TI OMAP1710 (Nokia 770) and
> ARM1136JF-S in TI OMAP2420 (Nokia N800). But I hope to get this one soon
> (well, once they become available for general public): http://beagleboard.org

That's what I have too, including a Beagleboard rev B.  I hardly touch
the 770, so if anyone wants it, give me an offer.

M?ns Rullg?rd
mans at mansr.com

More information about the ffmpeg-devel mailing list