[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Måns Rullgård mans
Wed Jul 2 12:28:35 CEST 2008

Siarhei Siamashka wrote:
> On Wed, Jul 2, 2008 at 1:00 AM, Måns Rullgård <mans at mansr.com> wrote:
>> "Siarhei Siamashka" <siarhei.siamashka at gmail.com> writes:
> [...]
>>> Typically pre-decrement is always preferred in code optimized for
>>> performance as it is generally faster. Something like this would be
>>> better (also it is closer to the old code):
>>> while( --coeff_count >= 0 ) {
>>> ...
>>> }
>>> You can try to compile this sample with the best possible
>>> optimizations, look at the assembly output and check where the
>>> generated code is better and why:
>>> /**********************/
>>> int q();
>>> void f1(int n)
>>> {
>>>     while (--n >= 0) {
>>>         q();
>>>     }
>>> }
>>> void f2(int n)
>>> {
>>>     while (n--) {
>>>         q();
>>>     }
>>> }
>>> /**********************/
>> Any half-decent compiler should generate the same code for those two
>> functions.
> That's not true, just because these two functions are not identical.
> Hint: what happens if you pass -1 or any other negative value to these
> functions?

Right... I somehow read the second one as while (n-- > 0).  If you
want to compare post- vs. pre-decrement, that is also what you should
be compiling, as otherwise you'll be comparing the speed of doing
different things.
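
To make the point concrete, here is a minimal sketch (not from the
original test; the count_* names and the call counter are made up for
illustration) that counts how often q() runs for each loop form.
The plain while (n--) form is only exercised with n >= 0, since a
negative start value would keep decrementing until signed overflow.

```c
#include <assert.h>

static int calls;               /* hypothetical counter, stands in for real work */
void q(void) { calls++; }

/* Pre-decrement form from f1: runs max(n, 0) times. */
int count_pre(int n)
{
    calls = 0;
    while (--n >= 0)
        q();
    return calls;
}

/* Post-decrement form with the same logic: also runs max(n, 0) times. */
int count_post_gt(int n)
{
    calls = 0;
    while (n-- > 0)
        q();
    return calls;
}

/* Post-decrement form from f2: only equivalent for n >= 0. */
int count_post(int n)
{
    assert(n >= 0);             /* a negative n would not stop at zero */
    calls = 0;
    while (n--)
        q();
    return calls;
}
```

For any n >= 0 all three return n; for negative n the first two return
0, while the third is a different loop altogether.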

>> GCC for ARM generates a slightly different, but equivalent, setup sequence,
>> and the loops are exactly the same.
> In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
> the following assembly output, which is definitely better for 'f1' (3
> instructions in the inner loop instead of 4):
> 00000000 <f1>:
>    0:   e92d4010        stmdb   sp!, {r4, lr}
>    4:   e2504001        subs    r4, r0, #1      ; 0x1
>    8:   48bd8010        ldmmiia sp!, {r4, pc}
>    c:   ebfffffe        bl      0 <q>
>   10:   e2544001        subs    r4, r4, #1      ; 0x1
>   14:   5afffffc        bpl     c <f1+0xc>
>   18:   e8bd8010        ldmia   sp!, {r4, pc}
> 0000001c <f2>:
>   1c:   e92d4010        stmdb   sp!, {r4, lr}
>   20:   e2504001        subs    r4, r0, #1      ; 0x1
>   24:   38bd8010        ldmccia sp!, {r4, pc}
>   28:   e2444001        sub     r4, r4, #1      ; 0x1
>   2c:   ebfffffe        bl      0 <q>
>   30:   e3740001        cmn     r4, #1  ; 0x1
>   34:   1afffffb        bne     28 <q+0x28>
>   38:   e8bd8010        ldmia   sp!, {r4, pc}
> I'm curious, what is the output of your compiler?

I was using CodeSourcery GCC 4.1.2 (the only compiler that works with
NEON) and -O3 -mcpu=cortex-a8.  I'm at work now, so I can't post
the exact output, but the loop bodies were identical in both cases;
only the prologue was different, since (as you pointed out) negative
initial values have different effects.

Since I'm at work, I can try it with the commercial ARM compiler
(only an old version, unfortunately):

00000000 <f1>:
   0:   e92d4010        stmdb   sp!, {r4, lr}
   4:   e1a04000        mov     r4, r0
   8:   ea000000        b       10 <f1+0x10>
   c:   ebfffffe        bl      0 <q>
  10:   e2544001        subs    r4, r4, #1      ; 0x1
  14:   5afffffc        bpl     c <f1+0xc>
  18:   e8bd8010        ldmia   sp!, {r4, pc}

0000001c <f2>:
  1c:   e92d4010        stmdb   sp!, {r4, lr}
  20:   e1a04000        mov     r4, r0
  24:   ea000000        b       2c <f2+0x10>
  28:   ebfffffe        bl      0 <q>
  2c:   e2544001        subs    r4, r4, #1      ; 0x1
  30:   2afffffc        bcs     28 <q+0x28>
  34:   e8bd8010        ldmia   sp!, {r4, pc}

This is different from what gcc does, and the two loops are different.
The speed should, however, be exactly the same.
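
The only difference between the two loops above is the branch condition
after the subs: bpl tests the N flag, bcs tests the C flag.  A rough C
model of that (the subs1 helper and the struct are invented for this
sketch, and only the two flags of interest are modelled) shows that the
conditions agree for non-negative counts and diverge for negative ones:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of ARM "subs rd, rn, #1": result plus N and C flags. */
typedef struct {
    uint32_t result;
    int n_flag;                 /* N: sign bit of the result */
    int c_flag;                 /* C: set when the subtraction did not borrow */
} subs_result;

subs_result subs1(uint32_t rn)
{
    subs_result r;
    r.result = rn - 1u;
    r.n_flag = (int)((r.result >> 31) & 1u);
    r.c_flag = rn >= 1u;
    return r;
}

/* f1 loop: "bpl" branches back while N is clear (--n >= 0). */
int loop_continues_f1(uint32_t rn) { return !subs1(rn).n_flag; }

/* f2 loop: "bcs" branches back while C is set (old n != 0). */
int loop_continues_f2(uint32_t rn) { return subs1(rn).c_flag; }

/* Both conditions agree for every non-negative count in this range. */
void check_equivalence(void)
{
    for (uint32_t n = 0; n <= 1000; n++)
        assert(loop_continues_f1(n) == loop_continues_f2(n));
}
```

With a count of -3 (0xfffffffd) the first condition is false and the
second is true, which is the f1/f2 difference for negative arguments.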

>> I can't be bothered to check x86.
> But I can. For this particular case, the two variants of the loop in
> 'decode_cabac_residual' differ as follows:
> "while( --coeff_count >= 0 ) { ... }"
> ...
>     3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
>     3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
>     302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
>     3030:   0f 89 06 fe ff ff       jns    2e3c <decode_cabac_residual+0x42d>
>     3036:   e9 d3 01 00 00          jmp    320e <decode_cabac_residual+0x7ff>
>     303b:   8b 54 24 08             mov    0x8(%esp),%edx
>     303f:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
> ...
> "while( coeff_count-- ) { ... }"
> ...
>     3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
>     3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
>     302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
>>    3030:   83 7c 24 0c ff          cmpl   $0xffffffff,0xc(%esp)
>     3035:   0f 85 01 fe ff ff       jne    2e3c <decode_cabac_residual+0x42d>
>     303b:   e9 d3 01 00 00          jmp    3213 <decode_cabac_residual+0x804>
>     3040:   8b 54 24 08             mov    0x8(%esp),%edx
>     3044:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
> ...
> The expression 'while( coeff_count-- )' has one extra instruction
> inside of the loop in 'decode_cabac_residual', also increasing the
> size of the function by 5 bytes. The compiler seems to internally
> convert it into 'while( --coeff_count != -1 )', which is less
> efficient.

Stupid compiler.

> Compiled FFmpeg on Pentium-M with gcc 4.2.3 using just './configure &&
> make', let me know if you get different results with other versions of
> gcc or other optimization options.

Try adding a suitable --cpu flag to configure.  In your case, that would
be --cpu=pentium-m.

> Of course, benchmarking with 'dezicycles' can hardly reliably detect
> a difference of just 1 instruction; gcc may also generate different
> code for other parts of the source as a side effect, but those changes
> are unrelated to the "while( coeff_count-- ) { ... }" vs. "while(
> --coeff_count >= 0 ) { ... }" case.

The difference probably comes not from post- vs. pre-decrement being
used, but rather from the fact that the logic is different.  Your point
about benchmarking is of course valid.

Måns Rullgård
mans at mansr.com
