[FFmpeg-devel] [PATCH] avcodec/golomb: Mask shift amount before use in get_ue_golomb()

Andreas Cadhalpun andreas.cadhalpun at googlemail.com
Fri Dec 11 22:14:32 CET 2015

On 07.12.2015 00:27, Ganesh Ajjanagadde wrote:
> On Sun, Dec 6, 2015 at 6:12 PM, Andreas Cadhalpun
> <andreas.cadhalpun at googlemail.com> wrote:
>> On 06.12.2015 22:48, Michael Niedermayer wrote:
>>> my concern is more on h264 (CAVLC) and hevc speed
>> I tested with 444_8bit_cavlc.h264, concatenated 100 times with the concat demuxer,
>> and couldn't see a measurable speed difference caused by this error check.
> Ok, so here is my understanding of the situation.
> I think both of you are right, but have different perspectives on this.
> So in practice a log < 7 may usually be predicted correctly, and the
> compiler in all likelihood will continue to inline the function. Thus,
> excepting the first run, there should not be an issue, and maybe the
> compiler even feeds in the "likely" information for the first run
> also.
> Nevertheless, I also understand Michael's perspective: h264 is
> arguably one of the most important codecs as supplied by FFmpeg. Even
> a 0.01% speed loss in some place should be accepted only with extreme caution,
> since over time these may accumulate to something more sizable, say
> 0.5%. There should be a very good justification for it, and thus
> Michael spends effort trying to ensure that the security issue is
> fixed at identical asm.

I wouldn't call this a security issue, it's just undefined behavior.

> I personally agree with Michael here due to h264's importance (note:
> this is modulo Michael's fix being correct, which I can't comment
> upon).

Michael's proposed patch is more of a 'disable the warning' than a
'fix the warning': the problematic condition can still occur, but no
warning is emitted for it anymore.

> I would have thought differently 6 months back, but with more
> work with FFmpeg, I have shifted my views.
> To understand this better, see e.g. commit b7fb7c45 by me: I got
> rid of undefined behavior so as to not trigger ubsan stuff, and
> substituted with implementation defined behavior at zero cost that
> should be consistent across platforms in practice. Yes, one could
> insert a branch, but there is minimal utility: anyone feeding such
> extreme values is fuzzing or placing an irregular file in any case.

In contrast to the change of b7fb7c45, this undefined behavior is not
only triggered by fuzzed samples, but even by a FATE sample.
Silencing it has the disadvantage that it is no longer noticeable
when a correct sample triggers this condition.

> Also, to understand Michael's views better, see e.g
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel/2015-March/170611.html.
> Even superficially "cosmetic" effects can have a cost. See also commit
> by me: 68e79b27a5ed7. Even small things can matter.
> Really these things should be done with START/STOP timers since the
> change is in a tight inner construct.

OK, so I did test with START/STOP timers in get_ue_golomb, one for the
first branch (A) and one for the second (B).

For h264 I used 444_8bit_cavlc.h264 again, but concatenated 1000 times.

With the check in the B branch:
    976 decicycles in get_ue_golomb B,    2047 runs,      1 skips   
    747 decicycles in get_ue_golomb B,    1024 runs,      0 skips
    337 decicycles in get_ue_golomb A,16777054 runs,    162 skips   

Without the check:
    922 decicycles in get_ue_golomb B,    2046 runs,      2 skips   
   2309 decicycles in get_ue_golomb B,    1022 runs,      2 skips
    341 decicycles in get_ue_golomb A,16777036 runs,    180 skips   

(I shortened this, as the runs with low counts are not significant
anyway. Also, don't be confused by the B run count apparently
decreasing: get_ue_golomb gets inlined in several places, so there
are several counters. However, the A counter with the second-most
runs has a factor of 1000 fewer runs, so it is basically irrelevant.)

This shows multiple things:
 * The B branch is not executed often enough to get reproducible timings.
 * The A branch is executed about 5000 times more often than the B branch.
   (That's what I meant by 'less likely branch'.)

I also checked the cavs decoder, using cavs.mpg concatenated 100 times.

With the check in the B branch:
    629 decicycles in get_ue_golomb B, 4194260 runs,     44 skips   
    433 decicycles in get_ue_golomb A,268434102 runs,   1354 skips

Without the check:
    624 decicycles in get_ue_golomb B, 4194273 runs,     31 skips
    433 decicycles in get_ue_golomb A,268434203 runs,   1253 skips  

This shows:
 * The B branch gets about 0.8% slower with the check.
 * The A branch is executed about 70 times more often than the B branch.
 * So on average, get_ue_golomb gets about 0.8%/70 = 0.01% slower for cavs.
 * The B branch is already about 50% slower than the A branch.

Extrapolating this to the h264 decoder, it would get slower by
about 0.8%/5000 = 0.00016%, which is negligible.
And that's only the runtime of get_ue_golomb; the decoder also does
other things besides calling this function.

Thus I still think that adding the error check here has practically
no speed cost and is thus the better alternative.

Best regards,
Andreas
