[FFmpeg-devel] [PATCH 2/4] avcodec/get_bits: Avoid 2nd bitstream read in GET_VLC() if bits are known at build and small
Andreas Rheinhardt
andreas.rheinhardt at outlook.com
Mon Oct 30 22:49:07 EET 2023
Michael Niedermayer:
> On Fri, Oct 27, 2023 at 05:10:32AM +0200, Andreas Rheinhardt wrote:
>> Michael Niedermayer:
>>> Signed-off-by: Michael Niedermayer <michael at niedermayer.cc>
>>> ---
>>> libavcodec/get_bits.h | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/libavcodec/get_bits.h b/libavcodec/get_bits.h
>>> index cfcf97c021c..86cea00494a 100644
>>> --- a/libavcodec/get_bits.h
>>> +++ b/libavcodec/get_bits.h
>>> @@ -581,8 +581,12 @@ static inline const uint8_t *align_get_bits(GetBitContext *s)
>>> n = table[index].len; \
>>> \
>>> if (max_depth > 1 && n < 0) { \
>>> - LAST_SKIP_BITS(name, gb, bits); \
>>> - UPDATE_CACHE(name, gb); \
>>> + if (av_builtin_constant_p(bits <= MIN_CACHE_BITS/2) && bits <= MIN_CACHE_BITS/2) { \
>>> + SKIP_BITS(name, gb, bits); \
>>> + } else { \
>>> + LAST_SKIP_BITS(name, gb, bits); \
>>> + UPDATE_CACHE(name, gb); \
>>> + } \
>>> \
>>> nb_bits = -n; \
>>> \
>>
>> This is problematic: The GET_VLC macro does not presume that
>> MIN_CACHE_BITS are available; there is code that directly uses GET_VLC
>> instead of get_vlc2().
>>
>> I had the same idea when I made my VLC patchset, yet I wanted to first
>> apply it (which I forgot). While investigating the above issue, I found
>> out that all users of GET_VLC always call UPDATE_CACHE immediately
>> before GET_VLC, so UPDATE_CACHE should be moved into GET_VLC;
>> furthermore, no user of GET_VLC relies on the reloads inside of GET_VLC.
>> The patches for this are here:
>> https://github.com/mkver/FFmpeg/commits/vlc Shall I send them?
>>
>> Notice that making GET_VLC more standalone enables improvements over the
>> current approach; yet it will not lead to optimal code: E.g. the VLCs in
>> decode_alpha_block() in speedhqdec.c are so short that one could read
>> both VLCs with only one UPDATE_CACHE(); another example is mjpegdec.c
>> which currently does this:
>>
>> GET_VLC(code, re, &s->gb, s->vlcs[1][ac_index].table, 9, 2);
>>
>> i += ((unsigned)code) >> 4;
>> code &= 0xf;
>> if (code) {
>> if (code > MIN_CACHE_BITS - 16)
>> UPDATE_CACHE(re, &s->gb);
>>
>> {
>> int cache = GET_CACHE(re, &s->gb);
>> int sign = (~cache) >> 31;
>> level = (NEG_USR32(sign ^ cache,code) ^ sign) - sign;
>> }
>>
>> LAST_SKIP_BITS(re, &s->gb, code);
>>
>> Because of the reloads in GET_VLC, there will always be at least
>> MIN_CACHE_BITS - 9 (= 16) bits available after GET_VLC, so one can read
>> code (<= 15) bits without updating the cache at all (16 in
>> MIN_CACHE_BITS - 16 is the maximum length of a VLC code used here); this
>> will no longer be possible with this optimization.
>> Btw: I am surprised that there is a branch before UPDATE_CACHE instead
>> of an unconditional UPDATE_CACHE. I also do not really see why this uses
>> these macros directly and not the functions.
>>
>> Given my objection to your patch #1, magicyuv will not benefit from
>> this; a different approach (see
>> https://github.com/mkver/FFmpeg/commit/9b5a977957968c0718dea55a5b15f060ef6201dc)
>> is to add a get_vlc() that uses the nb of bits used to create the VLC
>> and a compile-time upper bound for the maximum length of a VLC code as
>> parameters instead of the maximum depth of the VLC.
>>
>
>> Reading VLCs for the cached bitstream reader can btw also be improved:
>> https://github.com/mkver/FFmpeg/commit/fba57506a9cf6be2f4aa5eeee7b10d54729fd92a
>
> speaking of that, i was wondering if the alternatives we had in get_bits.h
> A32_BITSTREAM_READER wouldnt be worth reinvestigating
> especially when extended to 64bit some of these readers might perform better
> There are then just more bits available and fewer reads and fewer mispredicted
> branches for cached ones
>
> It would be somewhat nice if we could avoid having 2 different APIs as we have
> now with the cached and normal reader.
> Also the normal one with 64bit would be interresting, more available bits
> so fewer reads
I already did something like that in
b0fb8e82dde375efb8d5602f6cd479f210c1e93c. But be aware that the most
important reason we read more often than we necessarily have to is not
the number of bits in the cache, but that we stopped using the macros
(like SKIP_CACHE, SKIP_COUNTER) directly and used functions which are
more standalone (and IMO also more readable).
Furthermore, creating replacements of these macros is complicated for
cached bitstream readers, because of the implicit overread checks. Here
is the important part of read_nz:
if (n > bc->bits_valid) {
if (BS_FUNC(priv_refill_32)(bc) < 0)
bc->bits_valid = n;
}
return BS_FUNC(priv_val_get)(bc, n);
priv_val_get presumes there to be enough valid bits available; otherwise
bits_valid wraps around.
This does not mean that there is no hope for avoiding unnecessary reads:
We could add an unsigned get_bits_cached(GetBitContext *gb, int n)
(basically a get_bits(), but the valid bits are the most significant
bits (for BE)) and get_bits_from_cache(unsigned *cache, int n)
(basically, SHOW_UBITS+SKIP_CACHE (or priv_val_get for the cached API)
and that could then be used as follows:
unsigned cache = get_bits_cached(gb, 4 * 4);
s->mpeg_f_code[0][0] = get_bits_from_cache(&cache, 4);
s->mpeg_f_code[0][1] = get_bits_from_cache(&cache, 4);
s->mpeg_f_code[1][0] = get_bits_from_cache(&cache, 4);
s->mpeg_f_code[1][1] = get_bits_from_cache(&cache, 4);
This works because get_bits_cached() would already remove the specified
number of bits from the bitstream; but this also shows its restrictions:
It is only usable when the number of bits to be read subsequently is
known in advance.
Another potential API is to add a show_bits_cached() that is the same as
show_bits(), but returns the valid bits in the most significant bits
(for BE). This could then be complemented with a get_bits_from_cache()
and in the end one needs to skip the number of bits actually consumed.
The latter would of course need to validate/sanitize the number of bits
consumed for both the cached and uncached reader. For the cached reader,
doing so incurs a check that is avoided in case one uses the
get_bits_cached() API outlined above (of course, this only applies to
the scenarios where one can actually use said API).
>
> also i was wondering about a vlc reader thats entirely free of conditional
> branches. Just a loop that in each iteration would step by 0-n symbols
> forward and update a pointer to which table to use next
Doesn't this have the downside that short symbols need as many
iterations as the longest one?
> but i dont think i will have the time to try to implement this.
>
> I have alot more ideas than i have time to try sadly, if you wan/can/or did
> try any of above that would be interresting
>
More information about the ffmpeg-devel
mailing list