[FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON
J. Dekker
jdek at itanimul.li
Wed Feb 28 10:30:02 EET 2024
Martin Storsjö <martin at martin.st> writes:
> On Wed, 28 Feb 2024, J. Dekker wrote:
>
>>
>> Martin Storsjö <martin at martin.st> writes:
>>
>>> On Tue, 27 Feb 2024, J. Dekker wrote:
>>>
>>>> Benched using single-threaded full decode on an Ampere Altra.
>>>>
>>>> Bpp  Before  After   Speedup
>>>> 8    73,3s   65,2s   1.124x
>>>> 10   114,2s  104,0s  1.098x
>>>> 12   125,8s  115,7s  1.087x
>>>>
>>>> Signed-off-by: J. Dekker <jdek at itanimul.li>
>>>> ---
>>>>
>>>> Slightly improved 12bit version.
>>>>
>>>> libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++++++++++++++++++++++
>>>> libavcodec/aarch64/hevcdsp_init_aarch64.c | 18 +
>>>> 2 files changed, 435 insertions(+)
>>>>
>>>> diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> index 8227f65649..581056a91e 100644
>>>> --- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> +++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> @@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
>>>> hevc_v_loop_filter_chroma 8
>>>> hevc_v_loop_filter_chroma 10
>>>> hevc_v_loop_filter_chroma 12
>>>> +
>>>> +.macro hevc_loop_filter_luma_body bitdepth
>>>> +function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
>>>> +.if \bitdepth > 8
>>>> + lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
>>>> +.else
>>>> + uxtl v0.8h, v0.8b
>>>> + uxtl v1.8h, v1.8b
>>>> + uxtl v2.8h, v2.8b
>>>> + uxtl v3.8h, v3.8b
>>>> + uxtl v4.8h, v4.8b
>>>> + uxtl v5.8h, v5.8b
>>>> + uxtl v6.8h, v6.8b
>>>> + uxtl v7.8h, v7.8b
>>>> +.endif
>>>> + ldr w7, [x3] // tc[0]
>>>> + ldr w8, [x3, #4] // tc[1]
>>>> + dup v18.4h, w7
>>>> + dup v19.4h, w8
>>>> + trn1 v18.2d, v18.2d, v19.2d
>>>> +.if \bitdepth > 8
>>>> + shl v18.8h, v18.8h, #(\bitdepth - 8)
>>>> +.endif
>>>> + dup v27.8h, w2 // beta
>>>> + // tc25
>>>> + shl v19.8h, v18.8h, #2 // * 4
>>>> + add v19.8h, v19.8h, v18.8h // (tc * 5)
>>>> + srshr v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
>>>> + sshr v17.8h, v27.8h, #2 // beta2
>>>> +
>>>> + ////// beta_2 check
>>>> + // dp0 = abs(P2 - 2 * P1 + P0)
>>>> + add v22.8h, v3.8h, v1.8h
>>>> + shl v23.8h, v2.8h, #1
>>>> + sabd v30.8h, v22.8h, v23.8h
>>>> + // dq0 = abs(Q2 - 2 * Q1 + Q0)
>>>> + add v21.8h, v6.8h, v4.8h
>>>> + shl v26.8h, v5.8h, #1
>>>> + sabd v31.8h, v21.8h, v26.8h
>>>> + // d0 = dp0 + dq0
>>>> + add v20.8h, v30.8h, v31.8h
>>>> + shl v25.8h, v20.8h, #1
>>>> + // (d0 << 1) < beta_2
>>>> + cmgt v23.8h, v17.8h, v25.8h
>>>> +
>>>> + ////// beta check
>>>> + // d0 + d3 < beta
>>>> + mov x9, #0xFFFF00000000FFFF
>>>> + dup v24.2d, x9
>>>> + and v25.16b, v24.16b, v20.16b
>>>> + addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
>>>> + addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
>>>> + cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
>>>> + mov w9, v25.s[0]
>>>
>>> I don't quite understand what this sequence does and/or how our data is laid
>>> out in our registers - we have d0 on input in v20, where's d3? And doesn't the
>>> "and" throw away half of the input elements here?
>>>
>>> I see some similar patterns with the masking and handling below as well - I get
>>> a feeling that I don't quite understand the algorithm here, and/or the data
>>> layout.
>>
>> We have d0, d1, d2, d3 for both 4 line blocks in v20, mask out d1/d2 and
>> use pair-wise adds to move our data around and calculate d0+d3
>> together. The first addp just moves elements around, the second addp
>> adds d0 + 0 + 0 + d3.
>
> Right, I guess this is the bit that was surprising. I would have expected to
> have e.g. all the d0 values for e.g. the 8 individual pixels in one SIMD
> register, and all the d3 values for all pixels in another SIMD register.
>
> So as we're operating on 8 pixels in parallel, each of those 8 pixels has
> its own d0/d3 values, right? Or is this a case where we have just one d0/d3
> value for a range of pixels?

Yes, there is one set of d0/d1/d2/d3 per block of 4 lines of 8 pixels,
because each value is calculated within its own line: d0 from line 0, d3
from line 3. It may be more confusing because we are doing both halves of
the filter at the same time. v20 contains d0 d1 d2 d3 d0 d1 d2 d3, where
the second group of four belongs to the second block of 4 lines and is
distinct from the first.

Essentially we're doing the same operation across all 8 lines; the filter
just makes an overall skip decision for each block of 4 lines based on the
sum of the results from lines 0 and 3.
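
To make the layout concrete, here is a rough C model of the beta check. It
is only a sketch, not code from the patch: the function names are
hypothetical, d[i] stands for the per-line value held in v20, and the sums
are assumed to fit in a signed 16-bit lane. The first function is the scalar
view of the decision; the second follows the and/addp/addp/cmgt sequence
lane by lane.

```c
#include <stdint.h>

/* Hypothetical helper names for illustration; none of this is FFmpeg API.
 * d[i] is dp + dq for line i (the contents of v20), assumed small enough
 * that d[0] + d[3] fits in a signed 16-bit lane. */

/* Scalar view: one decision per block of 4 lines, from lines 0 and 3. */
static uint32_t beta_mask_scalar(const uint16_t d[8], uint16_t beta)
{
    uint32_t lo = (d[0] + d[3] < beta) ? 0xFFFFu : 0u; /* lines 0..3 */
    uint32_t hi = (d[4] + d[7] < beta) ? 0xFFFFu : 0u; /* lines 4..7 */
    return lo | (hi << 16);
}

/* Lane-by-lane model of the and/addp/addp/cmgt/mov sequence. */
static uint32_t beta_mask_lanes(const uint16_t d[8], uint16_t beta)
{
    /* 0xFFFF00000000FFFF duplicated into both 64-bit halves. */
    static const uint16_t keep[8] = { 0xFFFF, 0, 0, 0xFFFF,
                                      0xFFFF, 0, 0, 0xFFFF };
    uint16_t a[8], p[4], s[2];
    uint32_t lo, hi;
    int i;

    for (i = 0; i < 8; i++)   /* and:      d0 0 0 d3 d0' 0 0 d3' */
        a[i] = d[i] & keep[i];
    for (i = 0; i < 4; i++)   /* addp .8h: d0 d3 d0' d3' (repeated) */
        p[i] = (uint16_t)(a[2 * i] + a[2 * i + 1]);
    for (i = 0; i < 2; i++)   /* addp .4h: d0+d3 d0'+d3' (repeated) */
        s[i] = (uint16_t)(p[2 * i] + p[2 * i + 1]);

    /* cmgt is a signed compare: beta > sum gives an all-ones lane. */
    lo = ((int16_t)beta > (int16_t)s[0]) ? 0xFFFFu : 0u;
    hi = ((int16_t)beta > (int16_t)s[1]) ? 0xFFFFu : 0u;
    return lo | (hi << 16);   /* mov w9, v25.s[0] */
}
```

With d = {1, 100, 100, 2, 10, 0, 0, 10} and beta = 8, both functions return
0x0000FFFF: the first block passes (1 + 2 < 8), the second does not
(10 + 10 >= 8), and the large d1/d2 values play no part in this check.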
--
jd