[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

Rafal Dabrowa fatwildcat at gmail.com
Sun Nov 19 16:43:14 EET 2017


On 11/18/2017 07:41 PM, James Almer wrote:
> On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
>>>
>>>
>>> On 18 November 2017 at 17:35, Rafal Dabrowa <fatwildcat at gmail.com> wrote:
>>>
>>> This is a proposal of performance optimizations for 8-bit
>>> hevc video decoding on the aarch64 platform with the neon (simd) extension.
>>>
>>> I'm testing my optimizations on a NanoPi M3 device. I'm mainly using
>>> the "Big Buck Bunny" video file in 1280x720 format for testing.
>>> The video file was pulled from the libde265.org page, see
>>> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
>>> The movie duration is 00:10:34.53.
>>>
>>> Overall performance gain is about 2x. Without optimizations, movie
>>> playback effectively stalls after a few seconds. With
>>> optimizations the file plays smoothly 99% of the time.
>>>
>>> For performance testing the following command was used:
>>>
>>>      time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>>>
>>> The video file was pre-read before test to minimize disk reads during
>>> testing.
>>> Program execution time without optimization was as follows:
>>>
>>> real    11m48.576s
>>> user    43m8.111s
>>> sys     0m12.469s
>>>
>>> Execution time with optimizations:
>>>
>>> real    6m17.046s
>>> user    21m19.792s
>>> sys     0m14.724s
>>>
>>>
>>> The patch contains optimizations for the most heavily used qpel, epel, sao
>>> and idct functions.  Among the functions provided for optimization there
>>> are two intensively used, but not optimized in this patch:
>>> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8. I have no idea
>>> how they could be optimized, hence I left them without optimizations.
>>>
>>>
>>>
>>> Signed-off-by: Rafal Dabrowa <fatwildcat at gmail.com>
>>> ---
>>>   libavcodec/aarch64/Makefile               |    5 +
>>>   libavcodec/aarch64/hevcdsp_epel_8.S       | 3949 ++++++++++++++++++++
>>>   libavcodec/aarch64/hevcdsp_idct_8.S       | 1980 ++++++++++
>>>   libavcodec/aarch64/hevcdsp_init_aarch64.c |  170 +
>>>   libavcodec/aarch64/hevcdsp_qpel_8.S       | 5666 +++++++++++++++++++++++++++++
>>>   libavcodec/aarch64/hevcdsp_sao_8.S        |  166 +
>>>   libavcodec/hevcdsp.c                      |    2 +
>>>   libavcodec/hevcdsp.h                      |    1 +
>>>   8 files changed, 11939 insertions(+)
>>>   create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
>>>   create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
>>>   create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>>>   create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
>>>   create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S
>>
>>
>> Very nice.
>> The way we test SIMD is to put START_TIMER; and
>> STOP_TIMER("function_name"); (they're located in libavutil/timer.h) around where the
>> function gets called in the C code, then we do a run with the C code (no
>> SIMD) and a separate run with whatever SIMD optimizations we're
>> implementing. We take the last printed value of both runs and that's what's
>> used to measure speedup.
>>
>> I don't think there's a need yet to split the patch into multiple patches for
>> each individual version though, that's usually only done if some
>> function's C implementation is faster than the SIMD code.
> It would be nice however to at least split it into two patches, one for
> MC and one for SAO.
Could you explain which functions are MC?

I can split the patch into a few, but a dependency between the patches
is unavoidable because the non-optimized function pointers are all
replaced with the optimized ones together, in one function body.
One of the patches must add that function and the call to it.
>
> Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
> lot to add, and I'm sure a sizable portion is duplicated with only some
> small differences between functions.
I used macros sparingly because code without macros is
easier to understand and to improve. Sometimes even the order
of assembly instructions is important. But, of course, I can reduce
the code size using macros if the patch is going to be accepted. I didn't
know whether you would be interested in the patch at all.


Regarding performance testing: I wrapped every function with another
one, which calls START_TIMER and STOP_TIMER. It looks like these macros
aren't reentrant, so I needed to force the program to run in a single
thread. Without this I had strange results, differing greatly between
runs, for example:

22190 UNITS in put_hevc_qpel_uni_h12_8,   16232 runs,    152 skips
1126 UNITS in put_hevc_qpel_uni_h12_8,   12001 runs,   4383 skips

Forcing single-threaded mode was not easy; the -filter_threads
option didn't help.
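
To illustrate where the "runs" and "skips" figures come from, here is a
minimal Python sketch of the bookkeeping I believe STOP_TIMER performs.
The thresholds (first two samples always accepted, outliers rejected when
they are at least 8x the running mean and at least 2000 units, a report
printed whenever runs + skips is a power of two, the printed figure being
10x the mean) are my recollection of libavutil/timer.h and should be
treated as assumptions. The class name TimerModel and the sample values
are made up for the illustration.

```python
class TimerModel:
    """Toy model of the per-call-site counters kept by STOP_TIMER (assumed logic)."""

    def __init__(self, name):
        self.name = name
        self.tsum = 0    # sum of accepted samples
        self.runs = 0    # number of accepted samples
        self.skips = 0   # number of rejected outliers

    def add(self, tdiff):
        # Accept the first two samples, anything below 8x the running mean,
        # and anything below 2000 units; everything else counts as a skip.
        if self.runs < 2 or tdiff < 8 * self.tsum // self.runs or tdiff < 2000:
            self.tsum += tdiff
            self.runs += 1
        else:
            self.skips += 1
        total = self.runs + self.skips
        # Report only when the total number of calls is a power of two.
        if total & (total - 1) == 0:
            return "%d UNITS in %s, %d runs, %d skips" % (
                self.tsum * 10 // self.runs, self.name, self.runs, self.skips)
        return None


timer = TimerModel("demo")
samples = [3000, 3100, 2900, 50000, 3050, 3000, 2950, 3000]  # invented values
reports = [r for r in (timer.add(s) for s in samples) if r]
for line in reports:
    print(line)
```

In this model the three counters are shared by every caller of the same
call site. If the real macro keeps them in static variables, as I assume,
concurrent decoder threads would update them without any locking, which
would be consistent with the unstable results shown above.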

Below is the outcome. Meaning of the columns:

FUNCTION - the function to optimize
UNITS_NOOPT - last UNITS result in run without optimization
OPT - last UNITS result in run with optimization
CALLS - sum of runs and skips
NSKIPS - number of skips in non-optimized version
OSKIPS - number of skips in optimized version


FUNCTION                 UNITS_NOOPT      OPT     CALLS   NSKIPS OSKIPS
-------------------------------------------------------------------------
idct_16x16_8                  113074    24079   2097152        0      0
idct_32x32_8                  587447   100434    524288        0      0
put_hevc_epel_bi_h4_8           7651     3654    524288      177   1857
put_hevc_epel_bi_h6_8          18377     6668     32768        0      0
put_hevc_epel_bi_h8_8          20644     6698   1048576       34   1298
put_hevc_epel_bi_h12_8         62927    18968     16384        0      0
put_hevc_epel_bi_h16_8         78601    21254    524288        0      4
put_hevc_epel_bi_h24_8        231004    53800      4096        0      0
put_hevc_epel_bi_h32_8        294058    63302    524288        0      0
put_hevc_epel_bi_hv4_8         13183     6264   2097152       67   3057
put_hevc_epel_bi_hv6_8         27672    12706    131072        0      0
put_hevc_epel_bi_hv8_8         31908    11184   2097152        4   1688
put_hevc_epel_bi_hv12_8        86370    29497     65536        0      0
put_hevc_epel_bi_hv16_8       104623    30717   1048576        0      3
put_hevc_epel_bi_hv24_8       302361    80610      8192        0      0
put_hevc_epel_bi_hv32_8       376614    92475   1048576        0      0
put_hevc_epel_bi_v4_8           7290     3368   2097152      338   4444
put_hevc_epel_bi_v6_8          19306     8423     65536        0      0
put_hevc_epel_bi_v8_8          20431     5795   2097152       12   2252
put_hevc_epel_bi_v12_8         61368    21050     16384        0      0
put_hevc_epel_bi_v16_8         74351    17655   1048576        0      9
put_hevc_epel_bi_v24_8        226914    51601      4096        0      0
put_hevc_epel_bi_v32_8        285476    55184   1048576        0      0
put_hevc_epel_h4_8              5826     3362    524288      667   2619
put_hevc_epel_h6_8             12852     5912     32768        0      0
put_hevc_epel_h8_8             13847     6009   1048576      237   1504
put_hevc_epel_h12_8            44210    17185     16384        0      0
put_hevc_epel_h16_8            53502    18642    524288        0      5
put_hevc_epel_h24_8           157030    48086      4096        0      0
put_hevc_epel_h32_8           193877    54837    524288        0      0
put_hevc_epel_hv4_8            11031     6379   2097152      316   1886
put_hevc_epel_hv6_8            23233    12730    131072        0      0
put_hevc_epel_hv8_8            25406    10989   2097152       21   1471
put_hevc_epel_hv12_8           70139    28821     65536        0      0
put_hevc_epel_hv16_8           81318    30190   1048576        0      4
put_hevc_epel_hv24_8          230829    75079     16384        0      0
put_hevc_epel_hv32_8          285945    92143   1048576        0      0
put_hevc_epel_uni_hv4_8        13255     7571   2097152      142    582
put_hevc_epel_uni_hv6_8        29279    14637    131072        0      0
put_hevc_epel_uni_hv8_8        31783    14114   1048576        0     26
put_hevc_epel_uni_hv12_8       85576    31757     32768        0      0
put_hevc_epel_uni_hv16_8       90346    29886    524288        0      0
put_hevc_epel_uni_hv24_8      281864    76862      1024        0      0
put_hevc_epel_uni_hv32_8      322135    91541     65536        0      0
put_hevc_epel_uni_v4_8          6826     3785   2097152      494   3496
put_hevc_epel_uni_v6_8         20113    10093     32768        0      0
put_hevc_epel_uni_v8_8         18883     6444   1048576        7    448
put_hevc_epel_uni_v12_8        59989    23523      8192        0      0
put_hevc_epel_uni_v16_8        63740    18096    262144        0      0
put_hevc_epel_uni_v24_8       208109    48880       512        0      0
put_hevc_epel_uni_v32_8       249717    50660    262144        0      0
put_hevc_epel_v4_8              5834     3056   2097152      970   5422
put_hevc_epel_v6_8             15541     8900     65536        0      0
put_hevc_epel_v8_8             14549     5476   2097152      296   3129
put_hevc_epel_v12_8            48518    22362     32768        0      0
put_hevc_epel_v16_8            53909    16483   1048576        0     23
put_hevc_epel_v24_8           166783    43662      4096        0      0
put_hevc_epel_v32_8           210650    47112   1048576        0      0
put_hevc_pel_bi_pixels4_8       4751     2923   2097152     7381   9232
put_hevc_pel_bi_pixels6_8      11774     5689     65536        0      0
put_hevc_pel_bi_pixels8_8      12269     4165   4194304     2298  12731
put_hevc_pel_bi_pixels12_8     36260    14031     65536        0      0
put_hevc_pel_bi_pixels16_8     42718    10421   4194304       21   3881
put_hevc_pel_bi_pixels24_8    137480    38423     32768        0      0
put_hevc_pel_bi_pixels32_8    172166    43996   8388608        0      3
put_hevc_pel_bi_pixels48_8    520118   133238      4096        0      0
put_hevc_pel_bi_pixels64_8    671892   173615   4194304        0      0
put_hevc_pel_pixels4_8          3859     3139   1048576     8926   9478
put_hevc_pel_pixels6_8          8453     6566     32768        0      0
put_hevc_pel_pixels8_8          7144     3093   4194304     4802  30239
put_hevc_pel_pixels12_8        25096    16648     65536        0      0
put_hevc_pel_pixels16_8        25472     9538   2097152      790   3094
put_hevc_pel_pixels24_8        93108    42948     32768        0      0
put_hevc_pel_pixels32_8       100331    37550   8388608        0      2
put_hevc_pel_pixels48_8       321258   137835      4096        0      0
put_hevc_pel_pixels64_8       387236   152538   4194304        0      0
put_hevc_qpel_bi_h4_8          34054    20498     16384        0      0
put_hevc_qpel_bi_h8_8          34264    10873    524288        0    801
put_hevc_qpel_bi_h12_8         85199    22938     16384        0      0
put_hevc_qpel_bi_h16_8        107035    20526    524288        0    488
put_hevc_qpel_bi_h24_8        323233    66440     16384        0      0
put_hevc_qpel_bi_h32_8        415699    76073    262144        0      0
put_hevc_qpel_bi_h48_8       1282990   246145      2048        0      0
put_hevc_qpel_bi_h64_8       1664853   260382    262144        0      0
put_hevc_qpel_bi_hv4_8         56239    31221     32768        0      0
put_hevc_qpel_bi_hv8_8         63859    21595   1048576        0     63
put_hevc_qpel_bi_hv12_8       143173    58139     65536        0      0
put_hevc_qpel_bi_hv16_8       184410    40468   1048576        0     15
put_hevc_qpel_bi_hv24_8       509364   134833     32768        0      0
put_hevc_qpel_bi_hv32_8       647015   125581    524288        0      0
put_hevc_qpel_bi_hv48_8      1929283   385204      4096        0      0
put_hevc_qpel_bi_hv64_8      2416442   430161    524288        0      0
put_hevc_qpel_bi_v4_8          37454    22461     32768        0      0
put_hevc_qpel_bi_v8_8          34500     9218   1048576        0   1291
put_hevc_qpel_bi_v12_8         87403    31659     32768        0      0
put_hevc_qpel_bi_v16_8        106589    19326   1048576        0    971
put_hevc_qpel_bi_v24_8        332644    78044     16384        0      0
put_hevc_qpel_bi_v32_8        405835    73886    524288        0      0
put_hevc_qpel_bi_v48_8       1266494   217496      2048        0      0
put_hevc_qpel_bi_v64_8       1677771   259481    524288        0      0
put_hevc_qpel_h4_8             29542    16982     16384        0      0
put_hevc_qpel_h8_8             26710    10452    524288        5    558
put_hevc_qpel_h12_8            67708    22021     16384        0      0
put_hevc_qpel_h16_8            81849    18637    524288        0    560
put_hevc_qpel_h24_8           258384    62392     16384        0      0
put_hevc_qpel_h32_8           321281    68451    262144        0      0
put_hevc_qpel_h48_8           984759   219657      2048        0      0
put_hevc_qpel_h64_8          1224717   227914    262144        0      0
put_hevc_qpel_hv4_8            51764    32150     32768        0      0
put_hevc_qpel_hv8_8            56369    21627   1048576        0     73
put_hevc_qpel_hv12_8          125191    48671     65536        0      0
put_hevc_qpel_hv16_8          159288    40749   1048576        0     10
put_hevc_qpel_hv24_8          438656   131331     32768        0      0
put_hevc_qpel_hv32_8          551607   121954    524288        0      0
put_hevc_qpel_hv48_8         1627266   397656      4096        0      0
put_hevc_qpel_hv64_8         2016176   414765    524288        0      0
put_hevc_qpel_uni_h4_8         21301    13384    131072        0      0
put_hevc_qpel_uni_h8_8         30057    11010    524288        7    486
put_hevc_qpel_uni_h12_8        84804    25790     16384        0      0
put_hevc_qpel_uni_h16_8        95333    24267    262144        0     17
put_hevc_qpel_uni_h24_8       318029    76951      4096        0      0
put_hevc_qpel_uni_h32_8       356799    72279     65536        0      0
put_hevc_qpel_uni_h48_8      1181308   237731       128        0      0
put_hevc_qpel_uni_h64_8      1401262   231221     16384        0      0
put_hevc_qpel_uni_hv4_8        39439    22837    262144        0      1
put_hevc_qpel_uni_hv8_8        60380    23283   1048576        0     77
put_hevc_qpel_uni_hv12_8      146759    56280     32768        0      0
put_hevc_qpel_uni_hv16_8      173329    45131    524288        0      2
put_hevc_qpel_uni_hv24_8      505434   139999     16384        0      0
put_hevc_qpel_uni_hv32_8      561402   120361    131072        0      0
put_hevc_qpel_uni_hv48_8     1854753   361780       256        0      0
put_hevc_qpel_uni_hv64_8     2142627   404073     32768        0      0
put_hevc_qpel_uni_v4_8         23081    12550    262144        0      0
put_hevc_qpel_uni_v8_8         30075     9971   1048576        5    511
put_hevc_qpel_uni_v12_8        89427    38025     16384        0      0
put_hevc_qpel_uni_v16_8        96131    21727    524288        0     23
put_hevc_qpel_uni_v24_8       328019    90689      8192        0      0
put_hevc_qpel_uni_v32_8       358340    71396    131072        0      0
put_hevc_qpel_uni_v48_8      1164812   176367       256        0      0
put_hevc_qpel_uni_v64_8      1464856   232866     32768        0      0
put_hevc_qpel_v4_8             31732    19999     32768        0      0
put_hevc_qpel_v8_8             25311     8967   1048576       10   1142
put_hevc_qpel_v12_8            67764    29917     32768        0      0
put_hevc_qpel_v16_8            78023    18260   1048576        0    819
put_hevc_qpel_v24_8           254724    75185     16384        0      0
put_hevc_qpel_v32_8           305639    69130    524288        0      0
put_hevc_qpel_v48_8           892900   240703      2048        0      0
put_hevc_qpel_v64_8          1149597   221632    524288        0      0
sao_edge_filter_8             600074    91811    524288        0      0
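
The per-function speedup is the UNITS_NOOPT column divided by the OPT
column. A small Python sketch of that arithmetic, using four rows copied
from the table above (the helper name speedup is only for illustration):

```python
# (function, UNITS_NOOPT, OPT) copied from a few rows of the table
rows = [
    ("idct_16x16_8", 113074, 24079),
    ("put_hevc_pel_pixels4_8", 3859, 3139),
    ("put_hevc_qpel_bi_hv64_8", 2416442, 430161),
    ("sao_edge_filter_8", 600074, 91811),
]


def speedup(noopt, opt):
    """Ratio of the non-optimized to the optimized timer value."""
    return noopt / opt


for name, noopt, opt in rows:
    print("%-26s %6.2fx" % (name, speedup(noopt, opt)))
```

The ratio ignores the skip counts, so rows with many skips (mostly the
4- and 8-pixel-wide functions) should be read with more caution than the
others.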


