[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Sebastian Vater cdgs.basty
Thu Apr 29 15:36:07 CEST 2010


Sebastian Vater wrote:
> Måns Rullgård wrote:
>   
>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>
>>   
>>     
>>> I just had an idea: we can get rid of the GetBitContext
>>> completely. Instead of reading 4 bits, we simply read a byte:
>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb, 4);
>>>     
>>>       
>> That's a separate thing.
>>   
>>     
>
> Separate in what way? What did you mean exactly?
>
>   
>>> Then we unroll the loop by 8 and do two lut accesses, one with >> 4
>>> and one with & 0x0F. Or we even get rid of that and create a lut table
>>> with 256 entries, using AV_WN64A / AV_RN64A ;-)
>>>
>>> The advantage here is that on a 64-bit CPU we get another nice speed
>>> improvement ;-)
>>> That is, if we avoid calculations with AV_RN64A etc.
>>>     
>>>       
>> Those macros don't do any calculations.  All they do is some magic to
>> avoid type aliasing errors.
>>   
>>     
>
> Yes, I know, but I meant stuff like ((uint64_t)lut0[...] << 32) | lut1[...];
>
> But this isn't necessary if we use a table indexed by the whole byte (256 entries) storing uint64_t values...
>
>   
>>   
>>     
>>> gcc should just use 2 registers on a 32-bit CPU and that's it.
>>>     
>>>       
>> Should, but doesn't.
>>   
>>     
>
> With the approach I described above, it should. I'll test that now,
> though without a completed table, and tell you what it does.
>
>   
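A minimal sketch of the first idea in the quoted exchange (read one byte instead of calling get_bits(gb, 4) twice, then do two lookups with `>> 4` and `& 0x0F`). It assumes a hypothetical per-plane table `nibble_lut` of 16 `uint32_t` entries, each spreading a nibble into bit `plane` of 4 pixel bytes, with the most significant bit as the leftmost pixel. `memcpy` stands in for FFmpeg's read/write macros so the code builds on its own; this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: nibble_lut[plane][n] holds 4 pixel bytes in memory
 * order; byte i gets bit `plane` if bit (3 - i) of n is set, because the
 * most significant bit is the leftmost pixel. */
static uint32_t nibble_lut[8][16];

static void init_nibble_lut(void)
{
    for (unsigned plane = 0; plane < 8; plane++)
        for (unsigned n = 0; n < 16; n++) {
            uint8_t px[4];
            for (unsigned i = 0; i < 4; i++)
                px[i] = (uint8_t)(((n >> (3 - i)) & 1u) << plane);
            memcpy(&nibble_lut[plane][n], px, sizeof px);
        }
}

/* OR one bitplane row (buf_size bytes = 8 * buf_size pixels) into dst,
 * reading a whole byte per step instead of two 4-bit reads. */
static void decodeplane8_nibble(uint8_t *dst, const uint8_t *buf,
                                unsigned buf_size, unsigned plane)
{
    assert(plane < 8);
    const uint32_t *lut = nibble_lut[plane];
    for (unsigned x = 0; x < buf_size; x++, dst += 8) {
        const uint8_t b = buf[x];
        uint32_t left, right;
        memcpy(&left, dst, 4);
        memcpy(&right, dst + 4, 4);
        left  |= lut[b >> 4];    /* high nibble: pixels 0..3 */
        right |= lut[b & 0x0F];  /* low nibble: pixels 4..7 */
        memcpy(dst, &left, 4);
        memcpy(dst + 4, &right, 4);
    }
}
```

Because each 32-bit entry is built byte by byte and copied with `memcpy`, the pixel order is the same on little- and big-endian hosts; combining the two halves into one `uint64_t` with a shift instead would make the result depend on endianness.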
Damn, that's fucking amazing!!!!

I just ran 2 benchmarks: one with the old patch in 32-bit mode, and one
that gets rid of GetBitContext and uses AV_RN64A etc.
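As noted in the quoted exchange, AV_RN64A/AV_WN64A do no arithmetic: they read or write a native-endian 64-bit value without breaking C's type-aliasing rules (FFmpeg does this with its own mechanisms, e.g. unions marked with GCC's may_alias attribute where available). Outside FFmpeg, a common portable equivalent is `memcpy`, which compilers typically turn into plain loads and stores. The helper names `rn64`, `wn64` and `or8` below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The helpers below always move exactly 8 bytes. */
static_assert(sizeof(uint64_t) == 8, "rn64/wn64 assume 8-byte uint64_t");

/* Hypothetical native-endian 64-bit load via memcpy: well-defined for any
 * underlying object type, unlike casting the pointer to uint64_t *. */
static inline uint64_t rn64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical native-endian 64-bit store via memcpy. */
static inline void wn64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}

/* OR an 8-pixel pattern into dst: the loop body of decodeplane8. */
static inline void or8(uint8_t *dst, uint64_t pattern)
{
    wn64(dst, rn64(dst) | pattern);
}
```

Since both the pattern and the pixels go through the same native byte order, the OR acts byte by byte exactly as the per-pixel code would.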

Please note that I'm in my office, so these benchmarks were not run on
the Athlon XP 2100+ but on a Pentium 4 (which is what we have here).

Benchmark for the old patch (32-bit mode):

basty at euler:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
  built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  configuration:
  libavutil     50.14. 0 / 50.14. 0
  libavcodec    52.66. 0 / 52.66. 0
  libavformat   52.61. 0 / 52.61. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libswscale     0.10. 0 /  0.10. 0
[IFF @ 0x8b2feb0]Estimating duration from bitrate, this may be inaccurate
Input #0, IFF, from '../patches/MRLake.iff':
  Duration: N/A, bitrate: N/A
    Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
737:700, 90k tbr, 90k tbn, 90k tbc
25720 dezicycles in decodeplane8, 1 runs, 0 skips
21520 dezicycles in decodeplane8, 2 runs, 0 skips
19260 dezicycles in decodeplane8, 4 runs, 0 skips
17675 dezicycles in decodeplane8, 8 runs, 0 skips
16652 dezicycles in decodeplane8, 16 runs, 0 skips
16006 dezicycles in decodeplane8, 32 runs, 0 skips
15623 dezicycles in decodeplane8, 64 runs, 0 skips
15503 dezicycles in decodeplane8, 128 runs, 0 skips
15573 dezicycles in decodeplane8, 256 runs, 0 skips
15440 dezicycles in decodeplane8, 512 runs, 0 skips
15496 dezicycles in decodeplane8, 1024 runs, 0 skips
15422 dezicycles in decodeplane8, 2047 runs, 1 skips
15395 dezicycles in decodeplane8, 4095 runs, 1 skips
   2.44 A-V:  0.000 s:0.0 aq=    0KB vq=    0KB sq=    0B f=0/0   0/0


Benchmark for the 64-bit patch, using the following code:
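static void decodeplane8(uint8_t *dst,
                         const uint8_t *buf,
                         const unsigned buf_size,
                         const unsigned bps,   /* unused in this version */
                         const unsigned plane)
{
    START_TIMER;
    /* each input byte holds this plane's bit for 8 consecutive pixels */
    const uint8_t *end = dst + (buf_size * 8);
    /* per-plane table indexed by a whole input byte */
    const uint64_t *lut = plane8_lut[plane];
    for(; dst < end; dst += 8) {
        /* OR this plane's contribution into 8 pixels at once */
        const uint64_t v  = AV_RN64A(dst) | lut[*buf++];
        AV_WN64A(dst, v);
    }
    STOP_TIMER("decodeplane8");
}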
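The snippet does not show `plane8_lut`, and the text notes the table was not yet complete for this test. A minimal sketch of how a full 8 x 256 table of `uint64_t` could be generated, assuming the usual ILBM bit order (most significant bit = leftmost pixel) and that plane p supplies bit p of each pixel index. The table reuses the name from the benchmark for readability, but its layout here is an assumption; `plane8_lut_init`, `decodeplane8_lut` and `decodeplane8_bits` are hypothetical, and `memcpy` replaces AV_RN64A/AV_WN64A so the code builds standalone.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: one 256-entry table of 8-pixel patterns per plane. */
static uint64_t plane8_lut[8][256];

/* Entry [plane][b]: 8 bytes in memory order, byte i = bit (7 - i) of b,
 * moved to bit position `plane`. */
static void plane8_lut_init(void)
{
    for (unsigned plane = 0; plane < 8; plane++)
        for (unsigned b = 0; b < 256; b++) {
            uint8_t px[8];
            for (unsigned i = 0; i < 8; i++)
                px[i] = (uint8_t)(((b >> (7 - i)) & 1u) << plane);
            memcpy(&plane8_lut[plane][b], px, sizeof px);
        }
}

/* Table-driven decode with the same structure as decodeplane8 above. */
static void decodeplane8_lut(uint8_t *dst, const uint8_t *buf,
                             unsigned buf_size, unsigned plane)
{
    assert(plane < 8);
    const uint8_t *end = dst + buf_size * 8;
    const uint64_t *lut = plane8_lut[plane];
    for (; dst < end; dst += 8) {
        uint64_t v;
        memcpy(&v, dst, 8);
        v |= lut[*buf++];
        memcpy(dst, &v, 8);
    }
}

/* Reference version: one bit at a time, as a bit-reader loop would. */
static void decodeplane8_bits(uint8_t *dst, const uint8_t *buf,
                              unsigned buf_size, unsigned plane)
{
    for (unsigned x = 0; x < buf_size * 8; x++)
        if (buf[x >> 3] & (0x80 >> (x & 7)))
            dst[x] |= (uint8_t)(1u << plane);
}
```

The table costs 8 x 256 x 8 = 16 KiB, which is the price paid for replacing per-bit work with one load, one OR and one store per 8 pixels.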

basty at euler:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
  built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  configuration:
  libavutil     50.14. 0 / 50.14. 0
  libavcodec    52.66. 0 / 52.66. 0
  libavformat   52.61. 0 / 52.61. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libswscale     0.10. 0 /  0.10. 0
[IFF @ 0x8b30eb0]Estimating duration from bitrate, this may be inaccurate
Input #0, IFF, from '../patches/MRLake.iff':
  Duration: N/A, bitrate: N/A
    Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
737:700, 90k tbr, 90k tbn, 90k tbc
65280 dezicycles in decodeplane8, 1 runs, 0 skips
40420 dezicycles in decodeplane8, 2 runs, 0 skips
26430 dezicycles in decodeplane8, 4 runs, 0 skips
19030 dezicycles in decodeplane8, 8 runs, 0 skips
13897 dezicycles in decodeplane8, 16 runs, 0 skips
11361 dezicycles in decodeplane8, 32 runs, 0 skips
10090 dezicycles in decodeplane8, 64 runs, 0 skips
9600 dezicycles in decodeplane8, 128 runs, 0 skips
9390 dezicycles in decodeplane8, 256 runs, 0 skips
9114 dezicycles in decodeplane8, 512 runs, 0 skips
9063 dezicycles in decodeplane8, 1024 runs, 0 skips
9081 dezicycles in decodeplane8, 2048 runs, 0 skips
9176 dezicycles in decodeplane8, 4096 runs, 0 skips
   2.72 A-V:  0.000 s:0.0 aq=    0KB vq=    0KB sq=    0B f=0/0   0/0
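A rough reading of the two logs: STOP_TIMER prints ten times the running average of the cycle counts it kept (outliers are counted as skips), so the last lines correspond to about 1539.5 cycles per call for the old patch and about 917.6 for the 64-bit table version, a speedup of roughly 1.68x on this Pentium 4. Assuming each call decodes one bitplane row of the 737-pixel-wide image, padded to a 16-bit boundary as ILBM requires, that row is ((737 + 15) / 16) * 2 = 94 bytes. The helper names below are hypothetical; the inputs are the final figures from each log.

```c
#include <assert.h>
#include <stdio.h>

/* ILBM bitplane row size: width rounded up to a multiple of 16 pixels,
 * 2 bytes per 16 pixels. */
static unsigned ilbm_row_bytes(unsigned width)
{
    return ((width + 15) >> 4) << 1;
}

/* STOP_TIMER reports ten times the average cycle count ("dezicycles"). */
static double cycles_from_dezicycles(unsigned long dezicycles)
{
    return dezicycles / 10.0;
}

static void report(unsigned width, unsigned long old_dz, unsigned long new_dz)
{
    const unsigned bytes = ilbm_row_bytes(width);
    const double old_c = cycles_from_dezicycles(old_dz);
    const double new_c = cycles_from_dezicycles(new_dz);
    assert(bytes > 0 && new_c > 0);
    printf("row: %u bytes (%u pixels)\n", bytes, bytes * 8);
    printf("old: %.1f cycles/call, %.2f cycles/byte\n", old_c, old_c / bytes);
    printf("new: %.1f cycles/call, %.2f cycles/byte\n", new_c, new_c / bytes);
    printf("speedup: %.2fx\n", old_c / new_c);
}
```

Calling `report(737, 15395, 9176)` gives 94 bytes per row, about 16.38 cycles per input byte for the old patch versus about 9.76 for the table version, and a 1.68x speedup. The per-byte figures only hold if the assumption about one row per call is right.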

Disassembly of the inlined decodeplane8 from the 64-bit patch on x86_32:
     532:       8b 54 24 40             mov    0x40(%esp),%edx
     536:       c7 44 24 4c 00 00 00    movl   $0x0,0x4c(%esp)
     53d:       00
     53e:       8b 8a cc 00 00 00       mov    0xcc(%edx),%ecx
     544:       0f 31                   rdtsc
     546:       89 54 24 60             mov    %edx,0x60(%esp)
     54a:       8b 5c 24 60             mov    0x60(%esp),%ebx
     54e:       31 d2                   xor    %edx,%edx
     550:       c7 44 24 64 00 00 00    movl   $0x0,0x64(%esp)
     557:       00
     558:       8b 74 24 64             mov    0x64(%esp),%esi
     55c:       89 de                   mov    %ebx,%esi
     55e:       bb 00 00 00 00          mov    $0x0,%ebx
     563:       89 5c 24 60             mov    %ebx,0x60(%esp)
     567:       01 44 24 60             add    %eax,0x60(%esp)
     56b:       8b 44 24 44             mov    0x44(%esp),%eax
     56f:       89 74 24 64             mov    %esi,0x64(%esp)
     573:       11 54 24 64             adc    %edx,0x64(%esp)
     577:       2b 84 24 9c 00 00 00    sub    0x9c(%esp),%eax
     57e:       39 c8                   cmp    %ecx,%eax
     580:       76 02                   jbe    584 <decode_frame_ilbm+0x2d4>
     582:       89 c8                   mov    %ecx,%eax
     584:       8b 74 24 50             mov    0x50(%esp),%esi
     588:       8d 04 c6                lea    (%esi,%eax,8),%eax
     58b:       89 44 24 5c             mov    %eax,0x5c(%esp)
     58f:       8b 44 24 4c             mov    0x4c(%esp),%eax
     593:       c1 e0 07                shl    $0x7,%eax
     596:       05 00 00 00 00          add    $0x0,%eax
     59b:       89 44 24 58             mov    %eax,0x58(%esp)
     59f:       8b 44 24 5c             mov    0x5c(%esp),%eax
     5a3:       39 c6                   cmp    %eax,%esi
     5a5:       73 30                   jae    5d7 <decode_frame_ilbm+0x327>
     5a7:       8b ac 24 9c 00 00 00    mov    0x9c(%esp),%ebp
     5ae:       89 f7                   mov    %esi,%edi
     5b0:       0f b6 75 00             movzbl 0x0(%ebp),%esi
     5b4:       83 c5 01                add    $0x1,%ebp
     5b7:       8b 4c 24 58             mov    0x58(%esp),%ecx
     5bb:       8b 5f 04                mov    0x4(%edi),%ebx
     5be:       8b 07                   mov    (%edi),%eax
     5c0:       8b 54 f1 04             mov    0x4(%ecx,%esi,8),%edx
     5c4:       0b 04 f1                or     (%ecx,%esi,8),%eax
     5c7:       09 da                   or     %ebx,%edx
     5c9:       89 07                   mov    %eax,(%edi)
     5cb:       89 57 04                mov    %edx,0x4(%edi)
     5ce:       83 c7 08                add    $0x8,%edi
     5d1:       39 7c 24 5c             cmp    %edi,0x5c(%esp)
     5d5:       77 d9                   ja     5b0 <decode_frame_ilbm+0x300>
     5d7:       0f 31                   rdtsc
     5d9:       8b 1d 04 00 00 00       mov    0x4,%ebx
     5df:       89 d7                   mov    %edx,%edi
     5e1:       31 ed                   xor    %ebp,%ebp
     5e3:       89 fd                   mov    %edi,%ebp
     5e5:       bf 00 00 00 00          mov    $0x0,%edi
     5ea:       31 d2                   xor    %edx,%edx
     5ec:       01 c7                   add    %eax,%edi
     5ee:       11 d5                   adc    %edx,%ebp
     5f0:       89 5c 24 38             mov    %ebx,0x38(%esp)
     5f4:       83 eb 01                sub    $0x1,%ebx
     5f7:       0f 8e d9 00 00 00       jle    6d6 <decode_frame_ilbm+0x426>
     5fd:       89 f8                   mov    %edi,%eax
     5ff:       8b 1d 08 00 00 00       mov    0x8,%ebx
     605:       89 ea                   mov    %ebp,%edx
     607:       2b 44 24 60             sub    0x60(%esp),%eax
     60b:       8b 35 0c 00 00 00       mov    0xc,%esi
     611:       1b 54 24 64             sbb    0x64(%esp),%edx
     615:       89 44 24 68             mov    %eax,0x68(%esp)
     619:       8b 44 24 38             mov    0x38(%esp),%eax
     61d:       89 54 24 6c             mov    %edx,0x6c(%esp)
     621:       89 f1                   mov    %esi,%ecx
     623:       89 da                   mov    %ebx,%edx
     625:       0f a4 d1 03             shld   $0x3,%edx,%ecx
     629:       c1 e2 03                shl    $0x3,%edx
     62c:       89 14 24                mov    %edx,(%esp)
     62f:       89 c2                   mov    %eax,%edx
     631:       c1 fa 1f                sar    $0x1f,%edx
     634:       89 4c 24 04             mov    %ecx,0x4(%esp)
     638:       89 44 24 08             mov    %eax,0x8(%esp)
     63c:       89 54 24 0c             mov    %edx,0xc(%esp)
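In the loop from 0x5b0 to 0x5d5, gcc splits each 64-bit operation into two 32-bit halves, as hoped for in the quoted exchange: one byte load from `buf`, two loads from `dst`, two ORs against the table at `(%ecx,%esi,8)` and `4(%ecx,%esi,8)`, and two stores, with no 64-bit shifts. It does reload the table pointer from the stack every iteration (0x5b7). A rough C rendering of that loop, for illustration only: the function name and the `uint32_t` view of the table are assumptions, and `memcpy` stands in for the FFmpeg macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative: decodeplane8's 64-bit OR done as two 32-bit ORs, mirroring
 * the x86_32 code. `lut` holds 512 uint32_t words laid out like a
 * uint64_t[256] table: words 2*i and 2*i+1 are the two halves of entry i. */
static void decodeplane8_split(uint8_t *dst, const uint8_t *buf,
                               unsigned buf_size, const uint32_t *lut)
{
    assert(lut != NULL);
    const uint8_t *end = dst + buf_size * 8;
    for (; dst < end; dst += 8) {
        const unsigned idx = *buf++;   /* movzbl + add $1 on %ebp */
        uint32_t w0, w1;
        memcpy(&w0, dst, 4);           /* first half of 8 pixels */
        memcpy(&w1, dst + 4, 4);       /* second half */
        w0 |= lut[2 * idx];            /* table at (%ecx,%esi,8) */
        w1 |= lut[2 * idx + 1];        /* table at 4(%ecx,%esi,8) */
        memcpy(dst, &w0, 4);
        memcpy(dst + 4, &w1, 4);
    }
}
```

Because OR works byte by byte, ORing the two 4-byte halves separately gives exactly the same pixels as one 64-bit OR, whatever the host byte order.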


-- 

Best regards,
                   :-) Basty/CDGS (-:



