[FFmpeg-devel] [PATCH] FLAC parser

Sat Oct 23 04:08:59 CEST 2010

On Fri, Oct 22, 2010 at 12:11 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Fri, Oct 22, 2010 at 04:54:36AM +0200, Michael Chinen wrote:
>> On Fri, Oct 22, 2010 at 1:15 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Thu, Oct 21, 2010 at 06:39:45PM +0200, Michael Chinen wrote:
>> >> On Thu, Oct 21, 2010 at 1:05 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> > On Wed, Oct 20, 2010 at 03:38:14PM +0200, Michael Chinen wrote:
>> >> >> Hi,
>> >> >>
>> >> >> On Tue, Oct 19, 2010 at 2:48 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> >[...]
>> >> >> >> I did profiling again and it turns out I missed one exit point for the
>> >> >> >> function the last time. ?The non-flat wrap buffer version is about
>> >> >> >> 2-4% faster overall. ?I've squashed it into the 0003.
>> >> >> >
>> >> >> >what is the speed difference between current svn and after this patch ?
>> >> >>
>> >> >> I used the -benchmark flag for 'ffmpeg -i fourminsong.flac a.wav' and
>> >> >> five runs and got
>> >> >> without patch: utime = 2.044-2.042s
>> >> >> with patch: ? ?utime = 2.363-2.379s
>> >> >>
>> >> >> So flac demuxing with the parser is slower.
>> >> >
>> >> > its not a problem when the parser is needed, like for -acodec copy but when
>> >> > it is not needed then a 15% slowdown is a problem. That said it of course
>> >> > would be nicer if it was faster than that even when needed
>> >> >
>> >> >
>> >> >
>> >> > [...]
>> >> >> >
>> >> >> >
>> >> >> > [...]
>> >> >> >> +static int find_headers_search(FLACParseContext *fpc, uint8_t *buf, int buf_size,
>> >> >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int search_start)
>> >> >> >> +
>> >> >> >> +{
>> >> >> >> + ? ?FLACFrameInfo fi;
>> >> >> >> + ? ?int size = 0, i;
>> >> >> >> + ? ?uint8_t *header_buf;
>> >> >> >> +
>> >> >> >> + ? ?for (i = 0; i < buf_size - 1; i++) {
>> >> >> >> + ? ? ? ?if ((AV_RB16(buf + i) & 0xFFFE) == 0xFFF8) {
>> >> >> >
>> >> >> > something based on testing several positions at once is likely faster
>> >> >> > like
>> >> >> > x= AV_RB32()
>> >> >> > (x & ~(x+0x01010101))&0x80808080
>> >> >> > that will detect 0xFF bytes and only after that testing the 4 positions for
>> >> >> > FFF8
>> >> >>
>> >> >> Hmm. ?Since in both cases (header there/header not there) this will
>> >> >> require more masks on a 2 byte int how will it be faster?
>> >> >> Also since it is 15 bits that we are looking for is the 32 bit
>> >> >> handling a mistake?
>> >> >
>> >> > the code is executed 4 times less often than your 2 byte masking
>> >> > see ff_avc_find_startcode_internal() for something quite similar
>> >>
>> >> Thanks I now see the light - I didn't see at first you meant to
>> >> process in 4 byte chunks.
>> >> It is about 2-3x faster with the multiple byte processing:
>> >> fastest without multibyte processing:
>> >> 357748 dezicycles in with more pos testing, 16337 runs, 47 skips
>> >> slowest with:
>> >> 127551 dezicycles in with more pos testing, 15364 runs, 1020 skips
>> >>
>> >> (of course there are more skips so it is harder to profile)
>> >>
>> >> The -benchmark utime dropped down to a range of
>> >> utime = 2.232-2.236
>> >> vs the prepatch:
>> >> utime = 2.049-2.058
>> >>
>> >> so now it is a slowdown of about 10%.
>> >
>> > what amount of that is the startcode search and what amount is the crc16 check?
>> > note START/STOP_TIMER can easily be used to test this
>>
>>
>> Since the startcode search function contains two calls two the
>> function that eventually calls the crc functions, and since they don't
>> get called every time I had to modify STOP_TIMER to omit the skips and
>> have them print out at the same time (not just on the powers of two).
>> I don't think omitting the skips makes the profiling too inaccurate,
>> but if anything I would guess it makes the startcode+crc profile
>> longer (since skips over long running times would occur more often
>> with fast calls.)
>> Attaching patch for timer just in case anyone wants to verify its
>> doing the right stuff.
>>
>> 1952876 dezicycles in crc, 2048 runs, 0 skipsts/s
>> 298719 dezicycles in startcode+crc search, 18698 runs, 0 skips
>> 307619 dezicycles in flac_parse, 20330 runs, 0 skips
>>
>> the total running time
>> crc ? ? ? ? ? ? ? ?= 1952876 * 2048 = 3999490048
>> startcode+crc ? ? ?= 298719 * 18698 = 5585447862
>> flac_parse ? ? ? ? = 307619 * 20330 = 6253894270
>>
>> so
>> startcode = 5585447862 - 3999490048 = 1585957814
>>
>> so startcode search is 28%, crc is 72%.
>> in the larger picture, profiling the entire flac_parse, startcode is
>> 25%, crc is 64% and the other parsing stuff/overhead is 11%. ?"other
>> parsing stuff/overhead" includes writing to the fifo, occasionally
>> copying to a wrap buffer to return a complete frame when it falls
>> across the wrap, and other small stuff like scoring a frame.
>
> i see
> Can you make the CRC16 check only be run in ambigous cases?
> I mean if there are frames with all parameters matching and frame numbers
> monotinously increasing and doing so by a equal amount then the crc16 check
> maybe isnt needed?
> example:
> valid looking frames with numbers:
> n=0 ? n=10 n=20 n=30
> (no crc check needed)
>
> n=0 n=5 n=10 n=20 n=30
> (crc16 check should be run)
>

This means not having selective chains based on crc, but as you say it
may be okay.
I just chained everything together within a certain radius and
optionally have the CRC based on any odd header stuff.

I tested to make sure the CRC stuff works on large files (>30 min)
that have a few false headers.  The code to avoid double checking crc
bytes looks a bit messy because of the chains.

65001 dezicycles in startcode search, 16384 runs, 0 skips
96898 dezicycles in flac_parse, 17782 runs, 0 skips

so now crc doesn't get called most of the time so as expected
startcode search is about 2/3rds and the rest 1/3rd of the running
time.

benchmark of original four min track
without patches
utime = 2.039 - 2.048
with parser patches
utime = 2.069 - 2.079

so now its about a 2% slowdown.  I would expect more of a slowdown
(~5%) from the previous profile, but maybe its attributable to short
profiling problems.

Michael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-move-decode_frame_header-from-flacdec.c-to-flac.c-h.patch
Type: application/octet-stream
Size: 8119 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20101023/e702df8c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Add-error-codes-for-FLAC-header-parsing-and-move-log.patch
Type: application/octet-stream
Size: 7369 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20101023/e702df8c/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Add-FLAC-Parser.patch
Type: application/octet-stream
Size: 39224 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20101023/e702df8c/attachment-0002.obj>