[FFmpeg-devel] [PATCH] FLAC parser

Fri Oct 22 04:54:36 CEST 2010

On Fri, Oct 22, 2010 at 1:15 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Thu, Oct 21, 2010 at 06:39:45PM +0200, Michael Chinen wrote:
>> On Thu, Oct 21, 2010 at 1:05 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Wed, Oct 20, 2010 at 03:38:14PM +0200, Michael Chinen wrote:
>> >> Hi,
>> >>
>> >> On Tue, Oct 19, 2010 at 2:48 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >[...]
>> >> >> I did profiling again and it turns out I missed one exit point for the
>> >> >> function the last time. ?The non-flat wrap buffer version is about
>> >> >> 2-4% faster overall. ?I've squashed it into the 0003.
>> >> >
>> >> >what is the speed difference between current svn and after this patch ?
>> >>
>> >> I used the -benchmark flag for 'ffmpeg -i fourminsong.flac a.wav' and
>> >> five runs and got
>> >> without patch: utime = 2.044-2.042s
>> >> with patch: ? ?utime = 2.363-2.379s
>> >>
>> >> So flac demuxing with the parser is slower.
>> >
>> > its not a problem when the parser is needed, like for -acodec copy but when
>> > it is not needed then a 15% slowdown is a problem. That said it of course
>> > would be nicer if it was faster than that even when needed
>> >
>> >
>> >
>> > [...]
>> >> >
>> >> >
>> >> > [...]
>> >> >> +static int find_headers_search(FLACParseContext *fpc, uint8_t *buf, int buf_size,
>> >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int search_start)
>> >> >> +
>> >> >> +{
>> >> >> + ? ?FLACFrameInfo fi;
>> >> >> + ? ?int size = 0, i;
>> >> >> + ? ?uint8_t *header_buf;
>> >> >> +
>> >> >> + ? ?for (i = 0; i < buf_size - 1; i++) {
>> >> >> + ? ? ? ?if ((AV_RB16(buf + i) & 0xFFFE) == 0xFFF8) {
>> >> >
>> >> > something based on testing several positions at once is likely faster
>> >> > like
>> >> > x= AV_RB32()
>> >> > (x & ~(x+0x01010101))&0x80808080
>> >> > that will detect 0xFF bytes and only after that testing the 4 positions for
>> >> > FFF8
>> >>
>> >> Hmm. ?Since in both cases (header there/header not there) this will
>> >> require more masks on a 2 byte int how will it be faster?
>> >> Also since it is 15 bits that we are looking for is the 32 bit
>> >> handling a mistake?
>> >
>> > the code is executed 4 times less often than your 2 byte masking
>> > see ff_avc_find_startcode_internal() for something quite similar
>>
>> Thanks I now see the light - I didn't see at first you meant to
>> process in 4 byte chunks.
>> It is about 2-3x faster with the multiple byte processing:
>> fastest without multibyte processing:
>> 357748 dezicycles in with more pos testing, 16337 runs, 47 skips
>> slowest with:
>> 127551 dezicycles in with more pos testing, 15364 runs, 1020 skips
>>
>> (of course there are more skips so it is harder to profile)
>>
>> The -benchmark utime dropped down to a range of
>> utime = 2.232-2.236
>> vs the prepatch:
>> utime = 2.049-2.058
>>
>> so now it is a slowdown of about 10%.
>
> what amount of that is the startcode search and what amount is the crc16 check?
> note START/STOP_TIMER can easily be used to test this

Since the startcode search function contains two calls two the
function that eventually calls the crc functions, and since they don't
get called every time I had to modify STOP_TIMER to omit the skips and
have them print out at the same time (not just on the powers of two).
I don't think omitting the skips makes the profiling too inaccurate,
but if anything I would guess it makes the startcode+crc profile
longer (since skips over long running times would occur more often
with fast calls.)
Attaching patch for timer just in case anyone wants to verify its
doing the right stuff.

1952876 dezicycles in crc, 2048 runs, 0 skipsts/s
298719 dezicycles in startcode+crc search, 18698 runs, 0 skips
307619 dezicycles in flac_parse, 20330 runs, 0 skips

the total running time
crc                = 1952876 * 2048 = 3999490048
startcode+crc      = 298719 * 18698 = 5585447862
flac_parse         = 307619 * 20330 = 6253894270

so
startcode = 5585447862 - 3999490048 = 1585957814

so startcode search is 28%, crc is 72%.
in the larger picture, profiling the entire flac_parse, startcode is
25%, crc is 64% and the other parsing stuff/overhead is 11%.  "other
parsing stuff/overhead" includes writing to the fifo, occasionally
copying to a wrap buffer to return a complete frame when it falls
across the wrap, and other small stuff like scoring a frame.

Michael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0004-DEBUG-add-timer-for-startcode-vs-crc.patch
Type: application/octet-stream
Size: 5122 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20101022/e7359408/attachment.obj>