[FFmpeg-devel] [Updated PATCH 3/3] vc-1: Optimise parser (with special attention to ARM)

Michael Niedermayer michaelni at gmx.at
Wed Apr 23 04:26:22 CEST 2014


On Wed, Apr 23, 2014 at 01:41:04AM +0100, Ben Avison wrote:
> The previous implementation of the parser made four passes over each input
> buffer (reduced to two if the container format already guaranteed the input
> buffer corresponded to frames, such as with MKV). But these buffers are
> often 200K in size, certainly enough to flush the data out of L1 cache, and
> for many CPUs, all the way out to main memory. The passes were:
> 
> 1) locate frame boundaries (not needed for MKV etc)
> 2) copy the data into a contiguous block (not needed for MKV etc)
> 3) locate the start codes within each frame
> 4) unescape the data between start codes
> 
> After this, the unescaped data was parsed to extract certain header fields,
> but because the unescape operation was so large, this was usually also
> effectively operating on uncached memory. Most of the unescaped data was
> simply thrown away and never processed further. Only step 2 - because it
> used memcpy - was using prefetch, making things even worse.
> 
> This patch reorganises these steps so that, aside from the copying, the
> operations are performed in parallel, maximising cache utilisation. No more
> than the worst-case number of bytes needed for header parsing is unescaped.
> Most of the data is, in practice, only read in order to search for a start
> code, for which optimised implementations already existed in the H264 codec
> (notably the ARM version uses prefetch, so we end up doing both remaining
> passes at maximum speed). For MKV files, we know when we've found the last
> start code of interest in a given frame, so we are able to avoid doing even
> that one remaining pass for most of the buffer.
> 
> In some use-cases (such as the Raspberry Pi) video decode is handled by the
> GPU, but the entire elementary stream is still fed through the parser to
> pick out certain elements of the header which are necessary to manage the
> decode process. As you might expect, in these cases, the performance of the
> parser is significant.
> 
> To measure parser performance, I used the same VC-1 elementary stream in
> either an MPEG-2 transport stream or an MKV file, and fed it through ffmpeg
> with -c:v copy -c:a copy -f null. These are the gperftools counts for
> those streams, both filtered to only include vc1_parse() and its callees,
> and unfiltered (to include the whole binary). Lower numbers are better:
> 
>                 Before          After
> File  Filtered  Mean   StdDev   Mean   StdDev  Confidence  Change
> M2TS  No        861.7  8.2      650.5  8.1     100.0%      +32.5%
> MKV   No        868.9  7.4      731.7  9.0     100.0%      +18.8%
> M2TS  Yes       250.0  11.2     27.2   3.4     100.0%      +817.9%
> MKV   Yes       149.0  12.8     1.7    0.8     100.0%      +8526.3%
> 
> Yes, that last case shows vc1_parse() running 86 times faster! The M2TS
> case does show a larger absolute improvement though, since it was worse
> to begin with.
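
For reference, the start code search that the description above relies on can be sketched in portable C as below. This is only a naive byte-wise scan for the 00 00 01 prefix; the function name is hypothetical and it is not the optimised H264 routine the patch reuses (which works on wider words and, on ARM, uses prefetch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: return the offset of the
 * first 00 00 01 start code prefix in buf[0..size), or size if there is
 * none. */
static size_t find_start_code(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    /* i + 3 <= size keeps all three reads inside the buffer */
    for (size_t i = 0; i + 3 <= size; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1)
            return i;
    }
    return size;
}
```

A caller that only wants header fields can stop calling this once it has seen the last start code of interest in the frame, which is the MKV shortcut described above.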

Is it faster to do all the steps intermingled?
I am asking because the code should be simpler if it just uses
the optimized start code search and optimized header parsing
while maintaining the current structure.

For example, the header parsing could be optimized like below:

@@ -30,6 +30,16 @@
 #include "vc1.h"
 #include "get_bits.h"

+/** The maximum number of bytes of a sequence, entry point or
+ *  frame header whose values we pay any attention to */
+#define UNESCAPED_THRESHOLD 37
+
+/** The maximum number of bytes of a sequence, entry point or
+ *  frame header which must be valid memory (because they are
+ *  used to update the bitstream cache in skip_bits() calls)
+ */
+#define UNESCAPED_LIMIT 144
+
 typedef struct {
     ParseContext pc;
     VC1Context v;
@@ -41,7 +51,7 @@ static void vc1_extract_headers(AVCodecParserContext *s, AVCodecContext *avctx,
     VC1ParseContext *vpc = s->priv_data;
     GetBitContext gb;
     const uint8_t *start, *end, *next;
-    uint8_t *buf2 = av_mallocz(buf_size + FF_INPUT_BUFFER_PADDING_SIZE);
+    uint8_t buf2[UNESCAPED_LIMIT + FF_INPUT_BUFFER_PADDING_SIZE];

     vpc->v.s.avctx = avctx;
     vpc->v.parse_only = 1;
@@ -55,8 +65,8 @@ static void vc1_extract_headers(AVCodecParserContext *s, AVCodecContext *avctx,

         next = find_next_marker(start + 4, end);
         size = next - start - 4;
-        buf2_size = vc1_unescape_buffer(start + 4, size, buf2);
-        init_get_bits(&gb, buf2, buf2_size * 8);
+        buf2_size = vc1_unescape_buffer(start + 4, FFMIN(size, UNESCAPED_THRESHOLD), buf2);
+        init_get_bits(&gb, buf2, FFMIN(buf2_size, UNESCAPED_THRESHOLD) * 8);
         if(size <= 0) continue;
         switch(AV_RB32(start)){
         case VC1_CODE_SEQHDR:
@@ -99,11 +109,9 @@ static void vc1_extract_headers(AVCodecParserContext *s, AVCodecContext *avctx,
             else
                 s->field_order = AV_FIELD_PROGRESSIVE;

-            break;
+            return;
         }
     }
-
-    av_free(buf2);
 }

 /**
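
The idea in the diff above, unescaping only a bounded prefix of each unit into a small fixed buffer, can be sketched standalone as follows. This is a minimal illustration under stated assumptions: unescape_prefix() is a hypothetical name, not FFmpeg's vc1_unescape_buffer(), and it assumes the usual rule that a 03 byte following 00 00 and followed by a byte below 4 is an escape byte to be dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: unescape at most max_in
 * input bytes from src into dst and return the number of bytes written.
 * The output is never longer than the input, so dst needs room for
 * max_in bytes at most. */
static size_t unescape_prefix(const uint8_t *src, size_t size,
                              uint8_t *dst, size_t max_in)
{
    size_t n = size < max_in ? size : max_in;
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < n; i++) {
        /* 00 00 03 followed by a byte below 4: drop the 03 */
        if (src[i] == 3 && i >= 2 && src[i - 1] == 0 && src[i - 2] == 0 &&
            i + 1 < n && src[i + 1] < 4)
            continue;
        dst[out++] = src[i];
    }
    return out;
}
```

With max_in set to a small constant such as the UNESCAPED_THRESHOLD of 37 from the diff, the work per start code is bounded regardless of how large the frame payload is. A 03 that lands on the very last byte of the truncated window is copied through unchanged, since the byte after it is not examined.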

 [...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The misfortune of the wise is better than the prosperity of the fool.
-- Epicurus