[FFmpeg-devel] [PATCH] Bink Video decoder for FFmpeg 0.6

Kostya kostya.shishkov
Wed Feb 17 14:05:06 CET 2010


On Tue, Feb 16, 2010 at 11:45:31AM -0800, Jason Garrett-Glaser wrote:
> I'll try my hand at some patch review.
> 
> +    do {
> +        if (!get_bits1(gb)) {
> +            *dst++ = *src++;
> +            size--;
> +        } else {
> +            *dst++ = *src2++;
> +            size2--;
> +        }
> +    } while (size && size2);
> 
> Maybe dst++ should be moved out of the if?  I don't trust gcc to
> optimize away the duplicated increment that would otherwise be in each
> branch.

As discussed, we don't trust GCC to do optimize that increment in any
way.
 
> +    while (size--)
> +        *dst++ = *src++;
> +    while (size2--)
> +        *dst++ = *src2++;
> 
> This looks like a memcpy to me.

It is but since its size is guaranteed to be <= 16 in the best case,
replacing it with memcpy is just shaving pigs.

> +            v = -v;
> 
> Look at some of the existing decoders for better ways to branchlessly
> negate a value based on a get_bits1(gb) call.

ok, changed

> +            if (v & 0x80)
> +                v = -(v & 0x7F);
> 
> Same here.

I don't see an easy way to bit manipulate it.

> +            for (j = 0; j < len2; j++)
> +                *dst++ = v;
> 
> This looks like a memset to me.

it operates on int16_t*

> +                    mode_list[list_pos] = ((ccoef + 4) << 2) | 1;
> 
> +1 I would guess is better here than |1 because it can be done in an
> lea instruction (due to the shift).  On any non-x86 it'll be neutral.
> Same with all similar lines.

done

> +            switch (mode) {
> +            case 0:
> +            case 2:
> +                if (!mode) {
> +                    mode_list[list_pos] = ((ccoef + 4) << 2) | 1;
> +                } else {
> +                    mode_list[list_pos++] = 0;
> +                }
> 
> Yes, this was a good joke, I laughed too, but the if statement can be
> probably simplified slightly if you move the first if above "case 2:"
> ;)  Same with residuals section.

done

> +                        if (!bits) {
> +                            t = get_bits1(gb) ? -1 : 1;
> +                        } else {
> +                            t = get_bits(gb, bits) | mask;
> +                            if (get_bits1(gb))
> +                                t = -t;
> +                        }
> 
> Is get_bits valid with bits=0?  If so, this can be simplified.
> 
> Also, again with the ? -1 : 1 and t = -t and so forth: make it
> branchless if gcc doesn't already.

made
 
> +#define PUT2x2(dst, stride, x, y, pix) \
> +    dst[(x)*2 +     (y)*2       * stride] = \
> +    dst[(x)*2 + 1 + (y)*2       * stride] = \
> +    dst[(x)*2 +     ((y)*2 + 1) * stride] = \
> +    dst[(x)*2 + 1 + ((y)*2 + 1) * stride] = pix;
> +
> 
> Write-combining would be appropriate here with intreadwrite macros,
> depending on whether or not the block is unaligned.

dropped as discussed later

> +{
> +    c->dsp.get_pixels(block, src, stride);
> +    c->dsp.put_pixels_clamped(block, dst, stride);
> +}
> 
> Does this even make any sense?  Wouldn't the pixels already be clamped
> if you're copying them?  If you may in some cases be adding on top of
> existing pixels, maybe split that case out?  There's huge inefficiency
> going on here if only because of the constant and pointless conversion
> to DCTELEM.

replaced with single put_pixels_tab call
 
> +        bw = plane ? (avctx->width  + 15) >> 4 : (avctx->width  + 7) >> 3;
> +        bh = plane ? (avctx->height + 15) >> 4 : (avctx->height + 7) >> 3;
> 
> (avctx->height + (16>>!!plane)-1) >> (4-!!plane) ?

I'd like a bit more readability here

> +        if (!plane || !c->swap_planes)
> +            plane_idx = plane;
> +        else
> +            plane_idx = plane ^ 3;
> 
> I'd ternary this.

Why?

> if ((by & 1) && blk == 1) {
> +                    bx++;
> +                    dst  += 8;
> +                    prev += 8;
> +                    continue;
> +                }
> 
> What is "1"?  Right after this, you have a switch that uses enums for
> blk, so it isn't clear.

Yes, replaced with proper enum.

>  Why isn't this part of the switch?

It would only make SCALED_BLOCK case uglier.

> +                                    PUT2x2(dst, stride, pos & 7, pos >> 3, v);
> 
> Is it really efficient code-wise to add this extra &7/>>3 logic just
> to make the scan array a little bit smaller?

dropped

> +                        for (j = 0; j < 8; j++)
> +                            for (i = 0; i < 8; i++)
> +                                PUT2x2(dst, stride, i, j, block[i + j*8]);
> 
> OK, this is seriously inefficient now.  The run stuff was bad enough:
> I think we need a DSP function that takes an 8x8 array and outputs it
> to 16x16.  This would be so much faster in asm.  Split it out, put it
> in a DSP function, port all the code to use it (including run stuff),
> and then we can write asm later.

added

> +                        for (j = 0; j < 16; j++)
> +                            memset(dst + j*stride, v, 16);
> 
> Similarly here.  If we don't have a DSP function to fill a block with
> a constant value, we should make one.

added

> +                    default:
> +                        av_log(avctx, AV_LOG_ERROR, "Incorrect 16x16
> block type %d\n", blk);
> +                        return -1;
> +                    }
> 
> Is it even possible for this code to trigger given the limited values
> that blk can get when reading from even an invalid bitstream?

of course
there are 10 primary block types out of 16 possible and intra block
types are not allowed in 16x16 block mode (thus only 5 types there)

> +                    copy_block(c, block, ref, dst, stride);
> +                    c->dsp.clear_block(block);
> +                    v = get_bits(&gb, 7);
> +                    read_residue(&gb, block, v);
> +                    c->dsp.add_pixels_clamped(block, dst, stride);
> 
> This seems wasteful as well.  You're treating pixels like DCT
> coefficients and using DCT coefficient functions on them, which seems
> silly.

not as such, but they should still be stored as signed integers with >8
bits precision during decoding

> +                    for (i = 0; i < 64; i++)
> +                        block[i] = (block[i] * quant[i]) >> 11;
> 
> Why isn't this done as part of the decode residual function?  You're
> dequanting all 64 coefficients, even though most will be zero.

Mainly because quant table index must be read after all DCT coefficients
are decoded (sounds like bad design to me).

> Pass done.
> 
> Dark Shikari
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bink-v2.patch
Type: text/x-diff
Size: 79798 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100217/4e258f53/attachment.patch>



More information about the ffmpeg-devel mailing list