[FFmpeg-devel] [PATCH] DNxHD/VC-3 encoder

Sat Oct 6 01:18:16 CEST 2007

Hi

On Fri, Oct 05, 2007 at 11:53:30AM +0200, Baptiste Coudurier wrote:
> Hi
> 
> Michael Niedermayer wrote:
> > Hi
> > 
> > On Wed, Oct 03, 2007 at 10:45:04PM +0200, Baptiste Coudurier wrote:
> >> Hi Michael, thanks for the review,
> >>
> >> Michael Niedermayer wrote:
> >>> On Mon, Sep 24, 2007 at 01:00:12PM +0200, Baptiste Coudurier wrote:
> >>>
> >>>> Hi Michael,
> >>>>
> >>>> Michael Niedermayer wrote:
> >>>>
> >>>>> Hi
> >>>>>
> >>>>> On Fri, Sep 07, 2007 at 08:05:28PM +0200, Michael Niedermayer wrote:
> >>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> On Fri, Sep 07, 2007 at 09:17:49AM -0600, Loren Merritt wrote:
> >>>>>>
> >>>>>>> On Fri, 7 Sep 2007, Michael Niedermayer wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> when people start using qsort you know theres something wrong in their
> >>>>>>>> encoder (or at least not optimal)
> >>>>>>>>
> >>>>>>>> so, lets first check if i understand the constraints
> >>>>>>>> * its possible to change qscale per macroblock and it doesnt cost anything
> >>>>>>>> * theres a fixed number of bits available for each frame (pre mpeg1 design)
> >>>>>>>>
> >>>>>>>> in that case the optimal solution (though not fastest) is to do a
> >>>>>>>> binary search over lambda (using the number of bits and the ones available
> >>>>>>>> to find the best lambda which results in a frame which is not too large)
> >>>>>>>>
> >>>>>>>> then for each such lambda, find the best qscale for each MB, again by
> >>>>>>>> binary search minimizing SSD + bits*lambda (that way we know the best
> >>>>>>>> qscales and how many bits the frame would need)
> >>>>>>> Your algorithm produces the same decisions as the one with qsort. I don't 
> >>>>>>> know which is faster.
> >>>>>>>
> >>>>>>> Consider the list of possible block encodings sorted by 
> >>>>>>> delta_ssd/delta_bits. Optimizing for qps at any given lambda will produce 
> >>>>>>> some prefix of that list. i.e. for any lambda, there is some 
> >>>>>>> position in the list such that all preceding entries have ssd/bits 
> >>>>>>> better than lambda, and all following entries have ssd/bits worse 
> >>>>>>> than lambda. So instead of evaluating bits(lambda) as a black box and 
> >>>>>>> searching for a specified values of bits, construct the function 
> >>>>>>> lambda(bits) and evaluate it at the specified value of bits.
> >>>>>> yes, but as far as i can see thats not what the current code does, it just
> >>>>>> tries qscale and qscale+1 if it would try all qscale then yes they would
> >>>>>> be identical
> >>>>> just to clarify my original review ...
> >>>>> i dont mind if the code uses qsort, what i was complaining about was that
> >>>>> it isnt finding the optimal (in the RD sense) solution
> >>>>>
> >>>>> if its changed to find the optimal one or its demonstrated that the
> >>>>> PSNR/bitrate difference is negligible then iam happy with that part
> >>>>>
> >>>> I tried lagrange. I keep the qsort rc because it is useable (> real time
> >>>> on quad core cpu with variance, -5 fps for ssd), rdo one is encoding at
> >>>> 4 fps :/
> >>>>
> >>>> Im wondering if I did screwed up the implementation because I have an
> >>>> IMHO negligeable gain with rdo.
> >>>>
> >>>> I reused existing quant functions.
> >>>>
> >>>> Threaded functions must return int else you get a warning.
> >>>>
> >>>> About encode_dc, it will be complicated to reuse existing ones because:
> >>>> mpeg12 use some static tables and generation code.
> >>>> mjpeg use uint16_t for dc_codes.
> >>>>
> >>>> About mb_var_thread, existing function use current_picture and set
> >>>> variables I don't need, mb_mean, and use me field too ...
> >>>>
> >>>> Attached are some psnr results with one HD uncompressed sample I have.
> >>>>
> >>>> Any other optimization because what Loren suggested earlier ?
> >>>
> >>> [...]
> >>>
> >>>> +static int dnxhd_init_qmat(DNXHDEncContext *ctx, int lbias, int cbias)
> >>>> +{
> >>>> +    int qscale, i;
> >>>> +    int (*qmat_luma  )[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_luma));
> >>>> +    int (*qmat_chroma)[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_chroma));
> >>>> +    uint16_t (*qmat16_luma  )[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_luma));
> >>>> +    uint16_t (*qmat16_chroma)[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_chroma));
> >>>> +
> >>>> +    for (qscale = 1; qscale < ctx->avctx->qmax; qscale++) {
> >>>> +        for (i = 1; i < 64; i++) {
> >>>> +            int j = ff_zigzag_direct[i];
> >>>> +            qmat_luma  [qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->  luma_weigth[i]));
> >>>> +            qmat_chroma[qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->chroma_weigth[i]));
> >>>> +
> >>>> +            qmat16_luma  [qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->  luma_weigth[i]);
> >>>> +            qmat16_chroma[qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->chroma_weigth[i]);
> >>>> +
> >>>> +            if (qmat16_luma  [qscale][0][j]==0 || qmat16_luma  [qscale][0][j]==128*256) qmat16_luma  [qscale][0][j]=128*256-1;
> >>>> +            if (qmat16_chroma[qscale][0][j]==0 || qmat16_chroma[qscale][0][j]==128*256) qmat16_chroma[qscale][0][j]=128*256-1;
> >>>> +
> >>>> +            qmat16_luma  [qscale][1][j] = ROUNDED_DIV(lbias<<(16-QUANT_BIAS_SHIFT), qmat16_luma  [qscale][0][j]);
> >>>> +            qmat16_chroma[qscale][1][j] = ROUNDED_DIV(cbias<<(16-QUANT_BIAS_SHIFT), qmat16_chroma[qscale][0][j]);
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    ctx->q_scale_matrixl   = qmat_luma;
> >>>> +    ctx->q_scale_matrixc   = qmat_chroma;
> >>>> +    ctx->q_scale_matrixl16 = qmat16_luma;
> >>>> +    ctx->q_scale_matrixc16 = qmat16_chroma;
> >>>> +    return 0;
> >>>> +}
> >>>
> >>> duplicate of convert_matrix()
> >> Yes, there are some tiny differences, and I need more code to readjust
> >> tables, and its seems bigger than duplicating convert_matrix. What
> >> should I do ?
> > 
> > does it work with all DCTs? the faan dct, the slow and fast jpeg dcts?
> > ive not checked but i had the feeling it wont work with all
> > if it works i of course prefer the smaller and simpler solution
> 
> Ok.
> 
> >>> [...]
> >>>
> >>>> +    dsputil_init(&ctx->m.dsp, avctx);
> >>>> +    ff_init_scantable(ctx->m.dsp.idct_permutation, &ctx->m.intra_scantable, ff_zigzag_direct);
> >>>> +
> >>>> +    ctx->m.avctx = avctx;
> >>>> +    ctx->m.dct_quantize = dct_quantize_c;
> >>>
> >>> does this disable the optimized quantize funcions?
> >> Not really, since MPV_common_init_mmx will reassign optimized function
> >> later. I changed it nonetheless.
> >>
> >>> what about quantize_trelis? (no its no reason to delay the patch, making
> >>> trellis work could very well be in a later patch, iam just asking)
> >> Well I'd like the encoder to be commited so I have real history on the
> >> work. I find easier to work when code is in svn (diffs).
> >>
> > 
> >> I had a quick look at trellis but I don't understand it completely yet
> >> (ac_esc_length not set by mpeg2 encoder for example, and levels taken
> >> into account seems different than dnxhd levels, once I get it I could as
> >> well do it for mjpeg)
> > 
> > trellis for mjpeg would be cool :)
> > maybe we could beat jpeg2000 that way ;)
> 
> Or snow ;)
> Seriously, is it wanted or not ? 

trellis quant is wanted for all codecs

> Is there some documentation of the current
> function in mpegvideo_enc.c somewhere ?

documentation ... well ... no, but you can read the source and look up
viterbi algorithm

> 
> > [...]
> >> How does it look like ?
> > 
> > [...]
> > 
> >> Index: libavcodec/dnxhdenc.c
> >> ===================================================================
> >> --- libavcodec/dnxhdenc.c	(revision 0)
> >> +++ libavcodec/dnxhdenc.c	(revision 0)
> > [...]
> >> +//#define DEBUG
> >> +#define RC_VARIANCE 1 // use variance or ssd for fast rc
> >> +
> >> +#include "avcodec.h"
> >> +#include "dsputil.h"
> >> +#include "mpegvideo.h"
> >> +#include "dnxhddata.h"
> >> +
> >> +int dct_quantize_c(MpegEncContext *s, DCTELEM *block, int n, int qscale, int *overflow);
> > 
> > totally unrelated to your patch but this needs a ff_ prefix
> 
> Yes.
> 
> > [...]
> >> +typedef struct DNXHDEncContext {
> >> +    MpegEncContext m; ///< Used for quantization dsp functions
> >> +
> >> +    AVCodecContext *avctx;
> >> +    AVFrame *picture;
> >> +    int cid;
> >> +    const CIDEntry *cid_table;
> > 
> >> +    uint8_t *msips;
> > 
> > whats a msips ? this could benefit from a comment i think
> 
> Better renamed to msip and added comment.
> 
> > [...]
> >> +    ctx->m.avctx = avctx;
> >> +    ctx->m.mb_intra = 1;
> >> +    ctx->m.h263_aic = 1;
> >> +
> >> +    MPV_common_init_mmx(&ctx->m);
> >> +    if (!ctx->m.dct_quantize)
> >> +        ctx->m.dct_quantize = dct_quantize_c;
> > 
> > iam an idiot, i was staring at this for maybe 2 minutes, somehow i knew
> > it was wrong i dont understand why i didnt realize it immedeatly ...
> > 
> > _mmx()
> > 
> > you dont even know if the CPU is a x86 ...
> 
> Added ifdef

NO!, a *codec *muxer has no business calling any *_mmx
think about it, what will happen with people on ppc, on arm, on sparc, ...
also grep, no other codec does that

find out which init function is the correct one and call it dont call
architecture specific functions directly
if theres no way to do what you want currently fix the code so it is
possible cleanly

> 
> > [...]
> >> +                if (i&2) {
> >> +                    n = 1 + (i&1);
> >> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> >> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_c;
> >> +                } else {
> >> +                    n = 0;
> >> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> >> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_l;
> >> +                }
> >> +
> > [...]
> > 
> >> +                if (i&2) {
> >> +                    n = 1 + (i&1);
> >> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> >> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_c;
> >> +                } else {
> >> +                    n = 0;
> >> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> >> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_l;
> >> +                }
> > 
> > duplicated
> 
> Added dnxhd_switch_matrix
> 
> > [...]
> >> +static int dnxhd_find_qscale(DNXHDEncContext *ctx)
> >> +{
> >> +    int bits = 0;
> >> +    int up_step = 1;
> >> +    int down_step = 1;
> >> +    int last_higher = 0;
> >> +    int last_lower = INT_MAX;
> >> +    int qscale;
> >> +    int x, y;
> >> +
> >> +    qscale = ctx->qscale;
> >> +    for (;;) {
> >> +        bits = 0;
> >> +        ctx->qscale = qscale;
> >> +        // XXX avoid recalculating bits
> >> +        ctx->avctx->execute(ctx->avctx, dnxhd_calc_bits_thread, (void**)&ctx->thread[0], NULL, ctx->avctx->thread_count);
> >> +        for (y = 0; y < ctx->m.mb_height; y++) {
> >> +            for (x = 0; x < ctx->m.mb_width; x++)
> >> +                bits += ctx->mb_rc[qscale][y*ctx->m.mb_width+x].bits;
> >> +            bits = (bits+31)&~31; // padding
> >> +            if (bits > ctx->frame_bits)
> >> +                break;
> >> +        }
> >> +        //dprintf(ctx->avctx, "%d, qscale %d, bits %d, frame %d\n", ctx->avctx->frame_number, qscale, bits, ctx->frame_bits);
> >> +        if (bits < ctx->frame_bits) {
> >> +            if (qscale == 1)
> >> +                break;
> >> +            if (last_higher == qscale - 1) {
> >> +                qscale = last_higher;
> >> +                break;
> >> +            }
> >> +            if (qscale < last_lower)
> >> +                last_lower = qscale;
> >> +            qscale -= down_step++;
> > 
> > this looks like it can make qscale negative if bits stays small enough
> 
> Added qscale < 1 test.
> 
> >> +            up_step = 1;
> >> +        } else {
> >> +            if (last_lower == qscale + 1)
> >> +                break;
> >> +            if (qscale > last_higher)
> >> +                last_higher = qscale;
> >> +            qscale += up_step++;
> >> +            down_step = 1;
> >> +            if (qscale >= ctx->avctx->qmax)
> >> +                return -1;
> >> +        }
> > 
> > this is not optimal (speed wise) IMHO
> > if you have a lower and upper qscale then simple bisection should be
> > faster
> 
> Done.
> 
> > [...]
> >> +static int dnxhd_encode_picture(AVCodecContext *avctx, unsigned char *buf, int buf_size, void *data)
> >> +{
> >> +    DNXHDEncContext *ctx = avctx->priv_data;
> >> +    AVFrame *input_frame = data;
> >> +    int first_field = 1;
> >> +    int offset, i, ret;
> >> +
> >> +    if (buf_size < ctx->cid_table->frame_size) {
> >> +        av_log(avctx, AV_LOG_ERROR, "output buffer is too small to compress picture\n");
> >> +        return -1;
> >> +    }
> >> +
> >> +    av_picture_copy((AVPicture *)ctx->picture, data, PIX_FMT_YUV422P, avctx->width, avctx->height);
> > 
> > why do you copy the input image?
> 
> If picture is not allocated as 1920x1088 I need to copy it.

why?

[...]
> +    /* rate control */
> +    unsigned slice_bits;
> +    unsigned qscale;
> +    unsigned lambda;

not doxygen compatible comment

[...]

> +            if (qscale < last_lower)
> +                last_lower = qscale;

FFMIN

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071006/764e21d3/attachment.pgp>