[FFmpeg-devel] [PATCH] DNxHD/VC-3 encoder

Michael Niedermayer michaelni
Thu Oct 4 01:26:15 CEST 2007


Hi

On Wed, Oct 03, 2007 at 10:45:04PM +0200, Baptiste Coudurier wrote:
> Hi Michael, thanks for the review,
> 
> Michael Niedermayer wrote:
> > On Mon, Sep 24, 2007 at 01:00:12PM +0200, Baptiste Coudurier wrote:
> > 
> >>Hi Michael,
> >>
> >>Michael Niedermayer wrote:
> >>
> >>>Hi
> >>>
> >>>On Fri, Sep 07, 2007 at 08:05:28PM +0200, Michael Niedermayer wrote:
> >>>
> >>>>Hi
> >>>>
> >>>>On Fri, Sep 07, 2007 at 09:17:49AM -0600, Loren Merritt wrote:
> >>>>
> >>>>>On Fri, 7 Sep 2007, Michael Niedermayer wrote:
> >>>>>
> >>>>>
> >>>>>>when people start using qsort you know there's something wrong in their
> >>>>>>encoder (or at least not optimal)
> >>>>>>
> >>>>>>so, let's first check if I understand the constraints:
> >>>>>>* it's possible to change qscale per macroblock and it doesn't cost anything
> >>>>>>* there's a fixed number of bits available for each frame (pre-MPEG-1 design)
> >>>>>>
> >>>>>>in that case the optimal solution (though not the fastest) is to do a
> >>>>>>binary search over lambda (comparing the number of bits needed against the
> >>>>>>number available) to find the best lambda which results in a frame that is not too large
> >>>>>>
> >>>>>>then for each such lambda, find the best qscale for each MB, again by
> >>>>>>binary search minimizing SSD + bits*lambda (that way we know the best
> >>>>>>qscales and how many bits the frame would need)
> >>>>>
> >>>>>Your algorithm produces the same decisions as the one with qsort. I don't 
> >>>>>know which is faster.
> >>>>>
> >>>>>Consider the list of possible block encodings sorted by 
> >>>>>delta_ssd/delta_bits. Optimizing for qps at any given lambda will produce 
> >>>>>some prefix of that list. i.e. for any lambda, there is some 
> >>>>>position in the list such that all preceding entries have ssd/bits 
> >>>>>better than lambda, and all following entries have ssd/bits worse 
> >>>>>than lambda. So instead of evaluating bits(lambda) as a black box and 
> >>>>>searching for a specified value of bits, construct the function 
> >>>>>lambda(bits) and evaluate it at the specified value of bits.
> >>>>
> >>>>yes, but as far as I can see that's not what the current code does, it just
> >>>>tries qscale and qscale+1. If it tried all qscales then, yes, they would
> >>>>be identical
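
To make Loren's list construction concrete, a minimal sketch (RCEntry, its
fields and the comparator are invented here for illustration, not the patch's
structures):

    #include <stdint.h>
    #include <stdlib.h>

    /* One candidate "refinement" of one macroblock (e.g. encoding it at
     * qscale-1 instead of qscale): dropping it saves delta_bits at the
     * price of delta_ssd extra distortion. */
    typedef struct {
        int mb;
        int delta_ssd;
        int delta_bits;
    } RCEntry;

    /* Sort ascending by delta_ssd/delta_bits (cross-multiplied to avoid a
     * division): the refinements that cost the least distortion per saved
     * bit come first. */
    static int cmp_rd_slope(const void *a, const void *b)
    {
        const RCEntry *x = a, *y = b;
        int64_t d = (int64_t)x->delta_ssd * y->delta_bits -
                    (int64_t)y->delta_ssd * x->delta_bits;
        return d < 0 ? -1 : (d > 0 ? 1 : 0);
    }

After qsort()ing with this comparator, dropping entries from the front until
the frame fits the bit budget selects the same set that thresholding at some
lambda would select, which is the lambda(bits) versus bits(lambda) point above.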
> >>>
> >>>just to clarify my original review ...
> >>>I don't mind if the code uses qsort; what I was complaining about was that
> >>>it isn't finding the optimal (in the RD sense) solution
> >>>
> >>>if it's changed to find the optimal one, or it's demonstrated that the
> >>>PSNR/bitrate difference is negligible, then I am happy with that part
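
And a rough sketch of the bisection-over-lambda formulation from the start of
the thread (purely illustrative: mb_cost(), the lambda bounds and the
iteration count are placeholders, not the patch's API, and the per-MB search
is done exhaustively here rather than by the inner binary search mentioned
above):

    typedef struct { int ssd, bits; } MBCost;

    /* hypothetical helper: distortion and size of macroblock mb at qscale q */
    MBCost mb_cost(int mb, int q);

    /* For a given lambda, give every MB the qscale minimizing
     * ssd + lambda*bits and return the resulting frame size. */
    static int bits_for_lambda(double lambda, int nb_mbs, int qmax, int *qscales)
    {
        int total = 0, mb, q;
        for (mb = 0; mb < nb_mbs; mb++) {
            double best_rd = -1.0;
            int best_bits = 0;
            for (q = 1; q < qmax; q++) {
                MBCost c  = mb_cost(mb, q);
                double rd = c.ssd + lambda * c.bits;
                if (best_rd < 0 || rd < best_rd) {
                    best_rd     = rd;
                    best_bits   = c.bits;
                    qscales[mb] = q;
                }
            }
            total += best_bits;
        }
        return total;
    }

    /* Bisect lambda until the frame just fits into frame_bits. */
    static void rc_lambda_search(int frame_bits, int nb_mbs, int qmax, int *qscales)
    {
        double lo = 0.0, hi = 1e6; /* assumed to bracket the solution */
        int i;
        for (i = 0; i < 32; i++) {
            double lambda = 0.5 * (lo + hi);
            if (bits_for_lambda(lambda, nb_mbs, qmax, qscales) > frame_bits)
                lo = lambda;  /* too many bits: weigh bits more heavily */
            else
                hi = lambda;
        }
        bits_for_lambda(hi, nb_mbs, qmax, qscales); /* final qscales that fit */
    }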
> >>>
> >>
> >>I tried Lagrange. I keep the qsort RC because it is usable (> real time
> >>on a quad core CPU with variance, -5 fps for SSD); the RDO one is encoding at
> >>4 fps :/
> >>
> >>I'm wondering if I screwed up the implementation, because I get an
> >>IMHO negligible gain with RDO.
> >>
> >>I reused existing quant functions.
> >>
> >>Threaded functions must return int, otherwise you get a warning.
> >>
> >>About encode_dc, it will be complicated to reuse the existing ones because:
> >>mpeg12 uses some static tables and generation code;
> >>mjpeg uses uint16_t for dc_codes.
> >>
> >>About mb_var_thread, the existing function uses current_picture, sets
> >>variables I don't need (mb_mean), and uses the me field too ...
> >>
> >>Attached are some PSNR results with one uncompressed HD sample I have.
> >>
> >>Any other optimizations besides what Loren suggested earlier?
> > 
> > 
> > [...]
> > 
> >>+static int dnxhd_init_qmat(DNXHDEncContext *ctx, int lbias, int cbias)
> >>+{
> >>+    int qscale, i;
> >>+    int (*qmat_luma  )[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_luma));
> >>+    int (*qmat_chroma)[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_chroma));
> >>+    uint16_t (*qmat16_luma  )[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_luma));
> >>+    uint16_t (*qmat16_chroma)[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_chroma));
> >>+
> >>+    for (qscale = 1; qscale < ctx->avctx->qmax; qscale++) {
> >>+        for (i = 1; i < 64; i++) {
> >>+            int j = ff_zigzag_direct[i];
> >>+            qmat_luma  [qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->  luma_weigth[i]));
> >>+            qmat_chroma[qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->chroma_weigth[i]));
> >>+
> >>+            qmat16_luma  [qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->  luma_weigth[i]);
> >>+            qmat16_chroma[qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->chroma_weigth[i]);
> >>+
> >>+            if (qmat16_luma  [qscale][0][j]==0 || qmat16_luma  [qscale][0][j]==128*256) qmat16_luma  [qscale][0][j]=128*256-1;
> >>+            if (qmat16_chroma[qscale][0][j]==0 || qmat16_chroma[qscale][0][j]==128*256) qmat16_chroma[qscale][0][j]=128*256-1;
> >>+
> >>+            qmat16_luma  [qscale][1][j] = ROUNDED_DIV(lbias<<(16-QUANT_BIAS_SHIFT), qmat16_luma  [qscale][0][j]);
> >>+            qmat16_chroma[qscale][1][j] = ROUNDED_DIV(cbias<<(16-QUANT_BIAS_SHIFT), qmat16_chroma[qscale][0][j]);
> >>+        }
> >>+    }
> >>+
> >>+    ctx->q_scale_matrixl   = qmat_luma;
> >>+    ctx->q_scale_matrixc   = qmat_chroma;
> >>+    ctx->q_scale_matrixl16 = qmat16_luma;
> >>+    ctx->q_scale_matrixc16 = qmat16_chroma;
> >>+    return 0;
> >>+}
> > 
> > 
> > duplicate of convert_matrix()
> 
> Yes, there are some tiny differences, and I need more code to readjust the
> tables, and it seems bigger than duplicating convert_matrix. What
> should I do?

does it work with all DCTs? the faan DCT, the slow and fast JPEG DCTs?
I've not checked, but I had the feeling it won't work with all of them.
if it works I of course prefer the smaller and simpler solution


> 
> > [...]
> > 
> >>+    dsputil_init(&ctx->m.dsp, avctx);
> >>+    ff_init_scantable(ctx->m.dsp.idct_permutation, &ctx->m.intra_scantable, ff_zigzag_direct);
> >>+
> >>+    ctx->m.avctx = avctx;
> >>+    ctx->m.dct_quantize = dct_quantize_c;
> > 
> > 
> > does this disable the optimized quantize functions?
> 
> Not really, since MPV_common_init_mmx will reassign the optimized function
> later. I changed it nonetheless.
> 
> > what about quantize_trellis? (no, it's no reason to delay the patch; making
> > trellis work could very well be in a later patch, I am just asking)
> 
> Well, I'd like the encoder to be committed so I have a real history of the
> work. I find it easier to work when the code is in svn (diffs).
> 

> I had a quick look at trellis but I don't understand it completely yet
> (ac_esc_length is not set by the mpeg2 encoder for example, and the levels taken
> into account seem different from the dnxhd levels; once I get it I could just as
> well do it for mjpeg)

trellis for mjpeg would be cool :)
maybe we could beat jpeg2000 that way ;)


[...]
> 
> How does it look?

[...]

> Index: libavcodec/dnxhdenc.c
> ===================================================================
> --- libavcodec/dnxhdenc.c	(revision 0)
> +++ libavcodec/dnxhdenc.c	(revision 0)
[...]
> +//#define DEBUG
> +#define RC_VARIANCE 1 // use variance or ssd for fast rc
> +
> +#include "avcodec.h"
> +#include "dsputil.h"
> +#include "mpegvideo.h"
> +#include "dnxhddata.h"
> +
> +int dct_quantize_c(MpegEncContext *s, DCTELEM *block, int n, int qscale, int *overflow);

totally unrelated to your patch, but this needs an ff_ prefix

[...]
> +typedef struct DNXHDEncContext {
> +    MpegEncContext m; ///< Used for quantization dsp functions
> +
> +    AVCodecContext *avctx;
> +    AVFrame *picture;
> +    int cid;
> +    const CIDEntry *cid_table;

> +    uint8_t *msips;

what's a msips? this could benefit from a comment, I think


[...]
> +    ctx->m.avctx = avctx;
> +    ctx->m.mb_intra = 1;
> +    ctx->m.h263_aic = 1;
> +
> +    MPV_common_init_mmx(&ctx->m);
> +    if (!ctx->m.dct_quantize)
> +        ctx->m.dct_quantize = dct_quantize_c;

I am an idiot, I was staring at this for maybe 2 minutes; somehow I knew
it was wrong, I don't understand why I didn't realize it immediately ...

_mmx()

you don't even know if the CPU is an x86 ...
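
One way to address this, sketched under the assumption that HAVE_MMX is the
appropriate guard for this file:

    #ifdef HAVE_MMX
        MPV_common_init_mmx(&ctx->m);
    #endif
        if (!ctx->m.dct_quantize)
            ctx->m.dct_quantize = dct_quantize_c;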


[...]
> +                if (i&2) {
> +                    n = 1 + (i&1);
> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_c;
> +                } else {
> +                    n = 0;
> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_l;
> +                }
> +
[...]

> +                if (i&2) {
> +                    n = 1 + (i&1);
> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_c;
> +                } else {
> +                    n = 0;
> +                    ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> +                    ctx->m.q_intra_matrix   = ctx->qmatrix_l;
> +                }

duplicated
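
One way to deduplicate the two blocks (a sketch only; the helper name is made
up, the body just reuses the fields from the quoted code):

    static inline int dnxhd_switch_matrix(DNXHDEncContext *ctx, int i)
    {
        if (i & 2) {
            ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
            ctx->m.q_intra_matrix   = ctx->qmatrix_c;
            return 1 + (i & 1);
        } else {
            ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
            ctx->m.q_intra_matrix   = ctx->qmatrix_l;
            return 0;
        }
    }

with both call sites reduced to n = dnxhd_switch_matrix(ctx, i);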


[...]
> +static int dnxhd_find_qscale(DNXHDEncContext *ctx)
> +{
> +    int bits = 0;
> +    int up_step = 1;
> +    int down_step = 1;
> +    int last_higher = 0;
> +    int last_lower = INT_MAX;
> +    int qscale;
> +    int x, y;
> +
> +    qscale = ctx->qscale;
> +    for (;;) {
> +        bits = 0;
> +        ctx->qscale = qscale;
> +        // XXX avoid recalculating bits
> +        ctx->avctx->execute(ctx->avctx, dnxhd_calc_bits_thread, (void**)&ctx->thread[0], NULL, ctx->avctx->thread_count);
> +        for (y = 0; y < ctx->m.mb_height; y++) {
> +            for (x = 0; x < ctx->m.mb_width; x++)
> +                bits += ctx->mb_rc[qscale][y*ctx->m.mb_width+x].bits;
> +            bits = (bits+31)&~31; // padding
> +            if (bits > ctx->frame_bits)
> +                break;
> +        }
> +        //dprintf(ctx->avctx, "%d, qscale %d, bits %d, frame %d\n", ctx->avctx->frame_number, qscale, bits, ctx->frame_bits);
> +        if (bits < ctx->frame_bits) {
> +            if (qscale == 1)
> +                break;
> +            if (last_higher == qscale - 1) {
> +                qscale = last_higher;
> +                break;
> +            }
> +            if (qscale < last_lower)
> +                last_lower = qscale;
> +            qscale -= down_step++;

this looks like it can make qscale negative if bits stays small enough


> +            up_step = 1;
> +        } else {
> +            if (last_lower == qscale + 1)
> +                break;
> +            if (qscale > last_higher)
> +                last_higher = qscale;
> +            qscale += up_step++;
> +            down_step = 1;
> +            if (qscale >= ctx->avctx->qmax)
> +                return -1;
> +        }

this is not optimal (speed-wise) IMHO;
if you have a lower and an upper qscale bound, then simple bisection should be
faster
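
A sketch of that bisection (assuming, as the search above already does, that
the bit count is non-increasing in qscale; dnxhd_frame_bits() is a
hypothetical wrapper around the dnxhd_calc_bits_thread pass), which also
keeps qscale >= 1 by construction:

    static int dnxhd_find_qscale_bisect(DNXHDEncContext *ctx)
    {
        int low = 1, high = ctx->avctx->qmax - 1, best = -1;

        while (low <= high) {
            int qscale = (low + high) / 2;
            int bits   = dnxhd_frame_bits(ctx, qscale); /* hypothetical helper */
            if (bits <= ctx->frame_bits) {
                best = qscale;      /* fits: try a finer quantizer */
                high = qscale - 1;
            } else {
                low  = qscale + 1;  /* too large: need a coarser quantizer */
            }
        }
        return best;                /* -1 if even qmax-1 does not fit */
    }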


[...]
> +static int dnxhd_encode_picture(AVCodecContext *avctx, unsigned char *buf, int buf_size, void *data)
> +{
> +    DNXHDEncContext *ctx = avctx->priv_data;
> +    AVFrame *input_frame = data;
> +    int first_field = 1;
> +    int offset, i, ret;
> +
> +    if (buf_size < ctx->cid_table->frame_size) {
> +        av_log(avctx, AV_LOG_ERROR, "output buffer is too small to compress picture\n");
> +        return -1;
> +    }
> +
> +    av_picture_copy((AVPicture *)ctx->picture, data, PIX_FMT_YUV422P, avctx->width, avctx->height);

why do you copy the input image?


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Good people do not need laws to tell them to act responsibly, while bad
people will find a way around the laws. -- Plato