[FFmpeg-devel] [PATCH] DNxHD/VC-3 encoder
Baptiste Coudurier
baptiste.coudurier
Fri Oct 5 11:53:30 CEST 2007
Hi
Michael Niedermayer wrote:
> Hi
>
> On Wed, Oct 03, 2007 at 10:45:04PM +0200, Baptiste Coudurier wrote:
>> Hi Michael, thanks for the review,
>>
>> Michael Niedermayer wrote:
>>> On Mon, Sep 24, 2007 at 01:00:12PM +0200, Baptiste Coudurier wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Michael Niedermayer wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> On Fri, Sep 07, 2007 at 08:05:28PM +0200, Michael Niedermayer wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On Fri, Sep 07, 2007 at 09:17:49AM -0600, Loren Merritt wrote:
>>>>>>
>>>>>>> On Fri, 7 Sep 2007, Michael Niedermayer wrote:
>>>>>>>
>>>>>>>
>>>>>>>> when people start using qsort you know there's something wrong in their
>>>>>>>> encoder (or at least something suboptimal)
>>>>>>>>
>>>>>>>> so, let's first check if I understand the constraints
>>>>>>>> * it's possible to change qscale per macroblock and it doesn't cost anything
>>>>>>>> * there's a fixed number of bits available for each frame (pre-mpeg1 design)
>>>>>>>>
>>>>>>>> in that case the optimal solution (though not the fastest) is to do a
>>>>>>>> binary search over lambda (comparing the number of bits the frame needs
>>>>>>>> against the number available, to find the best lambda that results in a
>>>>>>>> frame which is not too large)
>>>>>>>>
>>>>>>>> then for each such lambda, find the best qscale for each MB, again by
>>>>>>>> binary search, minimizing SSD + bits*lambda (that way we know the best
>>>>>>>> qscales and how many bits the frame would need)
>>>>>>> Your algorithm produces the same decisions as the one with qsort. I don't
>>>>>>> know which is faster.
>>>>>>>
>>>>>>> Consider the list of possible block encodings sorted by
>>>>>>> delta_ssd/delta_bits. Optimizing for qps at any given lambda will produce
>>>>>>> some prefix of that list. i.e. for any lambda, there is some
>>>>>>> position in the list such that all preceding entries have ssd/bits
>>>>>>> better than lambda, and all following entries have ssd/bits worse
>>>>>>> than lambda. So instead of evaluating bits(lambda) as a black box and
>>>>>>> searching for a specified value of bits, construct the function
>>>>>>> lambda(bits) and evaluate it at the specified value of bits.
>>>>>> yes, but as far as I can see that's not what the current code does, it just
>>>>>> tries qscale and qscale+1; if it tried all qscales then yes, they would
>>>>>> be identical
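For reference, the RD search described above boils down to something like this.
This is just a sketch: the mb_ssd/mb_bits tables (flattened [qscale][mb] arrays)
and the function name are made up for illustration, they are not the patch's
actual fields.

/* needs <stdint.h> (or FFmpeg's common.h) for uint64_t and UINT64_MAX */
static int dnxhd_rd_bits_for_lambda(uint64_t lambda,
                                    const uint64_t *mb_ssd,  /* [qscale][mb] flattened */
                                    const uint32_t *mb_bits, /* [qscale][mb] flattened */
                                    int qmax, int mb_count,
                                    uint8_t *qscales)
{
    int mb, q, total_bits = 0;

    for (mb = 0; mb < mb_count; mb++) {
        uint64_t best_cost = UINT64_MAX;
        int best_q = 1;
        /* pick the qscale minimizing SSD + lambda*bits for this MB
         * (a plain scan here; a real implementation can bisect) */
        for (q = 1; q < qmax; q++) {
            uint64_t cost = mb_ssd[q*mb_count + mb] + lambda*mb_bits[q*mb_count + mb];
            if (cost < best_cost) {
                best_cost = cost;
                best_q    = q;
            }
        }
        qscales[mb] = best_q;
        total_bits += mb_bits[best_q*mb_count + mb];
    }
    return total_bits;
}

The outer loop then bisects lambda: a smaller lambda spends more bits, a larger
one fewer, so you keep the smallest lambda whose total still fits in
ctx->frame_bits.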
>>>>> just to clarify my original review ...
>>>>> I don't mind if the code uses qsort; what I was complaining about was that
>>>>> it isn't finding the optimal (in the RD sense) solution
>>>>>
>>>>> if it's changed to find the optimal one, or it's demonstrated that the
>>>>> PSNR/bitrate difference is negligible, then I am happy with that part
>>>>>
>>>> I tried Lagrange. I keep the qsort RC because it is usable (> real time
>>>> on a quad-core CPU with variance, -5 fps for SSD), while the RDO one is
>>>> encoding at 4 fps :/
>>>>
>>>> I'm wondering if I screwed up the implementation, because I get an
>>>> IMHO negligible gain with RDO.
>>>>
>>>> I reused existing quant functions.
>>>>
>>>> Threaded functions must return int, otherwise you get a warning.
>>>>
>>>> About encode_dc, it will be complicated to reuse the existing ones because:
>>>> mpeg12 uses some static tables and generation code;
>>>> mjpeg uses uint16_t for dc_codes.
>>>>
>>>> About mb_var_thread, the existing function uses current_picture and sets
>>>> variables I don't need (mb_mean), and uses the me field too ...
>>>>
>>>> Attached are some PSNR results with one uncompressed HD sample I have.
>>>>
>>>> Any other optimizations besides what Loren suggested earlier?
>>>
>>> [...]
>>>
>>>> +static int dnxhd_init_qmat(DNXHDEncContext *ctx, int lbias, int cbias)
>>>> +{
>>>> + int qscale, i;
>>>> + int (*qmat_luma )[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_luma));
>>>> + int (*qmat_chroma)[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_chroma));
>>>> + uint16_t (*qmat16_luma )[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_luma));
>>>> + uint16_t (*qmat16_chroma)[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_chroma));
>>>> +
>>>> + for (qscale = 1; qscale < ctx->avctx->qmax; qscale++) {
>>>> + for (i = 1; i < 64; i++) {
>>>> + int j = ff_zigzag_direct[i];
>>>> + qmat_luma [qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table-> luma_weigth[i]));
>>>> + qmat_chroma[qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->chroma_weigth[i]));
>>>> +
>>>> + qmat16_luma [qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table-> luma_weigth[i]);
>>>> + qmat16_chroma[qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->chroma_weigth[i]);
>>>> +
>>>> + if (qmat16_luma [qscale][0][j]==0 || qmat16_luma [qscale][0][j]==128*256) qmat16_luma [qscale][0][j]=128*256-1;
>>>> + if (qmat16_chroma[qscale][0][j]==0 || qmat16_chroma[qscale][0][j]==128*256) qmat16_chroma[qscale][0][j]=128*256-1;
>>>> +
>>>> + qmat16_luma [qscale][1][j] = ROUNDED_DIV(lbias<<(16-QUANT_BIAS_SHIFT), qmat16_luma [qscale][0][j]);
>>>> + qmat16_chroma[qscale][1][j] = ROUNDED_DIV(cbias<<(16-QUANT_BIAS_SHIFT), qmat16_chroma[qscale][0][j]);
>>>> + }
>>>> + }
>>>> +
>>>> + ctx->q_scale_matrixl = qmat_luma;
>>>> + ctx->q_scale_matrixc = qmat_chroma;
>>>> + ctx->q_scale_matrixl16 = qmat16_luma;
>>>> + ctx->q_scale_matrixc16 = qmat16_chroma;
>>>> + return 0;
>>>> +}
>>>
>>> duplicate of convert_matrix()
>> Yes, there are some tiny differences, and I need more code to readjust the
>> tables, and it seems bigger than duplicating convert_matrix. What
>> should I do?
>
> does it work with all DCTs? the faan DCT, the slow and fast jpeg DCTs?
> I've not checked, but I had the feeling it won't work with all of them
> if it works, I of course prefer the smaller and simpler solution
Ok.
>>> [...]
>>>
>>>> + dsputil_init(&ctx->m.dsp, avctx);
>>>> + ff_init_scantable(ctx->m.dsp.idct_permutation, &ctx->m.intra_scantable, ff_zigzag_direct);
>>>> +
>>>> + ctx->m.avctx = avctx;
>>>> + ctx->m.dct_quantize = dct_quantize_c;
>>>
>>> does this disable the optimized quantize functions?
>> Not really, since MPV_common_init_mmx will reassign the optimized functions
>> later. I changed it nonetheless.
>>
>>> what about quantize_trellis? (no, it's no reason to delay the patch; making
>>> trellis work could very well be in a later patch, I am just asking)
>> Well, I'd like the encoder to be committed so I have a real history of the
>> work. I find it easier to work when the code is in svn (diffs).
>>
>
>> I had a quick look at trellis but I don't understand it completely yet
>> (ac_esc_length is not set by the mpeg2 encoder for example, and the levels
>> taken into account seem different from the dnxhd levels; once I get it I
>> could just as well do it for mjpeg)
>
> trellis for mjpeg would be cool :)
> maybe we could beat jpeg2000 that way ;)
Or snow ;)
Seriously, is it wanted or not? Is there some documentation of the current
function in mpegvideo_enc.c somewhere?
> [...]
>> How does it look like ?
>
> [...]
>
>> Index: libavcodec/dnxhdenc.c
>> ===================================================================
>> --- libavcodec/dnxhdenc.c (revision 0)
>> +++ libavcodec/dnxhdenc.c (revision 0)
> [...]
>> +//#define DEBUG
>> +#define RC_VARIANCE 1 // use variance or ssd for fast rc
>> +
>> +#include "avcodec.h"
>> +#include "dsputil.h"
>> +#include "mpegvideo.h"
>> +#include "dnxhddata.h"
>> +
>> +int dct_quantize_c(MpegEncContext *s, DCTELEM *block, int n, int qscale, int *overflow);
>
> totally unrelated to your patch but this needs a ff_ prefix
Yes.
> [...]
>> +typedef struct DNXHDEncContext {
>> + MpegEncContext m; ///< Used for quantization dsp functions
>> +
>> + AVCodecContext *avctx;
>> + AVFrame *picture;
>> + int cid;
>> + const CIDEntry *cid_table;
>
>> + uint8_t *msips;
>
> what's a msips? this could benefit from a comment I think
Renamed it to msip and added a comment.
> [...]
>> + ctx->m.avctx = avctx;
>> + ctx->m.mb_intra = 1;
>> + ctx->m.h263_aic = 1;
>> +
>> + MPV_common_init_mmx(&ctx->m);
>> + if (!ctx->m.dct_quantize)
>> + ctx->m.dct_quantize = dct_quantize_c;
>
> I am an idiot, I was staring at this for maybe 2 minutes; somehow I knew
> it was wrong, I don't understand why I didn't realize it immediately ...
>
> _mmx()
>
> you don't even know if the CPU is an x86 ...
Added an ifdef around it.
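Something like this (just a sketch; using HAVE_MMX as the guard here is my
assumption):

#ifdef HAVE_MMX
    MPV_common_init_mmx(&ctx->m); /* may install the MMX dct_quantize on x86 */
#endif
    if (!ctx->m.dct_quantize)
        ctx->m.dct_quantize = dct_quantize_c;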
> [...]
>> + if (i&2) {
>> + n = 1 + (i&1);
>> + ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
>> + ctx->m.q_intra_matrix = ctx->qmatrix_c;
>> + } else {
>> + n = 0;
>> + ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
>> + ctx->m.q_intra_matrix = ctx->qmatrix_l;
>> + }
>> +
> [...]
>
>> + if (i&2) {
>> + n = 1 + (i&1);
>> + ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
>> + ctx->m.q_intra_matrix = ctx->qmatrix_c;
>> + } else {
>> + n = 0;
>> + ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
>> + ctx->m.q_intra_matrix = ctx->qmatrix_l;
>> + }
>
> duplicated
Added a dnxhd_switch_matrix helper.
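Roughly like this (a sketch; the actual helper in the patch may look slightly
different):

static inline int dnxhd_switch_matrix(DNXHDEncContext *ctx, int i)
{
    if (i&2) {
        /* chroma block: select chroma quant matrices, return the plane index */
        ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
        ctx->m.q_intra_matrix   = ctx->qmatrix_c;
        return 1 + (i&1);
    } else {
        /* luma block */
        ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
        ctx->m.q_intra_matrix   = ctx->qmatrix_l;
        return 0;
    }
}

so both call sites reduce to n = dnxhd_switch_matrix(ctx, i);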
> [...]
>> +static int dnxhd_find_qscale(DNXHDEncContext *ctx)
>> +{
>> + int bits = 0;
>> + int up_step = 1;
>> + int down_step = 1;
>> + int last_higher = 0;
>> + int last_lower = INT_MAX;
>> + int qscale;
>> + int x, y;
>> +
>> + qscale = ctx->qscale;
>> + for (;;) {
>> + bits = 0;
>> + ctx->qscale = qscale;
>> + // XXX avoid recalculating bits
>> + ctx->avctx->execute(ctx->avctx, dnxhd_calc_bits_thread, (void**)&ctx->thread[0], NULL, ctx->avctx->thread_count);
>> + for (y = 0; y < ctx->m.mb_height; y++) {
>> + for (x = 0; x < ctx->m.mb_width; x++)
>> + bits += ctx->mb_rc[qscale][y*ctx->m.mb_width+x].bits;
>> + bits = (bits+31)&~31; // padding
>> + if (bits > ctx->frame_bits)
>> + break;
>> + }
>> + //dprintf(ctx->avctx, "%d, qscale %d, bits %d, frame %d\n", ctx->avctx->frame_number, qscale, bits, ctx->frame_bits);
>> + if (bits < ctx->frame_bits) {
>> + if (qscale == 1)
>> + break;
>> + if (last_higher == qscale - 1) {
>> + qscale = last_higher;
>> + break;
>> + }
>> + if (qscale < last_lower)
>> + last_lower = qscale;
>> + qscale -= down_step++;
>
> this looks like it can make qscale negative if bits stays small enough
Added qscale < 1 test.
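i.e. something like:

            qscale -= down_step++;
            if (qscale < 1)
                qscale = 1;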
>> + up_step = 1;
>> + } else {
>> + if (last_lower == qscale + 1)
>> + break;
>> + if (qscale > last_higher)
>> + last_higher = qscale;
>> + qscale += up_step++;
>> + down_step = 1;
>> + if (qscale >= ctx->avctx->qmax)
>> + return -1;
>> + }
>
> this is not optimal (speed-wise) IMHO
> if you have a lower and upper qscale then simple bisection should be
> faster
Done.
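The search now keeps stepping until it has both bounds, then bisects; roughly
like this (a sketch, with dnxhd_qscale_bits() standing in for the
dnxhd_calc_bits_thread pass, not a real function name):

    /* invariant: bits(last_higher) > frame_bits, bits(last_lower) <= frame_bits */
    while (last_higher + 1 < last_lower) {
        qscale = (last_higher + last_lower) / 2;
        if (dnxhd_qscale_bits(ctx, qscale) <= ctx->frame_bits)
            last_lower  = qscale; /* fits: try a smaller qscale for better quality */
        else
            last_higher = qscale; /* too many bits: need a coarser qscale */
    }
    ctx->qscale = last_lower;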
> [...]
>> +static int dnxhd_encode_picture(AVCodecContext *avctx, unsigned char *buf, int buf_size, void *data)
>> +{
>> + DNXHDEncContext *ctx = avctx->priv_data;
>> + AVFrame *input_frame = data;
>> + int first_field = 1;
>> + int offset, i, ret;
>> +
>> + if (buf_size < ctx->cid_table->frame_size) {
>> + av_log(avctx, AV_LOG_ERROR, "output buffer is too small to compress picture\n");
>> + return -1;
>> + }
>> +
>> + av_picture_copy((AVPicture *)ctx->picture, data, PIX_FMT_YUV422P, avctx->width, avctx->height);
>
> why do you copy the input image?
If the picture is not allocated as 1920x1088 I need to copy it.
Added dnxhd_load_picture; the copy is now skipped when it is not needed.
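Roughly (an illustrative sketch; ctx->needs_copy is a made-up flag standing in
for whatever condition the patch really checks):

static void dnxhd_load_picture(DNXHDEncContext *ctx, AVFrame *frame)
{
    int i;

    if (!ctx->needs_copy) {
        /* input planes already cover the full coded height: just alias them */
        for (i = 0; i < 3; i++) {
            ctx->picture->data[i]     = frame->data[i];
            ctx->picture->linesize[i] = frame->linesize[i];
        }
    } else {
        /* otherwise copy into an internally allocated, fully padded picture */
        av_picture_copy((AVPicture*)ctx->picture, (AVPicture*)frame,
                        PIX_FMT_YUV422P, ctx->avctx->width, ctx->avctx->height);
    }
}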
--
Baptiste COUDURIER GnuPG Key Id: 0x5C1ABAAA
SMARTJOG S.A. http://www.smartjog.com
Key fingerprint 8D77134D20CC9220201FC5DB0AC9325C5C1ABAAA
Phone: +33 1 49966312
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dnxhd_encoder7.patch
Type: text/x-diff
Size: 34762 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071005/9f99d701/attachment.patch>