[FFmpeg-soc] Review of dirac code

Thu Aug 16 22:03:42 CEST 2007

Hi

On Thu, Aug 16, 2007 at 07:37:24PM +0200, Marco Gerards wrote:
[...]
> > [...]
> >
> > dirac_arith.h:
> > [...]
> >> typedef struct dirac_arith_state {
> >>     /* Arithmetic decoding.  */
> >>     unsigned int low;
> >>     unsigned int range;
> >>     unsigned int code;
> >>     unsigned int bits_left;
> >>     int carry;
> >>     unsigned int contexts[ARITH_CONTEXT_COUNT];
> >> 
> >>     GetBitContext *gb;
> >>     PutBitContext *pb;
> >
> > GetBitContext gb;
> > PutBitContext pb;
> > would avoid a pointer dereference if these 2 are used in the innermost
> > loops, though maybe their use can be avoided by reading/writing 8bit
> > at a time similar to how our cabac reader and range coder works
> 
> Can I just copy the GetBitContext, PutBitContext at initialization
> time, or is there a more efficient way to do this?

no, its ok

> 
> As for using bytes, perhaps I could do this at a later time.  I think
> this is quite complex and thus time consuming for me.  Or is this a
> must have for you atm?

no its not important, it can be done later

[...]
> >>     { 4, 0, 1, 12, 16, 12, 16, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 768  },
> >>     { 4, 0, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
> >>     { 4, 6, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
> >>     { 4, 6, 0, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 }
> >> };
> >
> > maybe char/short or (u)intXY_t could be used in the structs for these tables
> > so as to reduce the size of the tables (except for things accessed in the
> > innermost loops, there int might be faster on some architectures ...)
> 
> Done.
> 
> Is there some rule of thumb to use for this?  So when to use ints so
> it will be faster on same architectures vs. using small datatypes to
> be more cache friendly?

that is a good question
i tend to use the smallest data type possible for arrays and int for
everything else unless theres some other data type needed

though thats just what i do, i dont know how often this is the best choice
when in doubt and the code is importanat, benchmark (preferably on several
architectures...)

[...]
> >
> > [...]
> >>     /* Framerate.  */
> >>     if (get_bits(gb, 1)) {
> >>         int idx = svq3_get_ue_golomb(gb);
> >>         if (! idx) {
> >>             s->source.frame_rate.num = svq3_get_ue_golomb(gb);
> >>             s->source.frame_rate.den = svq3_get_ue_golomb(gb);
> >>         } else {
> >>             /* Use a pre-set framerate.  */
> >>             s->source.frame_rate = preset_frame_rates[idx - 1];
> >>         }
> >
> > idx should be checked against the table size
> 
> I still have to do this.  Why is this needed actually?  For broken
> encoders or so?

well if you dont check the code can segfault, and there may be situations
where this can be annoying though its not really important
there are probably plenty of ways to make libavcodec segfault if someone
wanted to achive that

[...]
> > [...]
> >> static int inline coeff_quant_factor(int idx) {
> >>     uint64_t base;
> >>     idx = FFMAX(idx, 0);
> >
> > is <0 allowed? if no this should be checked for somewhere and lead to failure
> > to decode the frame
> 
> I think it is.  At least this check is required by the specification.

if its allowed then iam curious how it can become negative, except by
a too large golomb_ue which happens to set the sign bit ...
this sounds a little odd to be intended behavior ...

[...]
> >> 
> >>     if (! magnitude)
> >>         return 0;
> >
> > assuming qfactor is not 0 this check could be done before the multiply
> 
> It seems that it can become 0, although this seems a bit weird to me.

i dont see how it could become 0 ...

[...]
> >> 
> >>     /* Check if there is a zero to the left and top left of this
> >>        coefficient.  */
> >>     if (v > 0 && ((data[x + (y - 1) * s->padded_width])
> >>                   || ( h > 0 && data[x + (y - 1) * s->padded_width - 1])))
> >>         return 0;
> >>     else if (h > 0 && data[x + y * s->padded_width - 1])
> >>         return 0;
> >> 
> >>     return 1;
> >
> > if you make the data array slightly bigger and give it a 0 border 
> > (this might be too hard)
> > or if you have seperate functions for the border and middle(h>0,v>0)
> > then this would be just a
> > return !(data[-s->padded_width]|data[-1]);
> > assuming data has already been corrected to point to x+y*padded_width
> 
> In that can I would have to make two versions of the calling function,
> right?  Or I would have to introduce an if there, but that would be
> moving the problem around, I think?

there are many possible solutions ...
2 functions is one

a
static inline zero_neighbourhood(..., int border){
}

which is called as zero_neighbourhood(..., 0) or 
 zero_neighbourhood(..., 1)

would be anoter one, here gcc should with some luck inline the function
and optimize the part out which isnt needed ...

[...]
> > [...]
> >> /**
> >>  * Predict the motion vector
> >>  *
> >>  * @param x    horizontal position of the MC block
> >>  * @param y    vertical position of the MC block
> >>  * @param ref reference frame
> >>  * @param dir direction horizontal=0, vertical=1
> >>  */
> >> static int motion_vector_prediction(DiracContext *s, int x, int y,
> >>                                     int ref, int dir) {
> >>     int cnt = 0;
> >>     int left = 0, top = 0, lefttop = 0;
> >> 
> >>     if (x > 0) {
> >>         /* Test if the block to the left has a motion vector for this
> >>            reference frame.  */
> >>         if (!s->blmotion[y * s->blwidth + x - 1].use_global
> >>             && s->blmotion[y * s->blwidth + x - 1].use_ref[ref]) {
> >>             left = s->blmotion[y * s->blwidth + x - 1].vect[ref][dir];
> >>             cnt++;
> >>         }
> >> 
> >>         /* This is the only reference, return it.  */
> >>         if (y == 0)
> >>             return left;
> >>     }
> >> 
> >>     if (y > 0) {
> >>         /* Test if the block above the current one has a motion vector
> >>            for this reference frame.  */
> >>         if (!s->blmotion[(y - 1) * s->blwidth + x].use_global
> >>             && s->blmotion[(y - 1) * s->blwidth + x].use_ref[ref]) {
> >>             top = s->blmotion[(y - 1) * s->blwidth + x].vect[ref][dir];
> >>             cnt++;
> >>             }
> >> 
> >>         /* This is the only reference, return it.  */
> >>         if (x == 0)
> >>             return top;
> >>     }
> >> 
> >>     if (x > 0 && y > 0) {
> >>         /* Test if the block above the current one has a motion vector
> >>            for this reference frame.  */
> >>         if (!s->blmotion[(y - 1) * s->blwidth + x - 1].use_global
> >>             && s->blmotion[(y - 1) * s->blwidth + x - 1].use_ref[ref]) {
> >>             lefttop = s->blmotion[(y - 1) * s->blwidth + x - 1].vect[ref][dir];
> >>             cnt++;
> >>         }
> >>     }
> >> 
> >>     /* No references for the prediction.  */
> >>     if (cnt == 0)
> >>         return 0;
> >> 
> >>     if (cnt == 1)
> >>         return left + top + lefttop;
> >> 
> >>     /* Return the median of two motion vectors.  */
> >>     if (cnt == 2)
> >>         return (left + top + lefttop + 1) >> 1;
> >> 
> >>     /* Return the median of three motion vectors.  */
> >>     return mid_pred(left, top, lefttop);
> >> }
> >
> > all the checks above are done twice, once with dir=0 then dir=1 maybe that
> > can be avoided though it seems diracs design doesnt make that easy :(
> 
> I am afraid that won't improve much and just makes the code unreadable
> :-/.  Would you mind if I leave the current code like it is?

leave it, but flame the dirac devels for the crappy design they should
store the dir=0 and dir=1 values nicely in pairs not oddly partitioned

[...]
> > [...]
> >> /**
> >>  * IDWT transform (5,3) for a specific subband
> >>  *
> >>  * @param data coefficients to transform
> >>  * @param level level of the current transform
> >>  * @return 0 when successful, otherwise -1 is returned
> >>  */
> >> static int dirac_subband_idwt_53(DiracContext *s,
> >>                                  int16_t *data, int level) {
> >
> > the whole (inverse) wavelet transform related code could be split into its
> > own file
> 
> What is the filename you propose?  dirac_wavelet.[c]h or so?  Or would
> something more generic be more appropriate so snow and jpeg2000 can
> share code or at least a file?

its very easy to change the filename later in svn as well as git so it
really doesnt matter dirac_wavelet.c/h sounds ok

> 
> >>     int16_t *synth, *synthline;
> >>     int x, y;
> >>     int width = subband_width(s, level);
> >>     int height = subband_height(s, level);
> >>     int synth_width = width  << 1;
> >>     int synth_height = height << 1;
> >> 
> >> START_TIMER
> >> 
> >>     synth = av_malloc(synth_width * synth_height * sizeof(int16_t));
> >>     if (!synth) {
> >>         av_log(s->avctx, AV_LOG_ERROR, "av_malloc() failed\n");
> >>         return -1;
> >>     }
> >
> > theres no need to do a malloc() during idwt
> 
> It is used for IDWT reordering.  Do you mean that this can be done in
> place?

i meant that the malloc() can be done during init instead of during
decode_frame()

and yes it can also be done in place too or the reordering can be
done during the lifting ...

> 
> >
> >> 
> >>     dirac_subband_idwt_reorder(s, data, synth, level);
> >> 
> >>     /* LeGall(5,3)
> >>        First lifting step)
> >>        Even, predict, s=5, t_{-1}=-1, t_0=9, t_1=9, t_2=-1:
> >>          A[2*n]   -= (-A[2*n-1] + A[2*n+1] + 2) >> 2
> >> 
> >>        Second lifting step)
> >>        Odd, update, s=1, t_0=1, t_1=1:
> >>          A[2*n+1] += (A[2*n] + A[2*n+2] + 1) >> 1
> >>     */
> >> 
> >>     /* Vertical synthesis: Lifting stage 1.  */
> >>     synthline = synth;
> >>     for (x = 0; x < synth_width; x++) {
> >>         synthline[x] -= (synthline[synth_width + x]
> >>                        + synthline[synth_width + x]
> >>                        + 2) >> 2;
> >>     }
> >>     synthline = synth + (synth_width << 1);
> >>     for (y = 1; y < height - 1; y++) {
> >>         for (x = 0; x < synth_width; x++) {
> >>             synthline[x] -= (synthline[x - synth_width]
> >>                            + synthline[x + synth_width]
> >>                            + 2) >> 2;
> >>         }
> >>         synthline += (synth_width << 1);
> >>     }
> >>     synthline = synth + (synth_height - 2) * synth_width;
> >>     for (x = 0; x < synth_width; x++) {
> >>         synthline[x] -= (synthline[x - synth_width]
> >>                        + synthline[x + synth_width]
> >>                        + 2) >> 2;
> >>     }
> >> 
> >>     /* Vertical synthesis: Lifting stage 2.  */
> >>     synthline = synth + synth_width;
> >>     for (x = 0; x < synth_width; x++)
> >>         synthline[x] += (synthline[x - synth_width]
> >>                        + synthline[x + synth_width]
> >>                        + 1) >> 1;
> >>     synthline = synth + (synth_width << 1);
> >>     for (y = 1; y < height - 1; y++) {
> >>         for (x = 0; x < synth_width; x++) {
> >>             synthline[x + synth_width] += (synthline[x]
> >>                                          + synthline[x + synth_width * 2]
> >>                                          + 1) >> 1;
> >>         }
> >>         synthline += (synth_width << 1);
> >>     }
> >>     synthline = synth + (synth_height - 1) * synth_width;
> >>     for (x = 0; x < synth_width; x++)
> >>         synthline[x] += (synthline[x - synth_width]
> >>                        + synthline[x - synth_width]
> >>                        + 1) >> 1;
> >> 
> >> 
> >>     /* Horizontal synthesis.  */
> >>     synthline = synth;
> >>     for (y = 0; y < synth_height; y++) {
> >> 
> >>         /* Lifting stage 1.  */
> >>         synthline[0] -= (synthline[1]
> >>                        + synthline[1]
> >>                        + 2) >> 2;
> >>         for (x = 1; x < width - 1; x++) {
> >>             synthline[2*x] -= (synthline[2*x - 1]
> >>                              + synthline[2*x + 1]
> >>                              + 2) >> 2;
> >>         }
> >>         synthline[synth_width - 2] -= (synthline[synth_width - 3]
> >>                                      + synthline[synth_width - 1]
> >>                                      + 2) >> 2;
> >> 
> >>         /* Lifting stage 2.  */
> >>         for (x = 0; x < width - 1; x++) {
> >>             synthline[2*x + 1] += (synthline[2*x]
> >>                                  + synthline[2*x + 2]
> >>                                  + 1) >> 1;
> >>         }
> >>         synthline[synth_width - 1] += (synthline[synth_width - 2]
> >>                                      + synthline[synth_width - 2]
> >>                                      + 1) >> 1;
> >>         synthline += synth_width;
> >>     }
> >
> > as already said the stages can be interleaved so that only a single pass
> > is done over the data, this should significantly speed the code up
> > as its likely limited by memory speed ...
> 
> Yes...  I now tried this (for 5/3), but I am afraid I did this the
> wrong way.  I assume I have to be very careful and apply the second
> stage only for lines that will not be used by the first stage anymore?

yes, the stuff must be done in order

[...]
> >
> > this is VERY inefficient, most of what is done here does not need to be done
> > per pixel or can be precalculated
> 
> Absolutely!  gprof agrees with you ;-)
> 
> Actually, this comment applies to all code above, I assume.
> 
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  40.85      2.41     2.41 88939320     0.00     0.00  upconvert
>  29.49      4.15     1.74       50    34.80   113.59  dirac_decode_frame
>  10.34      4.76     0.61   417867     0.00     0.00  motion_comp_block1ref
>   6.78      5.16     0.40      540     0.74     0.74  dirac_subband_idwt_53
>   4.24      5.41     0.25  8689089     0.00     0.00  coeff_unpack
>   4.07      5.65     0.24  8707719     0.00     0.00  dirac_arith_read_uint
> 
> Luca and I talked about if I would concentrate on further
> optimizations or on the encoder and he said he preferred having me
> working on the encoder for now, because I know Dirac quite well by now
> (Luca, please tell me if I am wrong in any way).
> 
> So there is quite some room for optimizations.  In the specification,
> the used pseudo code looped over the pixels.  My current code at least
> doesn't do that.  But I agree, the MC code is where a lot of
> optimizations can be made:
> 
> - The qpel/eightpel interpolation code is very inefficient and perhaps
>   has to be merged with motion_comp_block1ref and
>   motion_comp_block2ref.  In that case it can be optimized.  We
>   especially have to take care of border cases to avoid clipping, more
>   or less like I did for halfpel interpolation.

simply making upconvert() work not with 1 pixel but with a block of pixels
should make it alot more efficient

> 
> Perhaps we have to upconvert the entire frame at beforehand like I did
> for halfpel interpolation, but this might be inefficient (because of
> caching).  I am not sure what would be more efficient, for this
> judgement I simply lack the experience.
> 

> - The spatial weighting matrix is the same for the center of the
>   screen, at the borders at certain offsets, etc.  Perhaps we could
>   either precalculate them or even add some as constant tables.
>   Especially non-border cases can be significantly improved because
>   most pixels are not at the border :-)

yes

> 
> - The multiplication with the picture weights can perhaps be joined in
>   these tables?

yes

> 
> Actually, this is the code which I dislike the most.  But we had to
> make a choise on how to spend the time.  Is this a showstopper for
> you?

no, though its something which should be optimized its likely the worst
part of the dirac code ATM

> 
> I hope people will be interested in optimizing my code, vectorizing
> it, etc. after summer of code.  I certainly am as much as time permits
> (I have to graduate soon...).  So after FFmpeg, feel free to add me to
> the MAINTAINERS list for Dirac, if you think I am up to it and if
> you'd like that.

add yourself now to maintainers, you have svn write access :)

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-soc/attachments/20070816/0c12a307/attachment.pgp>