[FFmpeg-cvslog] r10223 - in trunk/libavcodec/i386: dsputil_mmx.c snowdsp_mmx.c

Michael Niedermayer michaelni
Mon Aug 27 21:06:45 CEST 2007


Hi

On Mon, Aug 27, 2007 at 03:07:51PM +0200, Reimar D?ffinger wrote:
> Hello,
> On Mon, Aug 27, 2007 at 01:05:03PM +0200, Michael Niedermayer wrote:
> [...]
> > if you want to fix it i dont mind helping, awnsering questions (even stupid
> > questions about asm)
> > i do mind fixing it myself because
> > 1.i dont have a SSE2 system here, i only have ssh access to one which isnt
> >   that good for actually watching the result ...
> > 2.if someone else fixes it that means someone else will afterwards be more
> >   familiar with the code and that means more potential developers who might
> >   improve it or other asm code in ffmpeg ...
> 
> Well, I do not have too much time, but still here is an attempt to
> partially fix some things and add some helpful comments.
> I think inner_add_yblock_bw_8_obmc_16_bh_even_sse2 works now, though
> though I only get a black-and-white image...

if you get black and white then it likely does not work or what happens if
you call the c code instead of it?


> I don't claim the performance to be great or anything, and I am also
> wondering about slice_buffer_get_line that is used by the C code but not
> the asm...
> Maybe you can comment on a few things, like which of the comments you
> think should stay, if adding the "&& add" is correct and if yes if the
> add == 0 case is worth optimizing etc.

this function isnt called with add==0 currently

about the comments, well i do not like the way the code looks at all its
totally unreadable, id rather see the code implemented with less macro
obfuscation than comments explaining what the macros do and what they
need in what register

putting common code in a macro is one thing but putting the opening asm(
with the first 5 (likely unneeded) instructions in one and the last
3 lines with the asm constraints in another well its like
putting 

functionX(...){
int i;

in a macro and putting
return i; 
}

in another

and the function body in a third ...

no half sane c programmer would do that even if it reduces the number of lines
of code, why is it done with asm here, i dont know


[...]
n\t"
>  
> +/* load two rows (8 bytes each) of block[ptr_offset] expanded to 16 bits into
> + * out_reg1 and out_reg2 and multiply by the 8 bytes at
> + * obmc[2*y*obmc_stride + s_offset] and  obmc[(2*y + 1)*obmc_stride + s_offset]
> + */
>  #define snow_inner_add_yblock_sse2_start_8(out_reg1, out_reg2, ptr_offset, s_offset)\
>               "mov "PTR_SIZE"*"ptr_offset"(%%"REG_a"), %%"REG_d"; \n\t"\
>               "movq (%%"REG_d"), %%"out_reg1" \n\t"\

not doxygen compatible


> @@ -689,47 +702,35 @@
>  static void inner_add_yblock_bw_8_obmc_16_bh_even_sse2(const uint8_t *obmc, const long obmc_stride, uint8_t * * block, int b_w, long b_h,
>                        int src_x, int src_y, long src_stride, slice_buffer * sb, int add, uint8_t * dst8){
>  snow_inner_add_yblock_sse2_header
> +// xmm1/5 = obmc1[x] * block[3][x + y*src_stride]
>  snow_inner_add_yblock_sse2_start_8("xmm1", "xmm5", "3", "0")
> +// xmm1/5 += obmc2[x] * block[2][x + y*src_stride]
>  snow_inner_add_yblock_sse2_accum_8("2", "8")
> +// xmm1/5 += obmc3[x] * block[1][x + y*src_stride]
>  snow_inner_add_yblock_sse2_accum_8("1", "128")
> +// xmm1/5 += obmc4[x] * block[0][x + y*src_stride]
>  snow_inner_add_yblock_sse2_accum_8("0", "136")
>  
>               "mov %0, %%"REG_d"              \n\t"
> -             "movdqa (%%"REG_D"), %%xmm0     \n\t"
> -             "movdqa %%xmm1, %%xmm2          \n\t"
> +             "movdqu (%%"REG_D"), %%xmm0     \n\t"

why an unaligned read?


>  
> -             "punpckhwd %%xmm7, %%xmm1       \n\t"
> -             "punpcklwd %%xmm7, %%xmm2       \n\t"
> -             "paddd %%xmm2, %%xmm0           \n\t"
> -             "movdqa 16(%%"REG_D"), %%xmm2   \n\t"
> -             "paddd %%xmm1, %%xmm2           \n\t"
> +             "paddd %%xmm1, %%xmm0           \n\t"

no, these are 16 not 32 bit
also theres some shift by 4 missing here


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-cvslog/attachments/20070827/e515ec67/attachment.pgp>



More information about the ffmpeg-cvslog mailing list