[FFmpeg-devel] [WIP] [PATCH 4/4] x86: dsputilenc: convert hf_noise*_mmx to yasm

Mon Jun 2 15:31:09 CEST 2014

On Sun, Jun 01, 2014 at 08:33:16PM -0700, Timothy Gu wrote:
> On Jun 1, 2014 6:36 PM, "Michael Niedermayer" <michaelni at gmx.at> wrote:
> > > +%if %1 == 16
> > > +    push pix1q
> > > +    push hq
> > > +%endif
> >
> 
> > dont use push/pop they can messup the yasm magic macros
> > you can use PUSH/POP but better dont use them either, there should be
> > enough registers
> 
> With the other local variable, there will be 6 registers used, which is IMO
> a lot for a function like this. Is there any significant performance
> penalty using PUSH/POP vs. local variable?

using an additional register has no cost, it might in some cases
require the yasm magic to push and pop it to preserve its content
for the caller but thats what you are doing too

also see http://www.agner.org/optimize/#manuals

> 
> > > +    sub        hd, 2
> > > +    pxor       m7, m7
> > > +    pxor       m6, m6
> > > +    HF_NOISE_PART1 %1, 0, 1, 2, 3
> > > +    add     pix1q, lsizeq
> > > +    HF_NOISE_PART1 %1, 4, 1, 5, 3
> > > +    HF_NOISE_PART2     0, 2
> > > +    add     pix1q, lsizeq
> > > +.loop:
> > > +    HF_NOISE_PART1 %1, 0, 1, 2, 3
> > > +    HF_NOISE_PART2     4, 5
> > > +    add     pix1q, lsizeq
> > > +    HF_NOISE_PART1 %1, 4, 1, 5, 3
> > > +    HF_NOISE_PART2     0, 2
> > > +    add     pix1q, lsizeq
> > > +    sub        hd, 2
> > > +        jne .loop
> > > +
> > > +    mova       m0, m6
> > > +    punpcklwd  m0, m7
> > > +    punpckhwd  m6, m7
> > > +    paddd      m6, m0
> > > +    mova       m0, m6
> > > +    psrlq      m6, 32
> > > +    paddd      m0, m6
> > > +%if %1 == 16
> >
> > > +    movd      ebx, m0   ; ebx = result of hf_noise16;
> >
> > you cant just write into a random register
> > declare a local variable in the cglobal macro above and use it instead
> 
> OK. But how about the return value at the end? Is eax specifically designed
> to be clobbered?

yes in functions retunring a 32bit int but

see http://en.wikipedia.org/wiki/X86_calling_conventions
(also maybe someone has a better reference?)

> 
> >
> >
> >
> > > +    pop        hq       ; restore h and pix1
> > > +    pop     pix1q
> > > +    ; lsize is unchanged (except movsxd, which hf_noise8 is going to
> do anyway)
> > > +    add     pix1q, 8    ; pix1 = pix1 + 8;
> >
> > > +    call    hf_noise8   ; eax = hf_noise8_mmx(pix1, lsize, h);
> >
> 
> > dont call cglobal functions, if you do you would have to emulate the
> > calling conventions of all ABIs, x86_32 would pass arguments over the
> > stack for example
> 
> Should I then plug in a version of `HF_NOISE 8` without cglobal and stuff?

a macro might be easier if you duplicate it anyway

> 
> >
> > also looking at the disassembly of the function with gdb and the
> > register values when it crashes (if it does) or single steping through
> > the code wth gdb should help you understand whats the problem or
> > difference between what you want and what the computer actually does
> 
> I spent 2 nights trying to debug this, without any luck. It seems like
> [pix1q+2*lsizeq] is unallocated, as it crashes in the first instruction in
> the loop, the first time it is executed. I can't understand how this would
> happen.

well you can print the address and size of the array you intend to
access with av_log() before the function call and then inspect
in gdb what is actually accessed, it should be outside this printed
rebion if it crashes ...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Opposition brings concord. Out of discord comes the fairest harmony.
-- Heraclitus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20140602/ec207157/attachment.asc>