[FFmpeg-devel] [PATCH] avfilter/x86/vf_blend.asm: add hardmix and phoenix sse2 SIMD

Ronald S. Bultje rsbultje at gmail.com
Wed Oct 7 13:37:08 CEST 2015


Hi,

On Wed, Oct 7, 2015 at 5:38 AM, Paul B Mahol <onemda at gmail.com> wrote:

> Signed-off-by: Paul B Mahol <onemda at gmail.com>
> ---
>  libavfilter/x86/vf_blend.asm    | 62 +++++++++++++++++++++++++++++++++++++++++
>  libavfilter/x86/vf_blend_init.c | 14 ++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/libavfilter/x86/vf_blend.asm b/libavfilter/x86/vf_blend.asm
> index 167e72b..7180817 100644
> --- a/libavfilter/x86/vf_blend.asm
> +++ b/libavfilter/x86/vf_blend.asm
> @@ -27,6 +27,8 @@ SECTION_RODATA
>
>  pw_128: times 8 dw 128
>  pw_255: times 8 dw 255
> +pb_128: times 16 db 128
> +pb_255: times 16 db 255
>
>  SECTION .text
>
> @@ -273,6 +275,36 @@ cglobal blend_darken, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, d
>      jg .nextrow
>  REP_RET
>
> +cglobal blend_hardmix, 9, 10, 3, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
> +    add      topq, widthq
> +    add   bottomq, widthq
> +    add      dstq, widthq
> +    sub      endq, startq
> +    neg    widthq
> +.nextrow:
> +    mov       r10q, widthq
> +    %define      x  r10q
>

You declare 10 regs in cglobal (r0 through r9), but you're using r10, which
is the 11th. Either use r9 here, or declare that you use 11.

Now, more generally, you're using a lot of regs in all your simd, and some
aren't necessary, so here are some lessons about arguments: most arguments
are passed on the stack. On x86-64, the first 4 (win64) or 6 (unix64) are
passed in registers, but the rest (width, start, end) are passed on the
stack. On x86-32, all arguments are passed on the stack. So, with 9
arguments, at least 3 of them are on the stack, including width. That means
you don't have to copy width into r10q; you can load widthmp (the stack
version of this argument) into widthq at the start of each row, since the
caller already put width on the stack for you. x86inc.asm only loads it from
the stack into a register because you wrote cglobal name, N with N >= 7
(width is the 7th argument).

Then, you can also subtract startmp from endq and, on x86-32, store the
result back into endmp; suddenly x86-32 only needs 7 regs (on x86-64, keep
using endd, since a register is faster). And now your simd works on 32-bit
systems as well.

> +    .loop:
> +        movu            m0, [topq + x]
> +        movu            m1, [bottomq + x]
> +        mova            m2, [pb_255]
> +        psubusb         m2, m1


pxor m1, [pb_255] should give the same result as mova reg, [pb_255]
followed by psubusb reg, m1, since 255 - x equals x ^ 255 for every unsigned
byte x.

Now, you're using pb_255 a lot inside your inner loop, and with pxor, you
only use it non-destructively, so why not move it into a register (m3)
outside the loop so you only load it from mem once?
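
As a quick sanity check of the pxor/psubusb equivalence, here is a minimal
scalar sketch in C. It models a single byte lane only; the helper names
sat_sub_u8 and invert_forms_agree are made up for this illustration and are
not part of the patch or of FFmpeg.

```c
#include <stdint.h>

/* Models one byte lane of psubusb: unsigned subtraction that clamps at 0. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Returns 1 if both ways of inverting a byte agree for all 256 values. */
static int invert_forms_agree(void)
{
    /* One lane of pb_255, set up once outside the loop (hypothetical
     * stand-in for keeping the constant in a register such as m3). */
    const uint8_t all_ones = 255;

    for (int x = 0; x < 256; x++) {
        uint8_t via_sub = sat_sub_u8(all_ones, (uint8_t)x); /* mova + psubusb */
        uint8_t via_xor = (uint8_t)(x ^ all_ones);          /* pxor */
        if (via_sub != via_xor)
            return 0;
    }
    return 1;
}
```

Since 255 - x never goes below zero for a byte x, the saturation never
triggers, which is why the plain xor gives the same bits.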

Ronald

