[FFmpeg-devel] [PATCH] avfilter/vf_stereo3d: add x86 SIMD for anaglyph outputs

Ronald S. Bultje rsbultje at gmail.com
Sun Oct 4 23:49:10 CEST 2015


Hi,

On Sun, Oct 4, 2015 at 3:46 PM, Paul B Mahol <onemda at gmail.com> wrote:

> +    .loop:
> +        movd                 m10, [ana_matrix_rq+ 0]
> +        movd                 m11, [ana_matrix_rq+ 4]
> +        movd                 m12, [ana_matrix_rq+ 8]
> +        movd                 m13, [ana_matrix_rq+12]
> +        movd                 m14, [ana_matrix_rq+16]
> +        movd                 m15, [ana_matrix_rq+20]
> +        pshufd               m10, m10, q0000
> +        pshufd               m11, m11, q0000
> +        pshufd               m12, m12, q0000
> +        pshufd               m13, m13, q0000
> +        pshufd               m14, m14, q0000
> +        pshufd               m15, m15, q0000
>
[..]

> +        movd                 m10, [ana_matrix_bq+ 0]
> +        movd                 m11, [ana_matrix_bq+ 4]
> +        movd                 m12, [ana_matrix_bq+ 8]
> +        movd                 m13, [ana_matrix_bq+12]
> +        movd                 m14, [ana_matrix_bq+16]
> +        movd                 m15, [ana_matrix_bq+20]
> +        pshufd               m10, m10, q0000
> +        pshufd               m11, m11, q0000
> +        pshufd               m12, m12, q0000
> +        pshufd               m13, m13, q0000
> +        pshufd               m14, m14, q0000
> +        pshufd               m15, m15, q0000
>

So, you want more registers, right? :-D OK, so let's talk stack usage. You
want aligned stack space here to hold all these constants, so you don't
need to recreate them on every loop iteration.

Change:
cglobal name, n_args, n_gprs, n_xmms, arg1, arg2, arg3
to:
cglobal name, n_args, n_gprs, n_xmms, aligned_memory_in_bytes, arg1, arg2, arg3

In your case, add 6*mmsize*3 bytes of memory: six constants for each of
r, g and b, one mmsize-wide slot apiece.

Now, in the function, prepare the stack space first:

movd m10, [ana_matrix_rq+0]
[etc for the other r args]
pshufd m10, m10, q0000
[etc for the other r args]
mova [rsp+mmsize*0], m10
[etc for the others into rsp+mmsize*1 .. rsp+mmsize*5]

Now do the same for g/b in slots mmsize*6..11 and mmsize*12..17.

Then, inside the loop, use [rsp+mmsize*N] (N = 0..17) as the memory
operand wherever m10-m15 were used before.
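
To make the layout concrete, here is a rough scalar C model of the idea (not
the actual asm). A 4-lane int32 array stands in for one XMM register, and the
CoeffTable struct stands in for the aligned stack area. All names here
(CoeffTable, splat_coeffs, weighted_sum) are made up for illustration, and
the six-term weighted sum in the loop body is an assumption, since that part
of the patch is elided above.

```c
#include <assert.h>
#include <stdint.h>

#define LANES          4  /* int32 lanes in one XMM register (mmsize = 16) */
#define COEFFS_PER_CH  6  /* six matrix constants per output channel */
#define CHANNELS       3  /* r, g, b */

/* Stand-in for the aligned stack area: 6*3 slots, each one register wide. */
typedef struct {
    _Alignas(16) int32_t slot[COEFFS_PER_CH * CHANNELS][LANES];
} CoeffTable;

/* Done once, before the loop: broadcast every constant to all lanes
 * (movd + pshufd q0000) and store it in its slot (mova [rsp+mmsize*n]). */
static void splat_coeffs(CoeffTable *t, const int32_t *r,
                         const int32_t *g, const int32_t *b)
{
    const int32_t *src[CHANNELS] = { r, g, b };

    for (int ch = 0; ch < CHANNELS; ch++)
        for (int i = 0; i < COEFFS_PER_CH; i++)
            for (int l = 0; l < LANES; l++)
                t->slot[ch * COEFFS_PER_CH + i][l] = src[ch][i];
}

/* Inside the loop: the table is only read, never rebuilt.
 * 'in' holds COEFFS_PER_CH input vectors of LANES values each, stored flat.
 * The weighted sum itself is an assumed loop body. */
static void weighted_sum(const CoeffTable *t, int ch,
                         const int32_t *in, int32_t out[LANES])
{
    assert(ch >= 0 && ch < CHANNELS);

    for (int l = 0; l < LANES; l++) {
        int32_t acc = 0;
        for (int i = 0; i < COEFFS_PER_CH; i++)
            acc += t->slot[ch * COEFFS_PER_CH + i][l] * in[i * LANES + l];
        out[l] = acc;
    }
}
```

The point is only the split: splat_coeffs() runs once per call, while
weighted_sum() runs per loop iteration and touches the table read-only, which
is what frees m10-m15 inside the loop.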

> +        packusdw             m1, m1
> +        packuswb             m1, m1
> +        pshufb               m7, m1, [rshuf]

Try to do r/g/b all at the same time (especially now that you have more
registers available, since m10-15 are free): packusdw r and g together,
then packuswb r/g together with b/nothing, so that you end up with a single
output register instead of 3. That also saves you the pors at the end.
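
As a sanity check of the data flow, here is a small scalar C model of that
packing order (a sketch, not the asm). The helper names are made up. The
final interleave mask assumes packed R,G,B byte order in the output, which
the quoted hunk does not show. The inputs are assumed to be already scaled
down to roughly 8-bit range; note that packuswb reads its source words as
signed, so anything with bit 15 set would clamp to 0.

```c
#include <stdint.h>

/* packusdw: two vectors of 4 signed dwords -> 8 unsigned-saturated words. */
static void model_packusdw(uint16_t dst[8], const int32_t a[4],
                           const int32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i]     = a[i] < 0 ? 0 : a[i] > 65535 ? 65535 : (uint16_t)a[i];
        dst[i + 4] = b[i] < 0 ? 0 : b[i] > 65535 ? 65535 : (uint16_t)b[i];
    }
}

/* Saturate one word to a byte, treating the word as signed like packuswb. */
static uint8_t sat_word_to_u8(uint16_t w)
{
    if (w & 0x8000)
        return 0;                      /* negative as a signed word */
    return w > 255 ? 255 : (uint8_t)w;
}

/* packuswb: two vectors of 8 words -> 16 unsigned-saturated bytes. */
static void model_packuswb(uint8_t dst[16], const uint16_t a[8],
                           const uint16_t b[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i]     = sat_word_to_u8(a[i]);
        dst[i + 8] = sat_word_to_u8(b[i]);
    }
}

/* pshufb: byte shuffle; a mask byte with bit 7 set yields zero. */
static void model_pshufb(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 15];
}

/* r/g/b for 4 pixels in, one register of interleaved bytes out. */
static void pack_rgb(uint8_t out[16], const int32_t r[4],
                     const int32_t g[4], const int32_t b[4])
{
    /* Assumed output order: r0 g0 b0 r1 g1 b1 ..., last 4 bytes zeroed. */
    static const uint8_t interleave[16] = {
        0, 4, 8,  1, 5, 9,  2, 6, 10,  3, 7, 11,  0x80, 0x80, 0x80, 0x80
    };
    uint16_t rg[8], bx[8];
    uint8_t  bytes[16];

    model_packusdw(rg, r, g);       /* words: r0..r3 g0..g3 */
    model_packusdw(bx, b, b);       /* words: b0..b3 + don't-care */
    model_packuswb(bytes, rg, bx);  /* bytes: r0..r3 g0..g3 b0..b3 xxxx */
    model_pshufb(out, bytes, interleave);
}
```

With everything in one register, a single shuffle replaces the three
per-channel pshufbs and the pors that merged their results.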

Ronald

