[FFmpeg-devel] [PATCH] avfilter: add hflip x86 SIMD
Martin Vignali
martin.vignali at gmail.com
Sun Dec 3 21:41:15 EET 2017
2017-12-03 20:36 GMT+01:00 Paul B Mahol <onemda at gmail.com>:
> On 12/3/17, Martin Vignali <martin.vignali at gmail.com> wrote:
> >>
> >> In any case, if clang or gcc can generate better code, then the hand
> >> written version needs to be optimized to be as fast or faster.
> >>
> >>
> >>
> > Quick test : pass checkasm (but probably only because width = 256)
> > hflip_byte_c: 26.4
> > hflip_byte_ssse3: 20.4
> >
> >
> > INIT_XMM ssse3
> > cglobal hflip_byte, 3, 5, 2, src, dst, w, x, v, src2
> > mova m0, [pb_flip_byte]
> > xor xq, xq ; <======
> > mov wd, dword wm
> > sub wq, mmsize * 2
> > ;remove the cmp here <======
> > jl .skip
> >
> > .loop0: ; process two xmm in the loop
> > neg xq
> > movu m1, [srcq + xq - mmsize + 1]
> > movu m2, [srcq + xq - mmsize * 2 + 1] <======
> > pshufb m1, m0
> > pshufb m2, m0 <======
> > neg xq
> > movu [dstq + xq], m1
> > movu [dstq + xq + mmsize], m2 <======
> > add xq, mmsize * 2 <======
> > cmp xq, wq
> > jl .loop0
> > RET ; add RET here
> >
> > ; MISSING one xmm process if need
> >
> > .skip:
> > add wq, mmsize
> > .loop1:
> > neg xq
> > mov vb, [srcq + xq]
> > neg xq
> > mov [dstq + xq], vb
> > add xq, 1
> > cmp xq, wq
> > jl .loop1
> > RET
>
> So what is wrong now?
>
I hadn't seen your email when I sent mine.
Checkasm result with your last patch (with "add xq, mmsize" changed to
"add xq, mmsize * 2" in the short version):
hflip_byte_c: 28.0
hflip_byte_ssse3: 127.5
hflip_short_c: 276.5
hflip_short_ssse3: 100.2
Do you think that adding a RET after the end of loop0 can work in all
cases?
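
To make the RET question concrete, here is a simplified C model (a sketch,
not a translation of the asm above). It assumes src points at the last byte
of the row and is read backwards, as in the scalar loop. The names
hflip_byte_early_ret, hflip_byte_with_tail, flip_block and count_mismatches
are hypothetical helpers for this illustration only, and BLOCK stands in for
mmsize with XMM registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16   /* assumption: stands in for mmsize (one XMM register) */
#define MAX_W 512  /* assumption: largest row width used by the helper */

typedef void (*hflip_fn)(const uint8_t *src, uint8_t *dst, int w);

/* Scalar reference: src points at the LAST byte of the row. */
static void hflip_byte_ref(const uint8_t *src, uint8_t *dst, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = src[-x];
}

/* Reverse BLOCK bytes; plays the role of movu + pshufb + movu. */
static void flip_block(const uint8_t *src, uint8_t *dst, int x)
{
    for (int i = 0; i < BLOCK; i++)
        dst[x + i] = src[-x - i];
}

/* Two blocks per iteration, then return immediately ("RET after loop0"). */
static void hflip_byte_early_ret(const uint8_t *src, uint8_t *dst, int w)
{
    for (int x = 0; x + 2 * BLOCK <= w; x += 2 * BLOCK) {
        flip_block(src, dst, x);
        flip_block(src, dst, x + BLOCK);
    }
    /* Columns past the last full pair of blocks are never written. */
}

/* Same main loop, followed by a scalar tail for the remaining columns. */
static void hflip_byte_with_tail(const uint8_t *src, uint8_t *dst, int w)
{
    int x = 0;
    for (; x + 2 * BLOCK <= w; x += 2 * BLOCK) {
        flip_block(src, dst, x);
        flip_block(src, dst, x + BLOCK);
    }
    for (; x < w; x++)
        dst[x] = src[-x];
}

/* Count output bytes that differ from the scalar reference. */
static int count_mismatches(hflip_fn fn, int w)
{
    uint8_t row[MAX_W], ref[MAX_W], out[MAX_W];
    int bad = 0;

    assert(w > 0 && w <= MAX_W);
    for (int i = 0; i < w; i++)
        row[i] = (uint8_t)(i % 251 + 1);  /* never 0, so unwritten bytes show up */
    memset(ref, 0, sizeof(ref));
    memset(out, 0, sizeof(out));

    hflip_byte_ref(row + w - 1, ref, w);
    fn(row + w - 1, out, w);
    for (int i = 0; i < w; i++)
        bad += ref[i] != out[i];
    return bad;
}
```

In this model, the early-return variant matches the reference for width 256,
but for width 40 the last 8 columns are left unwritten, and for any width
below 32 nothing is written at all. The variant with the tail loop matches
the reference for every width tried.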