[FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

Paul B Mahol onemda at gmail.com
Wed Dec 4 10:51:52 EET 2019


On 12/4/19, Song, Ruiling <ruiling.song at intel.com> wrote:
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> chen
>> Sent: Wednesday, December 4, 2019 9:36 AM
>> To: FFmpeg development discussions and patches <ffmpeg-
>> devel at ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
>> SIMD for filter_column()
>>
>>
>>
>> At 2019-12-04 08:59:08, "Song, Ruiling" <ruiling.song at intel.com> wrote:
>> >> -----Original Message-----
>> >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> >> chen
>> >> Sent: Tuesday, December 3, 2019 4:59 PM
>> >> To: FFmpeg development discussions and patches <ffmpeg-
>> >> devel at ffmpeg.org>
>> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add
>> >> X86
>> >> SIMD for filter_column()
>> >>
>> >> comments inline in code
>> >>
>> >>
>> >> At 2019-12-03 15:52:07, xujunzz at sjtu.edu.cn wrote:
>> >> >From: Xu Jun <xujunzz at sjtu.edu.cn>
>> >[...]
>> >> >+
>> >> >+        cvtdq2ps m4, m4
>> >> >+        mulps m4, m0     ; sum *= rdiv
>> >> >+        addps m4, m1     ; sum += bias
>> >>
>> >> >+        addps m4, m5     ; sum += 0.5
>> >> I'm not sure whether pre-computing (bias + 0.5) would cause a precision mismatch.
>>
>> >I think it is hard to prove that pre-computing is safe.
>> Agreed. I am also worried about precision, since the result of
>> floating-point arithmetic depends on the order of operations.
>> How about ROUNDPS?
> It does not seem to be an exact match.
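
To illustrate the ordering concern above, here is a minimal C sketch (not
taken from the patch) that mirrors the mulps/addps/addps/cvttps2dq sequence
in scalar form and compares it with a variant that folds 0.5 into the bias
beforehand. The function names and the input values mentioned below are
assumptions chosen to hit a rounding tie; volatile is used only to force
every intermediate to be rounded to single precision.

```c
/* Order used in the quoted asm: ((sum * rdiv) + bias) + 0.5, then truncate. */
static int round_as_patch(int sum, float rdiv, float bias)
{
    volatile float v = (float)sum * rdiv;  /* cvtdq2ps + mulps */
    v = v + bias;                          /* addps: sum += bias */
    v = v + 0.5f;                          /* addps: sum += 0.5 */
    return (int)v;                         /* cvttps2dq truncates toward zero */
}

/* Hypothetical variant: (bias + 0.5) is computed once, outside the loop. */
static int round_precomputed(int sum, float rdiv, float bias)
{
    volatile float bias_half = bias + 0.5f; /* rounded to float here */
    volatile float v = (float)sum * rdiv;
    v = v + bias_half;
    return (int)v;
}
```

With sum = 1, rdiv = 0x1p-24f and bias = 0.5f - 0x3p-25f, round_as_patch()
returns 1 while round_precomputed() returns 0 under IEEE 754
round-to-nearest-even, because the two orders round at different points.
These inputs are contrived, but they show that the two forms are not
interchangeable in general.
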
>>
>>
>> >
>> >>
>> >>
>> >> >+        cvttps2dq m4, m4
>> >> >+        packssdw m4, m4
>> >> >+        packuswb m4, m4
>> >> >+        movss [dstq + dst_offq], m4
>> >> >+        add c_offq, mmsize/4
>> >> >+        add dst_offq, mmsize/4
>> >> >+
>> >> >+        add off16q, mmsize/4
>> >> >+        cmp off16q, widthq
>> >> >+        jl .loop16
>> >> >+
>> >> >+    add widthq, rq
>> >> >+    cmp off16q, widthq
>> >> >+    jge .paraend
>> >> >+
>> >>
>> >> >+    .loopr:
>> >> I am not sure this loop is needed: if we can read beyond the end, we
>> >> can reuse the SIMD code above.
>> >Reusing the SIMD code above may write to memory that does not belong to
>> >this slice thread.
>>
>> >IMO, the code to handle remainder columns is still necessary.
>>
>>
>> It depends on the algorithm and the size.
>> For example, with width=23:
>> Process #0 [0:15]
>> Process #1 [7:22]
>> Both ranges are a multiple of 16 wide.
> Sounds interesting, but FFmpeg does not do it this way now.
> One question: will there be a penalty for writing to the same memory
> addresses (both write to 7-15) from different threads?

Yes, and it may even produce wrong results.
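
As a single-threaded illustration of the width=23 example quoted above, the
C sketch below compares a scalar remainder loop with a final block shifted
back so that it ends at the row width. BLOCK, filter_pixel() and both
function names are hypothetical stand-ins, not FFmpeg code, and the sketch
assumes that src and dst do not alias and that width >= BLOCK. It does not
cover the case where the shifted block reaches into columns owned by another
slice thread, which is the situation discussed above.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16 /* stand-in for one 16-pixel SIMD iteration (assumption) */

/* Hypothetical per-pixel operation standing in for the real filter. */
static uint8_t filter_pixel(uint8_t in)
{
    return (uint8_t)(255 - in);
}

/* Process exactly BLOCK pixels starting at column x. */
static void process_block(uint8_t *dst, const uint8_t *src, size_t x)
{
    for (size_t i = 0; i < BLOCK; i++)
        dst[x + i] = filter_pixel(src[x + i]);
}

/* Full blocks, then a scalar loop over the remaining columns. */
static void filter_with_remainder(uint8_t *dst, const uint8_t *src, size_t width)
{
    size_t x = 0;
    for (; x + BLOCK <= width; x += BLOCK)
        process_block(dst, src, x);
    for (; x < width; x++)
        dst[x] = filter_pixel(src[x]);
}

/* Full blocks, then one block shifted back to end exactly at width.
 * For width = 23 this processes [0:15] and then [7:22].
 * Requires width >= BLOCK and dst != src, since columns in the overlap
 * are read and written twice. */
static void filter_with_overlap(uint8_t *dst, const uint8_t *src, size_t width)
{
    size_t x = 0;
    for (; x + BLOCK <= width; x += BLOCK)
        process_block(dst, src, x);
    if (x < width)
        process_block(dst, src, width - BLOCK);
}
```

Within one thread both functions write the same output, because the
overlapping columns are simply recomputed from unchanged input. That
guarantee is lost once another thread may be writing those columns at the
same time, or when the filter runs in place.
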

>
>>
>> _______________________________________________
>> ffmpeg-devel mailing list
>> ffmpeg-devel at ffmpeg.org
>> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>
>> To unsubscribe, visit link above, or email
>> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".

