[FFmpeg-devel] [PATCH] SSE2 version of vf_idet's filter_line()

Wed Sep 3 09:16:03 CEST 2014

On 03.09.2014, at 08:38, Pascal Massimino <pascal.massimino at gmail.com> wrote:
> On Tue, Sep 2, 2014 at 10:26 PM, Reimar Döffinger <Reimar.Doeffinger at gmx.de>
> wrote:
> 
>> On 03.09.2014, at 00:49, Pascal Massimino <pascal.massimino at gmail.com>
>> wrote:
>>> On Tue, Sep 2, 2014 at 9:39 AM, Michael Niedermayer <michaelni at gmx.at>
>>> wrote:
>>> 
>>> 
>>> [ahem: ffmpeg doesn't feel like using intrinsics, by chance?]
>> 
>> I tried that about 5 months back, once more.
>> It still results in code that is slower than the plain C version, even
>> when using SIMD, on trivial NEON audio format conversion (same thing in asm
>> was about 8x faster).
>> So you can get the same effect with less effort by just
>> disabling asm code.
>> 
> 
> strange. I exclusively used intrinsics for libwebp (x86, but also
> neon/aarch64) and was pretty
> pleased with the result (say <2% perf loss, but 10x easier maintenance and
> friendliness to non-guru contributors).

I guess you never used uint16x8x2_t and similar types then, because almost any access to them seems to go via the stack.
See the last file of http://lists-archives.com/mplayer-dev-eng/38036-add-neon-optimizations-to-some-critical-audio-functions.html : it spilled the data to the stack twice per loop iteration.
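
For readers unfamiliar with these types, here is a minimal sketch of the kind of code under discussion. It is not taken from the linked patch: deinterleave_u16 and HAVE_NEON_SKETCH are hypothetical names made up for illustration. The NEON path uses a uint16x8x2_t returned by vld2q_u16; on non-NEON targets only the scalar loop is compiled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1
#else
#define HAVE_NEON_SKETCH 0
#endif

/* Hypothetical helper (not from the patch): split interleaved 16-bit
 * samples into two planes. "pairs" is the number of (even, odd) pairs. */
static void deinterleave_u16(const uint16_t *src, uint16_t *even,
                             uint16_t *odd, size_t pairs)
{
    size_t i = 0;

    assert(src && even && odd);
#if HAVE_NEON_SKETCH
    for (; i + 8 <= pairs; i += 8) {
        /* vld2q_u16 loads 16 values and de-interleaves them into a
         * uint16x8x2_t. Ideally both halves stay in q registers; the
         * complaint above is that compilers may spill them instead. */
        uint16x8x2_t v = vld2q_u16(src + 2 * i);
        vst1q_u16(even + i, v.val[0]);
        vst1q_u16(odd + i, v.val[1]);
    }
#endif
    /* Scalar tail, and the whole loop on non-NEON targets. */
    for (; i < pairs; i++) {
        even[i] = src[2 * i];
        odd[i]  = src[2 * i + 1];
    }
}
```

To check for the spills on a given toolchain, compile for an ARM target with something like gcc -O2 -S and look for stack loads and stores inside the loop. Whether they show up depends on the compiler and its version.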