[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

Mark Thompson sw at jkqxz.net
Wed Jul 6 16:58:24 EEST 2016


On 06/07/16 13:28, Dan Parrot wrote:
> On Wed, 2016-07-06 at 09:07 +0200, Hendrik Leppkes wrote:
>> On Wed, Jul 6, 2016 at 4:37 AM, Dan Parrot <dan.parrot at mail.com> wrote:
>>> Finish providing SIMD versions for POWER8 VSX of functions in libswscale/input.c. That should allow trac ticket #5570 to be closed.
>>> The speedups obtained for the functions are:
>>>
>>> abgrToA_c               1.19
>>> bgr24ToUV_c             1.23
>>> bgr24ToUV_half_c        1.37
>>> bgr24ToY_c_vsx          1.43
>>> nv12ToUV_c              1.05
>>> nv21ToUV_c              1.06
>>> planar_rgb_to_uv        1.25
>>> planar_rgb_to_y         1.26
>>> rgb24ToUV_c             1.11
>>> rgb24ToUV_half_c        1.10
>>> rgb24ToY_c              0.92
>>> rgbaToA_c               0.88
>>> uyvyToUV_c              1.05
>>> uyvyToY_c               1.15
>>> yuy2ToUV_c              1.07
>>> yuy2ToY_c               1.17
>>> yvy2ToUV_c              1.05
>>
>> SIMD implementations that in the best case improve the speed by 43%
>> (and in some cases are *slower*) seem barely worth it. One would expect
>> a proper SIMD implementation to offer 100% or higher increases, at
>> least that's the general expectation on x86 with SSE/AVX.
> It sounds like you have either forgotten or never learned a very basic
> principle of computer architecture. I recommend the text by Patterson
> and Hennessy. The principle is Amdahl's Law. Before you start throwing
> numbers around, make sure you understand what was being parallelized.

Right, it's being parallelised to 4x or 8x SIMD in vector registers.  If the
computation is processor-bound, we would expect a gain close to that, less some
overhead for rearranging the data.  In practice that generally means 2x or
more, depending on the exact case.  The bits of code I looked at appear to
implement that cleanly, so I am surprised that the improvement does not match
what we expect.
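
To put a number on that gap, Amdahl's Law can be inverted: given an observed
speedup s and a vector width n, it yields the fraction p of run time that would
have to be vectorised.  The sketch below is illustrative only; the function
names are hypothetical, and it assumes each measured figure covers a single
function whose vector part speeds up by exactly n.

```c
/* Amdahl's Law: overall speedup when a fraction p of the run time is
 * accelerated by a factor n and the remaining (1 - p) is unchanged. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Inverse of the above: the fraction p implied by an observed overall
 * speedup s at vector width n (n must be greater than 1). */
static double implied_fraction(double s, double n)
{
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

With the best figure from the table, implied_fraction(1.43, 8) is about 0.34
and implied_fraction(1.43, 4) is about 0.40, which would mean well under half
of the measured time is actually being accelerated.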

Some possible explanations for this:
* There is some problem with the code.  It looks clean and sensible to me,
though I am not an expert on PPC.  Maybe there is some large state-transition
hazard which I am not aware of, similar to problems with mixing AVX and SSE on
Intel?
* The computation is actually bounded by something other than the CPU.  For
example, memory - that seems unlikely, but I guess it's possible.  Can we
measure that to rule it out?
* The compiler is doing a bad job with the intrinsics.  Do we believe that any
other compiler would do better, and can we test with that one instead?  If so,
perhaps we could whitelist compilers known to do sensible things with these
intrinsics and keep the vanilla C implementation on others.
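
One way to test the memory-bound hypothesis from the list above is to time the
same kernel on a working set that fits in cache and on one that does not.  The
following is a minimal sketch, not code from the patch: rgb24_to_y_ref is a
hypothetical stand-in for a libswscale input function (the BT.601-style weights
are an assumption), and pixels_per_second is an illustrative helper built on
the standard clock().

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical scalar reference kernel (not the libswscale one):
 * packed RGB24 to 8-bit limited-range luma with integer weights. */
static void rgb24_to_y_ref(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++) {
        int r = src[3 * i], g = src[3 * i + 1], b = src[3 * i + 2];
        dst[i] = (uint8_t)((66 * r + 129 * g + 25 * b + 128 + (16 << 8)) >> 8);
    }
}

/* Run the kernel `reps` times over one line of `width` pixels and return
 * the rate in pixels per second of CPU time.  Returns -1.0 if allocation
 * fails and 0.0 if the run was too short for clock() to resolve. */
static double pixels_per_second(int width, int reps)
{
    size_t n = (size_t)width;
    uint8_t *src = malloc(3 * n);
    uint8_t *dst = malloc(n);
    volatile unsigned sink = 0;   /* keeps the results observable */
    clock_t t0, t1;
    double secs;

    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (size_t i = 0; i < 3 * n; i++)
        src[i] = (uint8_t)(i * 7);

    t0 = clock();
    for (int r = 0; r < reps; r++) {
        src[0] = (uint8_t)r;      /* vary the input so passes are not hoisted */
        rgb24_to_y_ref(dst, src, width);
        sink += dst[0];
    }
    t1 = clock();

    free(src);
    free(dst);
    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)n * reps / secs : 0.0;
}
```

Comparing, say, pixels_per_second(1024, 100000) against
pixels_per_second(1 << 24, 6) processes a similar number of pixels with a
roughly 4 KiB and a roughly 64 MiB working set.  Similar rates would suggest
the kernel is compute-bound; a much lower rate on the large buffer would point
at memory.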

I imagine there are more possibilities.  What makes me want to understand what
is going on is that the results are not as expected and include some
regressions, not any belief that the code itself necessarily contains issues.

>> So the question here is - is that VSX being bad, or the intrinsics
>> being bad? How would the speedup be in proper hand-written ASM?
>> If hand-written ASM can give us the usual 100-200% improvements we would
>> expect from SIMD, then this is what should generally be favored.
> I am not going to write assembly just so you get a nice fuzzy feeling. If that's a deal-breaker, so be it.

Assembly is not directly relevant, though if it would be possible to write some
for one of the important cases (rgb24ToY, say) then it could form a useful
comparison to determine what is going on with the intrinsic implementation.

>> Also, one further thought:
>> From the commit message, it sounds like you might only be doing this
>> for the bounty in #5570, do you plan to maintain these optimizations
>> in the future?
> 
> Unless you are a mind reader, STFU about my motivation in writing code.

Your direct motivation is not relevant, but whether you intend to maintain this
code in future is.  A sensible first implementation with little real gain (as
this appears to be, from what I can tell so far) is, I think, reasonable to
accept if further work to improve it is intended.  If we do not anticipate
that, then orphaned code fragments with no clear benefit are unlikely to be a
useful thing to have in the tree.


