[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Sun Jul 11 20:47:43 CEST 2010

On Sun, 11 Jul 2010, Michael Niedermayer wrote:
> On Sun, Jul 11, 2010 at 04:52:04PM +0000, Loren Merritt wrote:
>> On Sun, 11 Jul 2010, Ronald S. Bultje wrote:
>>
>>> You'll notice that the sse2 is significantly slower here, my rough
>>> guess is that this is because of my shitty CPU which pretty much
>>> emulates xmm-ops through mmx-ops, so it doesn't add a lot of benefit
>>> other than not having to set up the loop for doing the second 8 pixels,
>>> combined with the added complexity of an 8x16 transpose before the
>>> actual filter. I'm betting that on an actual sse2-supporting CPU
>>> (Jason?), this would still be faster, but we might want to put this
>>> under a FF_MM_SSE2_NOT_SHITTY flag or something along those lines. If
>>> you think my code is shitty, comments are welcome also. ;-).
>>
>> Rather than special-casing most of the functions, we at x264 declared that
>> Core1 doesn't have sse2, and changed the cpuid parser accordingly.
>> If you want to support the few cases where sse2 is slightly faster than
>> mmx, I recommend picking a different flag for that and applying it only
>> when you've tested on Core1, so that FF_MM_SSE2 can be trusted to dwim in
>> the usual case.
>>
>> --Loren Merritt
>
>>  cpuid.c |   14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>> 7ba0916766645e2de9330e9ba8f30d815da14c91  cpuid.diff
>
> Do we have any float SSE2 code that this could affect negatively?
> If not, I am OK with this patch.

ff_lpc_compute_autocorr_sse2

--Loren Merritt
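
For readers following the thread, here is a rough sketch of the approach
described above: treat Core1 as not having SSE2 in the cpuid parser, and keep
a separate flag for the few functions where SSE2 still wins. The cpuid.diff
itself is not reproduced in this message, so this is not the patch. The flag
names and values (CPU_FLAG_*), the helper names, and the use of family 6 /
model 14 to identify Core1 are assumptions made for illustration only.

```c
/* Hypothetical flag bits for illustration; not the project's real constants. */
#define CPU_FLAG_MMX      0x01
#define CPU_FLAG_SSE2     0x02
#define CPU_FLAG_SSE2SLOW 0x04  /* SSE2 exists but is usually slower than MMX */

/* Decode the display family and model from EAX of CPUID leaf 1. */
void decode_family_model(unsigned eax, int *family, int *model)
{
    int base_family = (eax >> 8)  & 0xf;
    int base_model  = (eax >> 4)  & 0xf;
    int ext_family  = (eax >> 20) & 0xff;
    int ext_model   = (eax >> 16) & 0xf;

    /* The extended family is only added when the base family is 0xf. */
    *family = base_family + (base_family == 0xf ? ext_family : 0);
    /* The extended model only applies to base families 6 and 0xf. */
    *model  = base_model +
              ((base_family == 6 || base_family == 0xf) ? (ext_model << 4) : 0);
}

/* Assumption: Core1 reports family 6, model 14. On such CPUs, drop the
 * normal SSE2 flag and set the "slow" flag instead, so that ordinary
 * SSE2 checks pick the MMX versions. */
int adjust_flags(int flags, int family, int model)
{
    if (family == 6 && model == 14 && (flags & CPU_FLAG_SSE2)) {
        flags &= ~CPU_FLAG_SSE2;
        flags |= CPU_FLAG_SSE2SLOW;
    }
    return flags;
}

/* Per-function choice: use the SSE2 version if SSE2 is fast, or if this
 * particular function was measured to be faster on a slow-SSE2 CPU. */
int want_sse2(int flags, int tested_faster_on_slow_sse2)
{
    if (flags & CPU_FLAG_SSE2)
        return 1;
    return tested_faster_on_slow_sse2 && (flags & CPU_FLAG_SSE2SLOW);
}
```

With this layout, most init code keeps testing only the plain SSE2 flag and
gets the MMX path on Core1 automatically. A function that has actually been
benchmarked there (the float autocorrelation named above would be a candidate
to check) can opt in by passing a nonzero second argument to want_sse2().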