[FFmpeg-devel] [PATCH] SSE2 and SSSE3 versions of h264 biweight prediction code (biweight_h264_pixels_tab)

Tue Aug 3 20:27:20 CEST 2010

On Fri, Jul 30, 2010 at 7:47 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Thu, Jul 29, 2010 at 11:15 AM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> On Thu, Jul 29, 2010 at 9:23 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>> Hi,
>>>
>>> On Thu, Jul 29, 2010 at 12:32 AM, Eli Friedman <eli.friedman at gmail.com> wrote:
>>>> Patch attached.  Loosely based off of the MMX2 version.  Around 1%
>>>> faster overall on a test file on my Mobile Core i5.
>>> [..]
>>>> +cglobal h264_biweight_8x8_ssse3, 7, 7, 8
>>>> +    BIWEIGHT_SSSE3_SETUP
>>>> +    mov        r3, 4
>>>> +
>>>> +.nextrow
>>>> +    BIWEIGHT_SSSE3_OP r2
>>>> +    movh       [r0], m0
>>>> +    movhps     [r0+r2], m0
>>>> +    lea        r0, [r0+r2*2]
>>>> +    lea        r1, [r1+r2*2]
>>>> +    dec        r3
>>>> +    jnz .nextrow
>>>> +    REP_RET
>>>
>>> You have several unused r%d regs here, maybe you want to use lea r4,
>>> [r2*2] and then use add r0/r1, r4 instead of lea, that should result
>>> in slightly smaller code. Same for h264_biweight_8x8_sse2.
>>
>> Will do.
>
> Done in attached.
>
>>>> +%macro BIWEIGHT_SSSE3_OP 1
>>>> +    movh       m0, [r0]
>>>> +    movh       m1, [r1]
>>>> +    movh       m2, [r0+%1]
>>>> +    movh       m3, [r1+%1]
>>>> +    punpcklbw  m0, m1
>>>> +    punpcklbw  m2, m3
>>>
>>> If you don't use m1/m3 afterwards, you can IIRC just punpcklbw m0,
>>> [r0+%1] and same for the line below.
>>
>> I don't have appropriate alignment for the 8x8 case, but I suppose I
>> can do it in the 16x16 case.
>
> Done in attached; saves one instruction per 16 pixels in the 16x16
> case.  (I could fiddle with it to remove another load, but I doubt the
> speed would be significantly different.)

Ping.

-Eli