[Ffmpeg-devel] MMX/MMX2 and SSE optimizations for H.264 decoding
Fri Sep 23 12:16:44 CEST 2005
Loren Merritt wrote:
> On Mon, 19 Sep 2005, Martin Boehme wrote:
>> Michael Niedermayer wrote:
>>> On Fri, Sep 16, 2005 at 02:20:42PM +0200, Martin Boehme wrote:
>>>> Loren Merritt wrote:
>>>>> On AMD, most SSE2 instructions take exactly twice as long as the
>>>>> equivalent MMX instruction. Any speedups are due only to scheduling.
>>>>> In x264, we have a bunch of SSE2 functions, but most of them are
>>>>> _slower_ than the MMX versions on AMD.
>>>> Interesting -- wasn't aware of that. I would assume that the AMD
>>>> processors only have enough execution units for 64 bits worth of
>>>> data and have to do SSE operations in two gos?
>>> dunno but
>>> AFAIK the P4 (at least the older ones) have 2 MMX units running at
>>> half the cpu clock speed so they can execute either 1 MMX instruction
>>> per clock or 1 SSE(2) every 2 clocks, with a very small number of
>>> exceptions
>>> further note that execution itself isn't the only thing which can be
>>> a bottleneck ...
>> Interesting, wasn't aware of that... it's probably chip space
>> considerations that play into that, given that there shouldn't be
>> any dependencies between the individual "elements" of the vector
>> units?
> Isn't that the obvious way to do it? If you have the hardware to execute
> 1 SSE2 instruction at a time, why shouldn't it be able to do 2 MMX?
> (assuming no other bottlenecks, of course)
Right... but what I was trying to get at is this: if the P4 had 4 MMX
units (running at half the CPU clock speed), it would be able to execute
1 SSE2 instruction per clock... and the reason Intel didn't put 4 MMX
units on the chip is probably chip space considerations...?
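
To put rough numbers on this, here is a minimal sketch of the arithmetic in
C. It is a toy model, not a description of the real P4 pipeline: it assumes
every unit is 64 bits wide, runs at half the CPU clock, is fully pipelined,
and that nothing else is a bottleneck. The function names are made up for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Toy throughput model; all names here are hypothetical.
 * Assumptions: each unit is 64 bits wide, runs at half the CPU clock,
 * and is fully pipelined; scheduling and other bottlenecks are ignored. */

/* 64-bit (MMX) operations completed per CPU clock. */
static double mmx_per_clock(int units)
{
    assert(units >= 0);
    return units * 0.5;   /* each unit accepts one op every 2 CPU clocks */
}

/* 128-bit (SSE2) operations completed per CPU clock.
 * Each one is assumed to occupy two 64-bit slots. */
static double sse2_per_clock(int units)
{
    return mmx_per_clock(units) / 2.0;
}

/* Print both rates for a given number of units. */
static void print_model(int units)
{
    printf("%d units: %.1f MMX/clock, %.1f SSE2/clock\n",
           units, mmx_per_clock(units), sse2_per_clock(units));
}
```

Under these assumptions, print_model(2) gives the figures quoted above
(1 MMX per clock, or 1 SSE2 every 2 clocks), and print_model(4) gives the
hypothetical case of 1 SSE2 instruction per clock.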
Inst. f. Neuro- and Bioinformatics
Ratzeburger Allee 160, D-23538 Luebeck
Phone: +49 451 500 5514
Fax: +49 451 500 5502
boehme at inb.uni-luebeck.de