[Ffmpeg-devel] MMX/MMX2 and SSE optimizations for H.264 decoding

Martin Boehme boehme
Fri Sep 23 12:16:44 CEST 2005

Loren Merritt wrote:
> On Mon, 19 Sep 2005, Martin Boehme wrote:
>> Michael Niedermayer wrote:
>>> On Fri, Sep 16, 2005 at 02:20:42PM +0200, Martin Boehme wrote:
>>>> Loren Merritt wrote:
>>>>> On AMD, most SSE2 instructions take exactly twice as long as the 
>>>>> equivalent MMX instruction. Any speedups are due only to scheduling.
>>>>> In x264, we have a bunch of SSE2 functions, but most of them are 
>>>>> _slower_ than the MMX versions on AMD.
>>>> Interesting -- wasn't aware of that. I would assume that the AMD 
>>>> processors only have enough execution units for 64 bits worth of 
>>>> data and have to do SSE operations in two gos?
>>> dunno but
>>> AFAIK the P4 (at least the older ones) have 2 MMX units running at
>>> half the CPU clock speed so they can execute either 1 MMX instruction
>>> per clock or 1 SSE(2) every 2 clocks, with a very small number of
>>> exceptions
>>> further note that execution itself isn't the only thing which can be a
>>> bottleneck ...
>> Interesting, wasn't aware of that... it's probably chip space 
>> considerations that play into that, given that there shouldn't be 
>> any dependencies between the individual "elements" of the 
>> vector units?
> Isn't that the obvious way to do it? If you have the hardware to execute
> 1 SSE2 instruction at a time, why shouldn't it be able to do 2 MMX?
> (assuming no other bottlenecks, of course)

Right... but what I was trying to get at is this: if the P4 had 4 MMX 
units (running at half the CPU clock speed), it would be able to execute 
1 SSE2 instruction per clock... and the reason Intel didn't put 4 MMX 
units on the chip is probably chip space considerations...?
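
To make the arithmetic concrete, here is a rough sketch in C. It is a toy
model only, built on the numbers quoted above (which are themselves
"AFAIK"): it assumes each MMX unit handles 64 bits per unit cycle, that the
units run at half the core clock, that a 128-bit SSE2 operation simply
occupies two 64-bit slots, and that nothing else is a bottleneck. The
function name and parameters are made up for illustration.

```c
/* Toy throughput model, NOT measured hardware behaviour.
 * Assumption: every unit processes 64 bits per unit cycle. */
#define UNIT_BITS 64

/* Instructions completed per core clock for operations of op_bits width.
 * units:            number of 64-bit execution units (hypothetical)
 * unit_clock_ratio: unit clock divided by core clock (0.5 = half speed)
 * op_bits:          64 for MMX, 128 for SSE2 */
double ops_per_core_clock(int units, double unit_clock_ratio, int op_bits)
{
    /* Total bits the units can process in one core clock. */
    double bits_per_clock = units * unit_clock_ratio * UNIT_BITS;
    return bits_per_clock / op_bits;
}
```

Under these assumptions, ops_per_core_clock(2, 0.5, 64) gives 1 MMX
instruction per clock and ops_per_core_clock(2, 0.5, 128) gives 0.5, i.e.
1 SSE2 instruction every 2 clocks. With 4 units, ops_per_core_clock(4, 0.5,
128) comes out at 1 SSE2 instruction per clock.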


Martin Böhme
Inst. f. Neuro- and Bioinformatics
Ratzeburger Allee 160, D-23538 Luebeck
Phone: +49 451 500 5514
Fax:   +49 451 500 5502
boehme at inb.uni-luebeck.de
