[FFmpeg-devel] [flamefest-start] A little something on MMX/SSE intrinsics
Sat Mar 1 13:38:22 CET 2008
Ivan Kalvachev wrote:
> How are PPCs so scheduling-sensitive?
In the specific case of G3 vs G4 vs G5 vs CELL, the main issue is that
the people deciding how to spend silicon area and power had quite
radically different ideas about what to do. So the G3 ALU has certain
features and optional stuff that aren't as fast on the G4, there are
similar differences between the G4 and G5 Altivec implementations, and
CELL is a world apart, since you have to consider branch hints: they
preferred to have another hardware thread instead of putting in a
complex branch predictor...
(they cut other corners as well; the idea isn't bad, provided you and
your compiler are aware of them)
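To make the branch hint point a bit more concrete, here is a minimal
sketch of giving the compiler a static hint from C. It assumes gcc (or a
compatible compiler) and its __builtin_expect builtin; whether and how
that turns into a hardware hint on a given PPC or CELL target is up to
the compiler. The function name and the 0xFF escape marker are made up
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* gcc builtin: tells the compiler which way the branch usually goes. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sum the bytes of a buffer, skipping a rare
 * 0xFF escape marker. The hint marks the skip path as the cold one. */
unsigned sum_skip_escapes(const uint8_t *buf, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(buf[i] == 0xFF))
            continue;           /* rare path */
        sum += buf[i];          /* hot path */
    }
    return sum;
}
```

The hint only changes how the code is laid out and predicted, not what
it computes, so a wrong hint costs speed but never correctness.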
> Usually you write instructions with as much parallelism as possible
> and the CPU is expected to execute as much instructions as it can.
That is fine; the problem is which instructions. Right now the best way
to have sane code overall is to write branchless SIMD, use the cache
hints but forget the stream cache hints (they work only on G4), and
keep in mind how deep the pipeline is, the load/store delay, and other
interesting details that gcc should already know and should use (e.g. to
reorder or substitute instructions appropriately, to generate constants
with immediate instructions instead of loading the values, which can be
faster, and to account for how Altivec interacts with the scalar ALU).
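As an illustration of what "branchless" means here, below is a small
sketch in plain scalar C so it builds anywhere: a per-element max done
with a compare that produces a mask, followed by a bitwise select. It is
the same compare-then-select pattern that Altivec exposes as vec_cmpgt
and vec_sel. The function name is hypothetical, and the final conversion
back to int32_t assumes the usual two's complement behaviour (which gcc
provides).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dst[i] = max(a[i], b[i]) with no branch in
 * the loop body. */
void max_branchless(int32_t *dst, const int32_t *a, const int32_t *b,
                    size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* All ones when a[i] > b[i], all zeros otherwise. */
        uint32_t mask = -(uint32_t)(a[i] > b[i]);
        /* Take the bits of a[i] where mask is set, else b[i]. */
        dst[i] = (int32_t)(((uint32_t)a[i] & mask) |
                           ((uint32_t)b[i] & ~mask));
    }
}
```

Both inputs are always read and the select always runs, so the cost per
element does not depend on the data.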
G4 has quite limited memory bandwidth but has ways to make
the DMA engine behave (stream hints); G5 has better access to memory but
you lose the stream hints; CELL has even better memory management,
BUT you pay higher penalties for missed branches, plus other peculiarities.
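For the plain cache hints (not the G4-only stream hints), a portable way
to sketch the idea is gcc's __builtin_prefetch, which the compiler turns
into whatever prefetch instruction the target offers. The line size and
prefetch distance below are made-up numbers, not measured values; they
differ between G4, G5 and CELL and would need tuning on each.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     64   /* hypothetical cache line size */
#define PREFETCH_AHEAD 256  /* hypothetical prefetch distance in bytes */

/* Hypothetical example: sum a buffer one cache line at a time, hinting
 * the line PREFETCH_AHEAD bytes further on before working on this one. */
uint32_t sum_with_prefetch(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += LINE_BYTES) {
        /* Only hint addresses that are still inside the buffer. */
        if (n - i > PREFETCH_AHEAD)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
        size_t end = (n - i < LINE_BYTES) ? n : i + LINE_BYTES;
        for (size_t j = i; j < end; j++)
            sum += src[j];
    }
    return sum;
}
```

The second argument (0) says the data will only be read, the third (0)
says it is not expected to be reused soon, which is the typical case
when streaming through a frame.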
> I just want the summary, not reading 5-6 optimization manuals.
I hope I've given you an idea.
Gentoo Council Member