[FFmpeg-devel] transcoding on nvidia tesla
Sun Feb 10 23:12:23 CET 2008
Ian Caulfield wrote:
>On Feb 1, 2008 12:05 AM, Måns Rullgård <mans at mansr.com> wrote:
>>Is there any reason to believe that each of these threads has the
>>power of a full CPU at its disposal?
>Nowhere near - they're more like a funky SIMD - the threads are
>grouped into 16s, which follow the same execution path - if threads
>diverge, they have to be serialised. However, with careful
>programming, very good memory bandwidth can be achieved. Memory
>bandwidth on/off chip can be an issue though. I don't see 1000x
>speedups for video coding - 10x seems more likely, at the cost of a
>lot of development time.
Having done some GPU dev, I can say there are some good and some
very bad things to do...
Easy ones, -huge- performance increase:
rescaling with various algorithms, colour-space conversions, basic deblocking.
More tricky, probably faster by a factor of 10 but needing quite some
optimisation and dev time:
(i)motion compensation, (i)DCT, wavelets...
Useless, same speed or 10x slower (because conditional branching
cannot be avoided):
byte stream parsing, sorting...
Total loss of time and 100x slower on GPU (the GPU probably has to
emulate all the required bit operations, and the data impose a serial
ordering, so no parallelisation is possible):
bit stream parsing...
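To illustrate the "easy" category: a colour-space conversion touches each
pixel independently, so every iteration of the loop below could be one GPU
thread. This is a toy Python sketch, not actual GPU code; the luma weights
are the standard BT.601 coefficients.

```python
def rgb_to_luma(pixels):
    # Each output value depends only on its own input pixel:
    # no loop-carried state, so all iterations can run in parallel,
    # one GPU thread per pixel (embarrassingly parallel).
    return [int(0.299 * r + 0.587 * g + 0.114 * b)
            for (r, g, b) in pixels]

print(rgb_to_luma([(255, 0, 0), (0, 255, 0), (0, 0, 255)]))
```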
Using operations that need branching is very hard to make fast.
In most cases it is faster to process both branches and make a
conditional assignment afterwards (where possible).
It is not that GPUs are that bad at branching (threads are grouped
2x8, so a divergent branch "only" serialises 8 threads),
but more that CPUs have become extremely good at it.
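A sketch of that "process both branches, select later" idea, using 0/1
masks instead of an if/else. Plain Python for illustration; on a GPU the
compiler does the same thing via predication.

```python
def branchless_clamp(x, lo=0, hi=255):
    # Compute 0/1 masks from the comparisons, then blend the three
    # candidate results arithmetically -- no data-dependent branch,
    # so no thread divergence when every thread runs this same path.
    above = int(x > hi)
    below = int(x < lo)
    inside = 1 - above - below
    return above * hi + below * lo + inside * x
```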
CUDA has much better memory transfer performance than DirectX /
OpenGL; examples show 3 Gbytes/sec (up and down), but it vastly depends on
the hardware. Anyhow, it is still a memory copy; if you need to do it
often it will hurt performance.
Memory bandwidth on the card is also huge (the bus is 384 bits wide if I
remember correctly), so any massive memory operation gets an excellent speedup.
Probably an advantage for HD processing.
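For a rough sense of scale, assuming the GeForce 8800 GTX figures (384-bit
bus, 900 MHz GDDR3 memory clock, double data rate) -- numbers from memory,
not from this thread:

```python
bus_bits = 384        # memory bus width
clock_hz = 900e6      # GDDR3 memory clock (assumed 8800 GTX figure)
ddr = 2               # two transfers per clock cycle
# bytes per transfer * transfers per second -> theoretical peak
bandwidth_gb = bus_bits / 8 * clock_hz * ddr / 1e9
print(bandwidth_gb)   # ~86.4 GB/s
```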
You also don't need a Tesla to code with CUDA; any GeForce 8800 will
probably do (it needs different load dispatching between threads).
Someone tried to port the wavelet part of JPEG 2000 encoding/decoding
to the GPU (though not using CUDA).
Even with the latest card, there is no significant performance gain from
using the GPU. I don't know the exact reasons, but I know that
bitstream parsing and arithmetic coding represent an important part of
the process and cannot be ported to the GPU easily.
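The serial nature of bitstream parsing is easy to see in a toy
variable-length-code parser (hypothetical codebook, Python for
illustration): the position of symbol i+1 is only known once symbol i has
been decoded, so the loop cannot be split across threads.

```python
def parse_vlc(bits, codebook):
    # codebook maps prefix-free codewords (bit strings) to symbols.
    # Iteration i cannot start until iteration i-1 has consumed its
    # codeword: the read position is a loop-carried dependency, which
    # is why this kind of parsing resists GPU parallelisation.
    out, pos = [], 0
    while pos < len(bits):
        for code, symbol in codebook.items():
            if bits.startswith(code, pos):
                out.append(symbol)
                pos += len(code)
                break
        else:
            raise ValueError("invalid bitstream")
    return out
```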
I believe the real future of video codecs lies more in proprietary
chips dedicated to video processing (unfortunately the ones that are
not standardized and public).