[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?
Sun Jun 1 20:46:02 CEST 2008
On Sat, 31 May 2008, Michael Niedermayer wrote:
> The question is how many more things one could optimize by not forcing to
> use the same source for 32 and 64bit.
> Switching between register names is one thing but trying to use common code
> where one case has 8 registers and one has 16 just doesn't look like such
> a clear case. Sometimes it's likely better to have common code, but always?
> Now if it's better in yasm to have common code for a specific function, why
> does that also have to be the best way in gcc asm?
If it's better in yasm but not gcc to have common code for a specific
function, then that means yasm successfully avoided code duplication while
gcc's limitations made it impossible or unwieldy, which is a vote for yasm.
> I am really a little curious whether cleanly written yasm code is so much superior
> to cleanly written gcc inline asm code. I certainly am no fan of gcc or
> its asm; it's mainly the extra dependency and the loss of support for many
> platforms that annoy me most about this ...
> TRANSPOSE8 is used at 2 spots ...
> TRANSPOSE8(%%xmm4, %%xmm1, %%xmm7, %%xmm3, %%xmm5, %%xmm0, %%xmm2, %%xmm6, (%1))
> "paddw %4, %%xmm4 \n"
> "movdqa %%xmm4, 0x00(%1) \n"
> "movdqa %%xmm2, 0x40(%1) \n"
> H264_IDCT8_1D_SSE2(%%xmm4, %%xmm0, %%xmm6, %%xmm3, %%xmm2, %%xmm5, %%xmm7, %%xmm1)
> "movdqa %%xmm6, 0x60(%1) \n"
> "movdqa %%xmm7, 0x70(%1) \n"
> These movdqa are not needed on x86-64, and I suspect that by not using "common"
> code their number can be reduced on x86-32; more precisely, the second looks
> like it could be merged with something from TRANSPOSE8.
Agreed. In x264 I have separate x86_32 and x86_64 versions of the 8x8 dct.
But in lavc I just wanted to do as little gcc-asm writing as possible, so
I stopped after writing the minimal x86_32 version which can be
compiled on x86_64 but doesn't make much use of the extra registers.
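The register-count point can be illustrated with a rough sketch. It assumes that transposing 8 rows held in 8 xmm registers needs one scratch value on top of the rows, which would explain the (%1) memory argument passed to TRANSPOSE8 above. The helper and constant names are hypothetical and not taken from lavc or x264.

```c
/* Hypothetical helper, not from lavc or x264: how many values must be
 * spilled to memory when `live` values are needed at once but only
 * `regs` registers exist. */
static int values_spilled(int live, int regs)
{
    return live > regs ? live - regs : 0;
}

/* Assumption: an 8-row transpose needs one scratch value in addition
 * to the 8 rows themselves. */
enum { TRANSPOSE8_LIVE = 8 + 1 };

/* xmm register counts: 8 in 32-bit mode, 16 in 64-bit mode. */
enum { XMM_REGS_X86_32 = 8, XMM_REGS_X86_64 = 16 };
```

Under that assumption, values_spilled(TRANSPOSE8_LIVE, XMM_REGS_X86_32) is nonzero while the x86_64 case is zero, which is why a dedicated 64-bit version could drop the memory temporaries.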
> Also the question of readability has been ignored entirely: is all the
> preprocessor magic, be it yasm or C, really that good?
> You use a lot of preprocessor tricks in your gcc-asm; I just thought it
> might be more flexible and readable with a little less.
> After all the code would be the same after the preprocessor anyway.
What is your alternative? Write code using preprocessor tricks but then
manually expand them before committing? Anything that reduces code
duplication is a win in terms of ease of writing (no matter how much
magic is involved), but I can understand optimizing for reading at the
expense of writing if you're reasonably sure that the function will
never change again.
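To make the trade-off concrete, here is a minimal sketch of the kind of preprocessor trick under discussion. The macro is illustrative only, written in the style of gcc inline asm helpers rather than copied from lavc. It only builds the instruction string, so it compiles on any platform, and it shows that the macro form and the hand-expanded form are identical after preprocessing.

```c
#include <string.h>

/* Illustrative macro (an assumption, not lavc's actual code): it pastes
 * register names into a three-instruction sum/difference sequence. */
#define SUMSUB_BA(a, b) \
    "paddw " #b ", " #a " \n\t" \
    "paddw " #b ", " #b " \n\t" \
    "psubw " #a ", " #b " \n\t"

/* The same three instructions expanded by hand. */
static const char expanded_by_hand[] =
    "paddw %%mm1, %%mm0 \n\t"
    "paddw %%mm1, %%mm1 \n\t"
    "psubw %%mm0, %%mm1 \n\t";

/* The same sequence produced by the macro. */
static const char expanded_by_macro[] = SUMSUB_BA(%%mm0, %%mm1);

/* Returns 1 if both spellings yield the same asm string. */
static int same_after_preprocessing(void)
{
    return strcmp(expanded_by_hand, expanded_by_macro) == 0;
}
```

The macro version states the register pattern once. The hand-expanded version has to be retyped, and kept consistent, at every use site.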
> And last, ultra-finetuned common 64-32 code has another problem. That is
> when one wants to change/optimize the code but she does not have both a 32 and
> a 64 bit cpu. It could easily lead to a speed loss or considerably more
> work waiting for others to do the benchmarking.
Essentially all asm I've written in the past 3 years was optimized for 64
and for 64-in-32bit-mode, not for any 32bit cpu, so I guess that doesn't
count as ultra finetuned. If you optimize for a specific old cpu and have
reason to believe your change hurts new cpus, then that's another split,
not just 32-64. If you don't have specific reason but just don't have any
64bit cpus to test on, then you not only have code duplication but
non-identical duplication without even being sure that the differences are
If every difference between two near-duplicate functions is documented as
to which cpus it's been tested on and the results thereof (what's the
chance of that?), then my argument on this point is reduced.
> So in the end, IMHO, maybe less preprocessor-based asm code factorization
> would be a better solution than yasm. Just my 2 cents; I am not opposing yasm
> if people really want it ...
Better? It's a solution to a different problem. I'm asking for yasm so I
can do more preprocessor stuff.
Well, syntax is another reason. I'd prefer
    pshufw mm0, [eax+ecx*4+16], 0
over
    "pshufw $0, 16(%%eax,%%ecx,4), %%mm0 \n\t"\
even if that were the only difference.
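For readers less used to one of the two notations: both lines name the same memory operand, base + index*scale + displacement. The Intel-style form puts the destination first and the immediate last, while the AT&T-style form puts the immediate first and the destination last. A small sketch of the address arithmetic, with illustrative function and parameter names:

```c
#include <stdint.h>

/* Address named by [base+index*scale+disp] (Intel style) or
 * disp(base,index,scale) (AT&T style). Names are illustrative. */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}
```

For the operand above, the call would be effective_address(eax, ecx, 4, 16).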