[FFmpeg-devel] [PATCH] Optimization of AMR NB and WB decoders for MIPS
vitor1001 at gmail.com
Fri May 25 22:46:11 CEST 2012
On 05/24/2012 06:30 PM, Babic, Nedeljko wrote:
>>> Hi Vitor,
>>>> av_always_inline in a .c file? I don't think it can work...
>>> you are right. I'll change this.
>>>> Here and in several other places: in general, it is a good idea to do
>>>> your loops also in ASM. That way, you can be sure that the compiler
>>>> cannot slow-down a otherwise good code.
>>> Good point. However, compilers are generally doing good job optimizing this
>>> kind of code for MIPS. Anyhow, loops are left in C only in places where
>>> compilers can’t do a lot of damage, or where it would be too costly to do the
>>> loops in ASM (for example because of constraint on number of named registers
>>> that can be used in inline assembly)
>> Would you mind compiling these files with -S and send the ASM output?
>> One useless instruction in a inner loop can be rather costly.
> In attached file acelp_filters_mips.S you can see what compiler generates for loop
> in question.
> Compiler adds only counter increment in the loop: "addiu $11,$11,1".
> Taking in consideration that architecture for which this was optimized has dual
> out-of-order instruction issue and that FPU has separate in-order dual-issue pipeline
> decoupled from integer pipeline this doesn't bring cost in performance.
> On the other hand I have to admit that for older architectures it would be better to
> write entire loop in ASM and counter incrementing would be eliminated.
> So my question is do you want me to rewrite loops in ASM for cases like this?
Indeed, it looks like the compiler do a mostly good job. So I'm fine
with leaving the loops in C.
>>>> If you do such unrolling, please say in the documentation of the
>>>> function (in acelp_filter.h) that n should be a multiple of eight or
>>>> any other thing that it requires. Also, please note that this code is
>>>> not only used for AMR, but also for other vocoders. Can you check if
>>>> the other uses respect the multiple-of-eight constraint?
>>> We checked if unrolling can be done (regarding multiple-of-eight constraint) in all
>>> the places where this function is called. The same goes for all the function where
>>> we done loop unrolling.
>>> Comment will be put in documentation of the function. Should I specify that the
>>> comment is just for MIPS optimizations?
>>>> Hmm, am I missing something or you are doing all this to save a single
>>>> pre-calculation of (1 + lsp[lp_order - 1])?
>>> You are correct. We decided to optimize this function (in respect to our profiling
>>> results)and this was the only meaningful thing to do with our instruction set. We got
>>> some speed up on this function and decided to leave it in the code taking in account
>>> document optimization.txt (section: When is an optimization justified? ).
>> Can you attach the ASM compiler output of the C version and yours? I
>> don't really understand why your is faster. Maybe the compiler is to
>> dump to see that lsp and qa does not alias...
> In attached files lsp_c.S and lsp_mips.S are ASM compiler outputs for
> C version and our version respectively.
> I didn't explained main reason for this optimization correctly. As you can
> see in ASM outputs, compiler doesn't use FP MADD instructions although it can
> benefit from them in this case. That is the main reason for this optimization.
How can does exactly ff_acelp_lspd2lpc() benefit from FP MADD? By saving
a single add in every call of the function? Unless madds are faster than
plain muls, it doesn't looks like a benchmarkable gain. I'd say it has
more to do with allowing the compiler to inline the call to the mips
version of ff_lsp2polyf()...
Can you do a test? Copy the unmodified C version of ff_acelp_lspd2lpc to
mips/amrwb_lsp2lpc.h and see if it is any slower than the madd version.
More information about the ffmpeg-devel