[FFmpeg-devel] [PATCH] Remove mmx.h from dsputil_mmx.c

Thu Apr 2 22:25:13 CEST 2009

On Wed, Mar 25, 2009 at 8:10 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Mar 25, 2009 at 04:45:16PM -0400, Alex Converse wrote:
>> On Mon, Mar 23, 2009 at 8:48 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Mon, Mar 23, 2009 at 07:38:49PM -0400, Alex Converse wrote:
>> >> 2009/3/14 Michael Niedermayer <michaelni at gmx.at>:
>> >> > On Sat, Mar 14, 2009 at 12:37:54PM -0400, Alex Converse wrote:
>> >> >> On Sat, Mar 14, 2009 at 1:05 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> > On Fri, Mar 13, 2009 at 08:38:09PM -0400, Alex Converse wrote:
>> >> >> >> On Wed, Mar 11, 2009 at 3:44 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> >> > On Wed, Mar 11, 2009 at 03:19:03AM -0400, Alex Converse wrote:
>> >> >> >> >> On Tue, Mar 10, 2009 at 10:04 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> >> >> > On Tue, Mar 10, 2009 at 09:19:56PM -0400, Alex Converse wrote:
>> >> >> >> >> >> This is my first mmx patch so please bear with me.
>> >> >> >> >> >
>> >> >> >> >> >> diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> >> index 2f47f5f..3771c5f 100644
>> >> >> >> >> >> --- a/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> >> +++ b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> >> @@ -28,7 +28,6 @@
>> >> >> >> >> >> ?#include "libavcodec/mpegvideo.h"
>> >> >> >> >> >> ?#include "libavcodec/simple_idct.h"
>> >> >> >> >> >> ?#include "dsputil_mmx.h"
>> >> >> >> >> >> -#include "mmx.h"
>> >> >> >> >> >> ?#include "vp3dsp_mmx.h"
>> >> >> >> >> >> ?#include "vp3dsp_sse2.h"
>> >> >> >> >> >> ?#include "vp6dsp_mmx.h"
>> >> >> >> >> >> @@ -278,16 +277,33 @@ static DECLARE_ALIGNED_8(const unsigned char, vector128[8]) =
>> >> >> >> >> >>
>> >> >> >> >> >> ?void put_signed_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size)
>> >> >> >> >> >> ?{
>> >> >> >> >> >> + ? ?const DCTELEM *p = block;
>> >> >> >> >> >> + ? ?uint8_t *pix = pixels;
>> >> >> >> >> >> ? ? ?int i;
>> >> >> >> >> >>
>> >> >> >> >> >> - ? ?movq_m2r(*vector128, mm1);
>> >> >> >> >> >> - ? ?for (i = 0; i < 8; i++) {
>> >> >> >> >> >> - ? ? ? ?movq_m2r(*(block), mm0);
>> >> >> >> >> >> - ? ? ? ?packsswb_m2r(*(block + 4), mm0);
>> >> >> >> >> >> - ? ? ? ?block += 8;
>> >> >> >> >> >> - ? ? ? ?paddb_r2r(mm1, mm0);
>> >> >> >> >> >> - ? ? ? ?movq_r2m(mm0, *pixels);
>> >> >> >> >> >> - ? ? ? ?pixels += line_size;
>> >> >> >> >> >> + ? ?__asm__ volatile ("movq %0, %%mm0" : : "m" (*vector128));
>> >> >> >> >> >> + ? ?for (i = 0; i < 8; i+=4) {
>> >> >> >> >> >> + ? ? ? ?__asm__ volatile (
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq ? (%1), %%mm1 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 16(%1), %%mm2 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 32(%1), %%mm3 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 48(%1), %%mm4 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb ?8(%1), %%mm1 ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 24(%1), %%mm2 ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 40(%1), %%mm3 ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 56(%1), %%mm4 ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm1 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm2 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm3 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm4 ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm1, (%0) ? ? ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm2, (%0, %2) ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm3, (%0, %2, 2) ? ? ? ?\n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm4, (%0, %3) ? ? ? ? ? \n\t"
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?::"r" (pix), "r" (p), "r"((x86_reg)line_size), "r"((x86_reg)3*line_size)
>> >> >> >> >> >> + ? ? ? ? ? ? ? ?:"memory");
>> >> >> >> >> >> + ? ? ? ?p += 32;
>> >> >> >> >> >> + ? ? ? ?pix += 4*line_size;
>> >> >> >> >> >
>> >> >> >> >> > 1. the for() should not be there, it should be in the asm()
>> >> >> >> >>
>> >> >> >> >> The for loop has been eliminated
>> >> >> >> >>
>> >> >> >> >> > 2. benchmark is always needed when messing with optimized code.
>> >> >> >> >> > ? ?START/STOP_TIMER around the point that calls the changed
>> >> >> >> >> > ? ?function is likely the easiest way to do it
>> >> >> >> >>
>> >> >> >> >> The benchmarks seem inconclusive. The unpatched version seems to start
>> >> >> >> >> out faster and end slower. I don't know what of make of this.
>> >> >> >> >
>> >> >> >> > [...]
>> >> >> >> >
>> >> >> >> >> diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> index 2f47f5f..b9679ad 100644
>> >> >> >> >> --- a/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> +++ b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> @@ -28,7 +28,6 @@
>> >> >> >> >> ?#include "libavcodec/mpegvideo.h"
>> >> >> >> >> ?#include "libavcodec/simple_idct.h"
>> >> >> >> >> ?#include "dsputil_mmx.h"
>> >> >> >> >> -#include "mmx.h"
>> >> >> >> >> ?#include "vp3dsp_mmx.h"
>> >> >> >> >> ?#include "vp3dsp_sse2.h"
>> >> >> >> >> ?#include "vp6dsp_mmx.h"
>> >> >> >> >> @@ -276,19 +275,39 @@ void put_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size
>> >> >> >> >> ?static DECLARE_ALIGNED_8(const unsigned char, vector128[8]) =
>> >> >> >> >> ? ?{ 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
>> >> >> >> >>
>> >> >> >> >> +#define put_signed_pixels_clamped_mmx_half \
>> >> >> >> >> + ? ? ? ? ? ?"movq ? (%1), %%mm1 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq 16(%1), %%mm2 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq 32(%1), %%mm3 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq 48(%1), %%mm4 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"packsswb ?8(%1), %%mm1 ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"packsswb 24(%1), %%mm2 ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"packsswb 40(%1), %%mm3 ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"packsswb 56(%1), %%mm4 ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm1 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm2 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm3 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm4 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq %%mm1, (%0) ? ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq %%mm2, (%0, %2) ? ? ? ? ? ? ? \n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq %%mm3, (%0, %2, 2) ? ? ? ? ? ?\n\t"\
>> >> >> >> >> + ? ? ? ? ? ?"movq %%mm4, (%0, %3) ? ? ? ? ? ? ? \n\t"
>> >> >> >> >> +
>> >> >> >> >> ?void put_signed_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size)
>> >> >> >> >> ?{
>> >> >> >> >> - ? ?int i;
>> >> >> >> >> -
>> >> >> >> >> - ? ?movq_m2r(*vector128, mm1);
>> >> >> >> >> - ? ?for (i = 0; i < 8; i++) {
>> >> >> >> >> - ? ? ? ?movq_m2r(*(block), mm0);
>> >> >> >> >> - ? ? ? ?packsswb_m2r(*(block + 4), mm0);
>> >> >> >> >> - ? ? ? ?block += 8;
>> >> >> >> >> - ? ? ? ?paddb_r2r(mm1, mm0);
>> >> >> >> >> - ? ? ? ?movq_r2m(mm0, *pixels);
>> >> >> >> >> - ? ? ? ?pixels += line_size;
>> >> >> >> >> - ? ?}
>> >> >> >> >> + ? ?const DCTELEM *p = block;
>> >> >> >> >> + ? ?uint8_t *pix = pixels;
>> >> >> >> >> + ? ?x86_reg line_skip = line_size;
>> >> >> >> >> +
>> >> >> >> >> + ? ?__asm__ volatile (
>> >> >> >> >> + ? ? ? ? ? ?"movq (%3), %%mm0 ? ? ? ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ?"lea (%2, %2, 2), %3 ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> >> + ? ? ? ? ? ?put_signed_pixels_clamped_mmx_half
>> >> >> >> >> + ? ? ? ? ? ?"add $64, %1 ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> >> + ? ? ? ? ? ?"lea (%0, %2, 4), %0 ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> >> + ? ? ? ? ? ?put_signed_pixels_clamped_mmx_half
>> >> >> >> >> + ? ? ? ? ? ?::"r" (pix), "r" (p), "r"(line_skip), "r" (vector128)
>> >> >> >> >> + ? ? ? ? ? ?:"memory");
>> >> >> >> >
>> >> >> >> > you dont need the add $64, this can be done by changing the offsets
>> >> >> >> > also %0,%1 and %3 are marked as input, while you change them
>> >> >> >> >
>> >> >> >>
>> >> >> >> Fixed
>> >> >> >>
>> >> >> >> Do the x87 stack or the mm* registers need to be marked as clobbered?
>> >> >> >
>> >> >> > strictly yes but we dont do it because it works and it might be faster
>> >> >> >
>> >> >> >>
>> >> >> >> Here is cycle estimation from Valgrind. I find it easier to read:
>> >> >> >> ? ? ? ? Ir ? ? ? ? Dr ? ? ? ? Dw ? ? I1mr ? ?D1mr ? ? D1mw ?I2mr ?D2mr ? ?D2mw
>> >> >> >> function put_signed_pixels_clamped_mmx:
>> >> >> >> before:
>> >> >> >> 16,088,784 ?6,033,294 ?2,681,464 ?122,988 ?34,505 ?259,448 ? 991 ? 292 ?27,624
>> >> >> >> ? Cycle Estimation = Ir + 10*L1m + 100*L2m = 23,184,894
>> >> >> >> after:
>> >> >> >> 12,736,954 ?6,033,294 ?2,681,464 ? 64,595 ?33,784 ?258,686 ? 617 ? 294 ?27,622
>> >> >> >> ? Cycle Estimation = Ir + 10*L1m + 100*L2m = 19,160,904
>> >> >> >
>> >> >> > nothing against valgrind but i prefer benchmarks done on real hardware
>> >> >> > where its possible because we have more users that run ffmpeg on real
>> >> >> > hw than who run it in emulators
>> >> >> >
>> >> >>
>> >> >> Well is there a guide to interpreting those benchmarks? I can't make
>> >> >> heads or tails of them.
>> >> >
>> >> > You can just write a tiny test program that runs a loop with START/STOP_TIMER
>> >> > 10000 times and have the inner loop do as many iterations as
>> >> > a command line arg says
>> >> > then run it with 100 and 101 and look at the benchmark scores.
>> >> >
>> >> > [...]
>> >> >
>> >>
>> >> I wrote a little timer test program, and when going from 100 to 101
>> >> all the numbers increased like I expected. I still am confused about
>> >> how to interpret the numbers I posted.
>> >
>> > hmm
>> > the value with the largest run number should be a better indicator of what
>> > is faster
>> >
>>
>> Thanks, though this time I got my patched version doing better across
>> all runs for some reason.
>>
>
>> 881 dezicycles in 1st call, 8388543 runs, 65 skips
>> vs.
>> 862 dezicycles in 1st call, 8388524 runs, 84 skips
>
> benchmark looks good
>

benchmark still looks similar

> [...]
>> ?void put_signed_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size)
>> ?{
>> - ? ?int i;
>> -
>> - ? ?movq_m2r(*vector128, mm1);
>> - ? ?for (i = 0; i < 8; i++) {
>> - ? ? ? ?movq_m2r(*(block), mm0);
>> - ? ? ? ?packsswb_m2r(*(block + 4), mm0);
>> - ? ? ? ?block += 8;
>> - ? ? ? ?paddb_r2r(mm1, mm0);
>> - ? ? ? ?movq_r2m(mm0, *pixels);
>> - ? ? ? ?pixels += line_size;
>> - ? ?}
>
>> + ? ?const DCTELEM *p = block;
>
> useless temporary
>

fixed

>
>> + ? ?uint8_t *pix = pixels;
>> + ? ?x86_reg line_skip = line_size;
>> + ? ?const unsigned char* vec = vector128;
>> +
>> + ? ?__asm__ volatile (
>> + ? ? ? ? ? ?"movq (%1), %%mm0 ? ? ? ? ? ? ? ? ? \n\t"
>
> you can use MANGLE() to access vector128 without a register
>

fixed

[...]

Thanks for bearing with me,

Alex Converse
-------------- next part --------------
FFmpeg version git-da55d9c, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --enable-gpl --disable-ffserver
  libavutil     50. 2. 0 / 50. 2. 0
  libavcodec    52.22. 3 / 52.22. 3
  libavformat   52.32. 0 / 52.32. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  libswscale     0. 7. 1 /  0. 7. 1
  built on Apr  2 2009 15:36:07, gcc: 4.3.3
Output #0, null, to 'pipe:':
    Stream #0.0: Video: rawvideo, yuv420p, 1280x720, q=2-31, 200 kb/s, 90k tbn, 59.92 tbc
8640 dezicycles in 1st call, 1 runs, 0 skips
5085 dezicycles in 1st call, 2 runs, 0 skips
3217 dezicycles in 1st call, 4 runs, 0 skips
2272 dezicycles in 1st call, 8 runs, 0 skips
1766 dezicycles in 1st call, 16 runs, 0 skips
1524 dezicycles in 1st call, 32 runs, 0 skips
1440 dezicycles in 1st call, 64 runs, 0 skips
1397 dezicycles in 1st call, 128 runs, 0 skips
1351 dezicycles in 1st call, 256 runs, 0 skips
1343 dezicycles in 1st call, 512 runs, 0 skips
1322 dezicycles in 1st call, 1024 runs, 0 skips
1315 dezicycles in 1st call, 2048 runs, 0 skips
1318 dezicycles in 1st call, 4096 runs, 0 skips
1314 dezicycles in 1st call, 8192 runs, 0 skips
1312 dezicycles in 1st call, 16384 runs, 0 skips
1311 dezicycles in 1st call, 32768 runs, 0 skips
1150 dezicycles in 1st call, 65536 runs, 0 skips
1015 dezicycles in 1st call, 131069 runs, 3 skips
948 dezicycles in 1st call, 262140 runs, 4 skips
914 dezicycles in 1st call, 524279 runs, 9 skips
898 dezicycles in 1st call, 1048565 runs, 11 skips
889 dezicycles in 1st call, 2097129 runs, 23 skips
1260 dezicycles in 2nd call, 1 runs, 0 skips
1260 dezicycles in 2nd call, 2 runs, 0 skips
1327 dezicycles in 2nd call, 4 runs, 0 skips
1181 dezicycles in 2nd call, 8 runs, 0 skips
1102 dezicycles in 2nd call, 16 runs, 0 skips
1057 dezicycles in 2nd call, 32 runs, 0 skips
1032 dezicycles in 2nd call, 64 runs, 0 skips
1016 dezicycles in 2nd call, 128 runs, 0 skips
1003 dezicycles in 2nd call, 256 runs, 0 skips
886 dezicycles in 1st call, 4194260 runs, 44 skips
991 dezicycles in 2nd call, 512 runs, 0 skips
989 dezicycles in 2nd call, 1024 runs, 0 skips
989 dezicycles in 2nd call, 2048 runs, 0 skips
993 dezicycles in 2nd call, 4096 runs, 0 skips
884 dezicycles in 1st call, 8388523 runs, 85 skips
-------------- next part --------------
A non-text attachment was scrubbed...
Name: put_signed_pixels_clamped_mmx.diff
Type: text/x-patch
Size: 2577 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20090402/c9aede1e/attachment.bin>
-------------- next part --------------
FFmpeg version git-da55d9c, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --enable-gpl --disable-ffserver
  libavutil     50. 2. 0 / 50. 2. 0
  libavcodec    52.22. 3 / 52.22. 3
  libavformat   52.32. 0 / 52.32. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  libswscale     0. 7. 1 /  0. 7. 1
  built on Apr  2 2009 15:36:07, gcc: 4.3.3
Output #0, null, to 'pipe:':
    Stream #0.0: Video: rawvideo, yuv420p, 1280x720, q=2-31, 200 kb/s, 90k tbn, 59.92 tbc
6480 dezicycles in 1st call, 1 runs, 0 skips
3915 dezicycles in 1st call, 2 runs, 0 skips
2722 dezicycles in 1st call, 4 runs, 0 skips
2002 dezicycles in 1st call, 8 runs, 0 skips
1659 dezicycles in 1st call, 16 runs, 0 skips
1473 dezicycles in 1st call, 32 runs, 0 skips
1385 dezicycles in 1st call, 64 runs, 0 skips
1334 dezicycles in 1st call, 128 runs, 0 skips
1305 dezicycles in 1st call, 256 runs, 0 skips
1303 dezicycles in 1st call, 512 runs, 0 skips
1290 dezicycles in 1st call, 1024 runs, 0 skips
1283 dezicycles in 1st call, 2048 runs, 0 skips
1286 dezicycles in 1st call, 4096 runs, 0 skips
1286 dezicycles in 1st call, 8192 runs, 0 skips
1163 dezicycles in 1st call, 16383 runs, 1 skips
1010 dezicycles in 1st call, 32766 runs, 2 skips
933 dezicycles in 1st call, 65534 runs, 2 skips
893 dezicycles in 1st call, 131070 runs, 2 skips
874 dezicycles in 1st call, 262142 runs, 2 skips
863 dezicycles in 1st call, 524286 runs, 2 skips
860 dezicycles in 1st call, 1048569 runs, 7 skips
857 dezicycles in 1st call, 2097140 runs, 12 skips
1170 dezicycles in 2nd call, 1 runs, 0 skips
1440 dezicycles in 2nd call, 2 runs, 0 skips
1170 dezicycles in 2nd call, 4 runs, 0 skips
1091 dezicycles in 2nd call, 8 runs, 0 skips
1046 dezicycles in 2nd call, 16 runs, 0 skips
1046 dezicycles in 2nd call, 32 runs, 0 skips
1013 dezicycles in 2nd call, 64 runs, 0 skips
994 dezicycles in 2nd call, 128 runs, 0 skips
973 dezicycles in 2nd call, 256 runs, 0 skips
856 dezicycles in 1st call, 4194275 runs, 29 skips
952 dezicycles in 2nd call, 512 runs, 0 skips
943 dezicycles in 2nd call, 1024 runs, 0 skips
944 dezicycles in 2nd call, 2048 runs, 0 skips
943 dezicycles in 2nd call, 4096 runs, 0 skips
856 dezicycles in 1st call, 8388526 runs, 82 skips