[FFmpeg-devel] [PATCH 4/7] x86: sbrdsp: implement SSE hf_apply_noise

Michael Niedermayer michaelni at gmx.at
Sat Apr 6 22:32:36 CEST 2013


On Sat, Apr 06, 2013 at 11:50:26AM -0700, Jason Garrett-Glaser wrote:
> On Sat, Apr 6, 2013 at 6:44 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Sat, Apr 06, 2013 at 10:52:11AM +0000, Christophe Gisquet wrote:
> >> 233 to 115(sse)/110(sse2) cycles on Arrandale and Win64.
> >> Replacing the multiplication by s_m[m] by an andps and an xorps with
> >> appropriate vectors is slower. Unrolling is a 15 cycles win.
> >> ---
> >>  libavcodec/x86/sbrdsp.asm    | 145 +++++++++++++++++++++++++++++++++++++++++++
> >>  libavcodec/x86/sbrdsp_init.c |  32 ++++++++++
> >>  2 files changed, 177 insertions(+)
> >>
> >> diff --git a/libavcodec/x86/sbrdsp.asm b/libavcodec/x86/sbrdsp.asm
> >> index 65c972e..a7998fa 100644
> >> --- a/libavcodec/x86/sbrdsp.asm
> >> +++ b/libavcodec/x86/sbrdsp.asm
> >> @@ -26,6 +26,12 @@ SECTION_RODATA
> >>  ps_mask         times 2 dd 1<<31, 0
> >>  ps_mask2        times 2 dd 0, 1<<31
> >>  ps_neg          times 4 dd 1<<31
> >> +ps_noise0       times 2 dd  1.0,  0.0,
> >> +ps_noise2       times 2 dd -1.0,  0.0
> >> +ps_noise13      dd  0.0,  1.0, 0.0, -1.0
> >> +                dd  0.0, -1.0, 0.0,  1.0
> >> +                dd  0.0,  1.0, 0.0, -1.0
> >> +cextern         sbr_noise_table
> >>
> >>  SECTION_TEXT
> >>
> >
> >> @@ -358,3 +364,142 @@ SBR_QMF_DEINT_BFLY
> >>
> >>  INIT_XMM sse2
> >>  SBR_QMF_DEINT_BFLY
> >> +
> >> +%if WIN64
> >> +%define NREGS 0
> >> +%else
> >
> >> +%ifndef PIC
> >
> > ifdef
> >
> >
> > [...]
> >> +%endif
> >> +    mulps      m1, m3 ; m1 = q_filt[m] * ff_sbr_noise_table[noise]
> >> +    mulps      m2, m4 ; m2 = q_filt[m] * ff_sbr_noise_table[noise]
> >> +    mova       m3, [s_mq + count]
> >> +    ; TODO: replace by a vpermd in AVX2
> >
> >> +%if cpuflag(sse2)
> >> +    punpckhdq  m4, m3, m3
> >> +    punpckldq  m3, m3, m3
> >> +%else
> >> +    unpckhps   m4, m3, m3
> >> +    unpcklps   m3, m3, m3
> >> +%endif
> >
> > It might make sense to add a macro to some shared header so that
> > punpck{l,h}dq gets turned into unpck{l,h}ps on SSE1.
> 
> Maybe modify SBUTTERFLY to do that if SSE1 is on?  SBUTTERFLY is
> basically this macro, I think.

patch below:

From f388deba861f9f081538ddd6f5ec515c05c30ea1 Mon Sep 17 00:00:00 2001
From: Michael Niedermayer <michaelni at gmx.at>
Date: Sat, 6 Apr 2013 22:28:20 +0200
Subject: [PATCH] avutil/x86util: Support SBUTTERFLY dq with SSE1

Idea-by: Jason Garrett-Glaser <darkshikari at gmail.com>
Signed-off-by: Michael Niedermayer <michaelni at gmx.at>
---
 libavutil/x86/x86util.asm |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libavutil/x86/x86util.asm b/libavutil/x86/x86util.asm
index 8908444..33737ea 100644
--- a/libavutil/x86/x86util.asm
+++ b/libavutil/x86/x86util.asm
@@ -30,7 +30,15 @@
 %include "libavutil/x86/x86inc.asm"

 %macro SBUTTERFLY 4
-%if avx_enabled == 0
+%if cpuflag(sse2) == 0 && mmsize == 16
+%ifidn %1, dq
+    mova      m%4, m%2
+    unpcklps  m%2, m%3
+    unpckhps  m%4, m%3
+%else
+%error only dq unpack is supported by SBUTTERFLY on SSE1
+%endif
+%elif avx_enabled == 0
     mova      m%4, m%2
     punpckl%1 m%2, m%3
     punpckh%1 m%4, m%3
--
1.7.9.5
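
To illustrate why the SSE1 branch can stand in for the integer unpack: on
32-bit lanes, unpcklps/unpckhps move exactly the same data as
punpckldq/punpckhdq, since both only interleave lanes without interpreting
them. Below is a rough lane-level model in Python, not the macro itself. The
helper names (unpack_lo32, unpack_hi32, sbutterfly_sse1) and the dict-of-lists
register file are made up for this sketch, and it only models the three
instructions visible in the hunk above, nothing that follows them in the macro.

```python
def unpack_lo32(a, b):
    """Interleave the two low 32-bit lanes of a and b.
    Models unpcklps a, b (and equally punpckldq a, b)."""
    return [a[0], b[0], a[1], b[1]]


def unpack_hi32(a, b):
    """Interleave the two high 32-bit lanes of a and b.
    Models unpckhps a, b (and equally punpckhdq a, b)."""
    return [a[2], b[2], a[3], b[3]]


def sbutterfly_sse1(regs, a, b, t):
    """Hypothetical model of the SSE1 branch: regs maps a register index
    to a list of four 32-bit lanes; t is the scratch register."""
    regs[t] = list(regs[a])                    # mova     m%4, m%2
    regs[a] = unpack_lo32(regs[a], regs[b])    # unpcklps m%2, m%3
    regs[t] = unpack_hi32(regs[t], regs[b])    # unpckhps m%4, m%3


if __name__ == "__main__":
    regs = {0: [10, 11, 12, 13], 1: [20, 21, 22, 23], 2: [0, 0, 0, 0]}
    sbutterfly_sse1(regs, 0, 1, 2)
    print(regs[0])  # low halves interleaved
    print(regs[2])  # high halves interleaved

    # The hf_apply_noise case unpacks a register with itself, which
    # duplicates each lane.
    s = [1, 2, 3, 4]
    print(unpack_lo32(s, s), unpack_hi32(s, s))
```

With both source operands equal, as in the quoted hf_apply_noise code, the low
unpack duplicates lanes 0 and 1 and the high unpack duplicates lanes 2 and 3.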

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact