[FFmpeg-devel] [PATCH 2/4] avcodec/mips: MSA (MIPS-SIMD-Arch) optimizations for VP9 lpf functions

Shivraj Patil Shivraj.Patil at imgtec.com
Mon Jul 20 16:30:39 CEST 2015


Hi,

On Thu, Jul 9, 2015 at 9:15 AM, <shivraj.patil at imgtec.com<mailto:shivraj.patil at imgtec.com>> wrote:
+    if (__msa_test_bz_v(flat)) {
+        p1_d = __msa_copy_u_d((v2i64) p1_out, 0);
+        p0_d = __msa_copy_u_d((v2i64) p0_out, 0);
+        q0_d = __msa_copy_u_d((v2i64) q0_out, 0);
+        q1_d = __msa_copy_u_d((v2i64) q1_out, 0);
+        SD4(p1_d, p0_d, q0_d, q1_d, (src - 2 * pitch), pitch);
+    } else {

Can you elaborate on what this does? Does it check that none of the pixels in the vector of cols/rows has flat=1, and takes a shortcut if that's true? Of something else? (If I'm right in my assumption, can you please add a comment to that effect?)

Your assumption is correct. I will add appropriate comments.

+static void vp9_lpf_vertical_16_dual_msa(uint8_t *src, int32_t pitch,
+                                         uint8_t *b_limit_ptr,
+                                         uint8_t *limit_ptr,
+                                         uint8_t *thresh_ptr)
+{
+    uint8_t early_exit = 0;
+    uint8_t transposed_input[16 * 24] ALLOC_ALIGNED(ALIGNMENT);
+    uint8_t *filter48 = &transposed_input[16 * 16];
+
+    vp9_transpose_16x16((src - 8), pitch, &transposed_input[0], 16);
+
+    early_exit = vp9_vt_lpf_t4_and_t8_16w((transposed_input + 16 * 8),
+                                          &filter48[0], src, pitch,
+                                          b_limit_ptr, limit_ptr, thresh_ptr);
+
+    if (0 == early_exit) {
+        early_exit = vp9_vt_lpf_t16_16w((transposed_input + 16 * 8), src, pitch,
+                                        &filter48[0]);
+
+        if (0 == early_exit) {
+            vp9_transpose_16x16(transposed_input, 16, (src - 8), pitch);
+        }
+    }
+}

Since no state is shared between t16 and t4/t8, it suggests you're calculating some of the filters twice (since part of the condition of whether to apply the t16 filter is whether to apply the t8 filter), is that true? If so, do you think it's worth modifying this so the check on whether to run t4 or t8 is not re-evaluated in t16?

t4 and t8 are not re-evaluated in t16 filter function “vp9_vt_lpf_t16_16w”.
t4 and t8 filters are applied in function “vp9_vt_lpf_t4_and_t8_16w” and output is stored at location pointed by “filter48”, which is later used in t16 function based on early exit flag.

+void ff_loop_filter_v_84_16_msa(uint8_t *src, ptrdiff_t stride,
+                                int32_t e, int32_t i, int32_t h)
+{
+    uint8_t e1, i1, h1;
+    uint8_t e2, i2, h2;
+
+    e1 = e & 0xff;
+    i1 = i & 0xff;
+    h1 = h & 0xff;
+
+    e2 = e >> 8;
+    i2 = i >> 8;
+    h2 = h >> 8;
+
+    vp9_lpf_horizontal_8_msa(src, stride, &e1, &i1, &h1, 1);
+    vp9_lpf_horizontal_4_msa(src + 8, stride, &e2, &i2, &h2, 1);
+}

So I think you're missing the point of why this exists. The simd code for e.g. 88_16 suggests you're capable of doing 16 pixels at once in a single iteration, right? The idea here is that you can use the fact that t4 is a strict subset of t8 to run them both in the same iteration, with simply a mask at the end to assure that "whether to run t8 or t4" for the t4 half of the pixels is always 0. Look at the x86 simd code for details on how that would work exactly.

Agreed, will incorporate the same.

Shivraj


More information about the ffmpeg-devel mailing list