[Ffmpeg-devel-irc] ffmpeg-devel.log.20140628

Sun Jun 29 02:05:02 CEST 2014

[00:35] <cone-849> ffmpeg.git Michael Niedermayer master:4d1fa38d984c: avcodec/mpeg12dec: Limit progressive_seq reinitilaization to where the resolution changes
[03:02] <BBB> jamrial: new patchset on my github, common resample functions are now quite a bit faster than gcc
[03:03] <BBB> linear not yet updated, so not yet ready to merge, but it shows progress
[04:04] <jamrial> BBB: nice!
[04:05] <jamrial> linear should also be faster considering you optimized the horizontal sum after the inner loop
[04:06] <BBB> right, but I haven't updated them to this new style/methodology yet
[04:07] <BBB> so it's still the old function, which is probably slower than gcc
[04:07] <BBB> (I really wanted to try to get the fullest out of the common one first)
[04:28] <fionag> BBB: https://github.com/rbultje/ffmpeg/commit/1f7d5e5af5ae41e3eb1e4fe423797ba9247736c4 ?
[04:30] <BBB> fionag: yes
[04:31] <fionag> are suggestions/comments okay?
[04:31] <BBB> sure
[04:32] <BBB> don't comment on the linear function yet, I'm about to re-do that
[04:32] <fionag> which one is the linear function?
[04:32] <fionag> oh, the second one, I see.
[04:32] <BBB> (resample_linear_float, line 236 and below in resample.asm)
[04:32] <fionag> darn, there went my comment :P
[04:32] Action: fionag looks only in the upper function then
[04:33] <BBB> sorry :]
[04:33] <BBB> what was the comment? I'm curious
[04:33] Action: fionag was thinking that if line 339's speed matters at all, maybe cache 1 / c->src_incr and multiply by it?
[04:34] <fionag> I'm also not quite sure what line 180 is doing, it feels like it serves no purpose but I must be missing something
[04:36] <BBB> look at addps 3 lines down
[04:36] <BBB> a = b; shuffle b around; b += a
[04:38] <fionag> it's doing a horizontal sum though, right?
[04:38] <fionag> the one on lines 344-348 does that without an extra moss.
[04:38] <fionag> *movss
[04:38] <BBB> it's a horizontal add, basically x=abcd; y=movhlps(x) /* i.e. cd */; x += y /* now, x=a+c, b+d, ... */; y = x; x = shufps(x) /* now x is b+d, ... */; y += x /* y is a+b+c+d */;
[04:38] <fionag> so I was wondering why they needed to be different
[04:38] <BBB> well there's a movss, but it's hidden
[04:38] <BBB> +    shufps           xm1, xm0, xm0, q0001
[04:39] <BBB> is actually movaps xmm1, xmm0; shufps xmm1, xmm0, q0001
[04:39] <fionag> yeah, but that doesn't work on line 180-181 for some reason?
[04:39] <BBB> movaps and movss have similar complexity if it's register-to-register
[04:39] <BBB> it does, I should probably change it
[04:39] <fionag> that function supports AVX, sooo
[04:39] <fionag> other than that I can't think of much of anything, it looks really well done
[04:39] <BBB> yeah it'd help for avx
[04:40] <BBB> sadly I don't have an avx capable machine
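
The horizontal add that BBB describes at 04:38 can be traced lane by lane. The following is only a plain-C emulation for illustration, not the code in resample.asm: the `vec4` type and the `emu_*` helpers are hypothetical stand-ins for an xmm register and for movhlps, addps and shufps with the q0001 selector (assumed here to place lane 1 into lane 0).

```c
#include <assert.h>

/* Four float lanes standing in for one xmm register (illustrative type). */
typedef struct { float lane[4]; } vec4;

/* movhlps dst, src: the upper two lanes of src are copied into the
 * lower two lanes of dst; the upper lanes of dst are left alone. */
static vec4 emu_movhlps(vec4 dst, vec4 src)
{
    dst.lane[0] = src.lane[2];
    dst.lane[1] = src.lane[3];
    return dst;
}

/* addps: lane-wise addition. */
static vec4 emu_addps(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* shufps x, x, q0001: lane 0 receives lane 1, the other lanes receive lane 0.
 * Only lane 0 matters for the horizontal sum. */
static vec4 emu_shufps_q0001(vec4 x)
{
    vec4 r = { { x.lane[1], x.lane[0], x.lane[0], x.lane[0] } };
    return r;
}

/* Horizontal sum of four floats, following the sequence from the chat. */
static float hsum4(const float in[4])
{
    assert(in != 0);
    vec4 x = { { in[0], in[1], in[2], in[3] } };  /* x = a, b, c, d        */
    vec4 y = emu_movhlps(x, x);                   /* y = c, d, c, d        */
    x = emu_addps(x, y);                          /* x = a+c, b+d, ...     */
    y = x;                                        /* the "hidden" copy     */
    x = emu_shufps_q0001(x);                      /* x lane 0 = b+d        */
    y = emu_addps(y, x);                          /* y lane 0 = a+b+c+d    */
    return y.lane[0];
}
```

Note that the summation order is (a+c)+(b+d), so for general inputs the result can differ in the last bits from a left-to-right scalar sum.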
[04:40] <fionag> though out of curiosity, lines 139-146: isn't the "typical" for that to name them varM instead of varD because they're in memory instead of registers?   or am I remembering wrong
[04:40] <BBB> pushed new version
[04:41] <BBB> yeah it is; reason I do that is so I can share the main loop between x86-32 and -64
[04:41] <BBB> it's probably slightly confusing for a casual reader but I figured it made sense
[04:41] <BBB> maybe I should rename it to something that doesn't end in d or q or m, to indicate it can be both
[04:42] <BBB> the d currently indicates that on x86-64, it's a register (and on x86-32, it may be, but probably is not)
[04:43] <fionag> ahhhh, I see
[04:43] <fionag> that makes sense then, maybe just a comment explaining it would be fine
[04:44] <fionag> naming based on x86-64 (the more sane architecture) sounds reasonable
[04:45] <BBB> it's just nice to have more registers ;)
[04:45] <BBB> esp. in a function like this
[04:46] <fionag> for the inner loop in the first function, is there any benefit to unrolling it?   I guess maybe not because it's probably load-bound (?)
[04:46] <fionag> oh!  is it just me or is there room for an fmaddps on line 167 :3
[04:46] <jamrial> there is
[04:48] <jamrial> fma3/4 and xop were part of my plans before BBB scolded me for writing inline avx :P
[04:49] <BBB> fionag: I suppose you could unroll it... I was basically planning to go the other way and do multiple output samples simultaneouls
[04:49] <BBB> *simultaneously
[04:50] <BBB> but I'll do that after the main rewrite is in
[04:58] <BBB> fionag: I'll make a note about the divss -> mulps (and outside loop preparation of that register), gcc does prepare it outside the loop but keeps using divss (I guess it doesn't really have a choice, we did specifically ask for a float div...)
[05:08] <fionag> yeah, it can't do that optimization without -ffast-math (and I'm not even sure it can do it with)
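
The divide-versus-multiply point can be made concrete with a small C sketch. The function names and the idea of scaling an integer index by `src_incr` are assumptions for illustration only, not the actual libswresample code. The two versions are not guaranteed to round identically, which is why a compiler under strict floating-point rules keeps the division.

```c
#include <assert.h>

/* Hypothetical helper: divide every element by src_incr inside the loop. */
static void scale_by_division(float *out, const int *idx, int n, int src_incr)
{
    assert(src_incr > 0);
    for (int i = 0; i < n; i++)
        out[i] = (float)idx[i] / (float)src_incr;
}

/* Hypothetical helper: compute 1/src_incr once outside the loop and
 * multiply by it. One division in total instead of one per element. */
static void scale_by_reciprocal(float *out, const int *idx, int n, int src_incr)
{
    assert(src_incr > 0);
    const float inv = 1.0f / (float)src_incr;  /* cached reciprocal */
    for (int i = 0; i < n; i++)
        out[i] = (float)idx[i] * inv;
}
```

When `src_incr` is a power of two the reciprocal is exact and both functions agree bit for bit; for other values the results may differ by a rounding error in the last place.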
[05:09] <fionag> what do you mean by multiple output samples simultaneously?   oh gosh, is this inner loop calculating a -single- output sample? o_o
[05:17] <michaelni> jamrial, seems the mmxext emu edge code causes valgrind failures
[05:18] <michaelni> Invalid read of size 4 at 0xCC0CC0: ??? (videodsp.asm:438)
[05:20] <michaelni> jamrial, should i revert the commit ? from naive quick look fixing it would kind of undo the SPLATB_LOAD change
[05:22] <jamrial> i assume it's the "movd %1, [%2-3]" part of SPLATB_LOAD, right?
[05:23] <jamrial> anyway, yes, the whole point of this change was to get an ssse3 version, and since it didn't seem to be any faster than the mmxext then we can just undo it all
[05:30] <michaelni> jamrial, valgrind just pointed at the macro not at the instruction, so iam only 99% sure it was the movd 
[05:31] <michaelni> ill double check
[05:32] <jamrial> well, not many candidates considering it's the only instruction reading four bytes in any way
[05:33] <michaelni> yes, confirmed commenting the 2 movd removed the valgrind issue
[05:33] <michaelni> ill revert the commit, thanks
[05:40] <BBB> fionag: oh yeah :)
[05:41] <BBB> fionag: all simd comes from horizontally doing the src, but dest is still scalar
[05:41] <BBB> fionag: I foresee some more speedup if that becomes simd'ed
[05:41] <BBB> anyway
[05:44] <cone-849> ffmpeg.git Michael Niedermayer master:5bca5f87d1a3: Revert "x86/videodsp: add emulated_edge_mc_mmxext"
[05:56] <BBB> so first linear I beat gcc by 2 cycles/sample on x86-64, that's not a lot
[05:56] <BBB> I was hoping for more
[05:57] <BBB> so I'll continue checking over the weekend
[05:57] <BBB> 32bit won't compile because I'm not very familiar with x87 fp emu
[06:10] <BBB> oh my bad, I can't count; 32bit 66(gcc)->52 (yasm), 64bit 58(gcc)->48(yasm)
[06:10] <BBB> llvm was 66 (32bit) or 63 (64bit)
[06:11] <BBB> so that's a pretty big margin
[06:14] <BBB> pushed that to github also, please test on win64, avx and let me know if the speed measurements hold up this time
[06:14] <BBB> they should
[06:14] <BBB> but good to test
[06:14] <jamrial> win64 will probably be slightly slower than unix64 since you're using 15 gprs
[06:15] <jamrial> but still faster than inline nonetheless
[06:15] <BBB> well that's outside the loop
[06:15] <BBB> the loop iterates like 1000x or so, so the outside-loop-stuff is really very insignificant to overall cycle/sample performance
[06:15] <jamrial> true
[06:15] <BBB> (1000x per call)
[06:16] <BBB> but anyway, yes, what I'm looking for is that win64 is significantly faster than inline asm with whatever compiler
[06:16] <BBB> and that it passes fate-swr
[06:16] <BBB> and that avx still works also ;)
[06:28] <jamrial> win64 (avx and not) work here
[06:29] <BBB> \o/
[06:30] <BBB> wanna speed-torture it?
[07:04] <fionag> BBB: my intuition would be that doing it the other way would be better, i.e. use N-wide SIMD to do N samples at once instead of N filter coefficients at once
[07:07] <fionag> so an M-wide filter would have an M-iteration loop that does N input samples, I think?
[07:20] <cone-849> ffmpeg.git Michael Niedermayer master:21bfed5b06f0: avcodec/mpegvideo_enc: reduce space between blocks in emu_edge in encode_mb_internal
[07:20] <cone-849> ffmpeg.git Michael Niedermayer master:504475f38ef0: avcodec/mpegvideo: dont overwrite emu_edge buffer
[07:21] <agentorange> hi so does anyone know why Im getting "missing picture in access unit with size ..."
[07:33] <BtbN> Please don't cross post question
[07:48] <agentorange> haha
[07:48] <agentorange> great answer
[10:23] <kurosu> jamrial / BBB: some times, I feel like we should have an equivalent to emms_c for x86_64
[10:23] <kurosu> for win64, store at the highest level the 10 callee-saved regs
[10:23] <kurosu> restore them when exiting (like emms_c)
[10:24] <kurosu> and never care for the callee-saved matter
[10:25] <kurosu> I haven't checked the disassembly, but seeing how hevc_mc declares fewer regs than needed and doesn't have any issue on win64
[10:26] <kurosu> it seems like this could be avoided if we know the compiler isn't going to use xmm6+ in the surrounding code (eg, no float)
[12:39] <cone-635> ffmpeg.git Vittorio Giovara master:f134b5ec53b4: apichanges: fill in changes for lavu 51.19 and 51.20
[12:39] <cone-635> ffmpeg.git Michael Niedermayer master:e925a82e2478: Merge commit 'f134b5ec53b4cb51cb69bf0c64de87687ea72b12'
[12:44] <michaelni> kurosu, a compiler could use xmm6+ in plain C code if it wanted, there's no guarantee that it doesn't, the same is of course true for mmx
[12:58] <ubitux> fate is a bit yellow
[12:59] <kurosu> michaelni, except it hardly ever does - but I agree this is not guaranteed
[13:00] <michaelni> ubitux, i fixed 2 fate issues before i went to bed
[13:01] <michaelni> some fate clients are still yellow from before that
[13:01] <ubitux> ah, ok
[13:02] <michaelni> kurosu, yes, we should just be careful not to add code that makes assumptions about this and then forget about it, could be a lot of work to debug
[13:03] <kurosu> michaelni, yeah, I didn't intend to actually try it
[13:03] <kurosu> I don't think we want to add this just for win64
[13:04] <kurosu> that was mostly a remark on how asm has to bother with it, when it sometimes actually doesn't matter
[13:47] <cone-635> ffmpeg.git Vittorio Giovara master:39975acc699c: rtpenc_jpeg: check for color_range too
[13:47] <cone-635> ffmpeg.git Michael Niedermayer master:823ea19a74ef: Merge commit '39975acc699c83af0a87a7318c0f41e189142938'
[13:47] <cone-635> ffmpeg.git Michael Niedermayer master:7448afc52b69: APIchanges: fillin some missing data
[13:47] <cone-635> ffmpeg.git Michael Niedermayer master:cc923727226b: doc/APIchanges: lengthening a hash to make it non ambigous
[13:50] <BBB> fionag: I want to do both, basically
[13:51] <BBB> fionag: so right now we do (in sse/avx) 4/8 filter samples per inner loop iteration, and then horizontally add the final sse/avx sum register to get the total sum in scalar for that one value
[13:52] <BBB> fionag: so I'm thinking, why not do that inner loop 4/8x, so that the hadd can be over multiple registers (haddps) and the store is simd also
[13:52] <BBB> fionag: I think it's more efficient that way by probably one or maybe several cycles/sample, which is pretty huge
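
The two loop shapes under discussion can be sketched in scalar C. This is a simplified illustration under stated assumptions, not the swr code: `dot_one_output` mirrors the current approach (one output sample per call, summing across filter taps), while `dot_four_outputs` keeps one accumulator per output sample, which is the layout a SIMD version with a vector store would use. The names, the `LANES` constant, and the per-output `src_pos`/`filter` arrays are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* number of output samples computed together (assumed) */

/* One output sample: dot product of `taps` source samples with one filter. */
static float dot_one_output(const float *src, const float *filter, int taps)
{
    float sum = 0.0f;
    for (int j = 0; j < taps; j++)
        sum += src[j] * filter[j];
    return sum;
}

/* LANES output samples at once: accumulator k belongs to output k, which
 * has its own source offset and its own filter (resampler phase). */
static void dot_four_outputs(float dst[LANES], const float *src,
                             const size_t src_pos[LANES],
                             const float *const filter[LANES], int taps)
{
    float acc[LANES] = { 0.0f };
    assert(taps > 0);
    for (int j = 0; j < taps; j++)          /* loop over filter taps   */
        for (int k = 0; k < LANES; k++)     /* one "lane" per output   */
            acc[k] += src[src_pos[k] + j] * filter[k][j];
    for (int k = 0; k < LANES; k++)         /* stands in for one store */
        dst[k] = acc[k];
}
```

In this scalar form both functions add the products in the same order, so they produce identical results; a real SIMD version that sums partial products across lanes may round differently.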
[14:00] <BBB> also sent patches to ML, I think this is good enough to merge
[14:00] <BBB> we can work on top of this later
[15:41] <cone-635> ffmpeg.git Ronald S. Bultje master:ddb7b4435a84: swr: move dst_size == 0 handling outside DSP function.
[16:11] <cone-635> ffmpeg.git Michael Niedermayer master:a348f4befe23: avfilter/x86/vf_pullup: fix "invalid combination of opcode and operands" with nasm
[16:44] <BBB> weird...
[16:45] <BBB> michaelni: I have no idea what the build issue is, can you get more specific macro line numbers?
[16:56] <michaelni> when copy & paste inlining i get errors in xorps                         m0, m0 ; mulps                         m1, [filterq+min_filter_count_x4q*1] ; addps                         m0, m1 ; xorps                         m0, m0 ; xorps                         m2, m2 ; mulps                         m1, [filter1q+min_filter_count_x4q*1] ; addps                         m2, m3 ; addps                         m0, m1
[16:57] <michaelni> i wonder if theres a easier way to get line numbers from inside macros
[16:58] <BBB> maybe they need to be 3-operand?
[16:58] <BBB> can you try changing each instruction a, b to instruction a, a, b?
[16:59] <BBB> (in that sequence of things that fail)
[16:59] <BBB> weird that it works for me...
[17:00] <michaelni> seems to build with 3op
[17:00] <BBB> ...
[17:01] <BBB> I'll re-submit
[17:04] <kurosu> BBB / michaelni: generally I do it in 2 steps
[17:04] <kurosu> call yasm only to preprocess the asm file, output the result
[17:04] <kurosu> filter out the debug/line number info
[17:04] <kurosu> compile
[17:04] <kurosu> see what the actual issue is
[17:05] <kurosu> could maybe be automated through some PREPROCESS=1 parameter to make
[17:06] <BBB> ok re-sent
[17:06] <BBB> kurosu: that'd be nice yes
[17:08] <michaelni> undoing macros automatically would indeed be very nice, and this is true for C too
[17:08] <BBB> benjamin once said that DEBUG=5 or so as a configure option would help unfold C macros nicely and give good debug info
[17:09] <BBB> like, there was some gcc option for it
[17:09] <BBB> I don't recall exactly
[17:27] <BBB> google is fun, I randomly wonder over this pastebin entry: http://pastebin.com/QZ2pP70F
[17:27] <BBB> so apparently touching someone else's code makes it yours... very interesting legal opinion, I'm mildly shocked
[17:28] <nevcairiel> Should just ignore their twisted world
[17:37] <cone-635> ffmpeg.git Ronald S. Bultje master:faa1471ffcc1: swr: rewrite resample_common/linear_float_sse/avx in yasm.
[17:42] <BBB> yay \o/
[17:56] <iive> "gcc -E  Stop after the preprocessing stage; do not run the compiler proper.  The output is in the form of preprocessed source code, which is sent to the standard output."
[17:57] <iive> I think it used to be -P before, but at some version it was changed.
[20:49] <cone-635> ffmpeg.git Michael Niedermayer master:f9f8491ddf6d: avcodec/cavs: make cavs_chroma_qp non static
[20:49] <cone-635> ffmpeg.git Yao Wang master:546491667755: avcodec/cavs: improve conformance with rm52j reference decoder
[21:30] <cone-635> ffmpeg.git Michael Niedermayer master:c599c211fb71: avformat/mux: fix flush_packets flag with flushing buffers
[21:30] <cone-635> ffmpeg.git Michael Niedermayer master:e6aba1be4cb0: avformat/options_table: Fix flush_packet flag flags
[21:44] <cone-635> ffmpeg.git Martin Storsjö master:7b0c7c9163fe: arm: Detect 32 bit cpu features on ARMv8 when running on a 64 bit kernel
[21:44] <cone-635> ffmpeg.git Michael Niedermayer master:01983e50c024: Merge commit '7b0c7c9163fe3dd0081696befde28617119d2590'
[21:55] <cone-635> ffmpeg.git Michael Niedermayer master:4d3072ada3d5: doc/examples/muxing: remove unused variable
[00:00] --- Sun Jun 29 2014