[Ffmpeg-devel-irc] ffmpeg-devel.log.20170715

burek burek021 at gmail.com
Sun Jul 16 03:05:04 EEST 2017


[00:01:08 CEST] <Gramner> J_Darnley ported all the other recent x86inc changes from x264 but left that one out, I don't know if there was some issue or objection or whatever
[00:01:41 CEST] <nevcairiel> I think he was working up to that one
[00:02:05 CEST] <nevcairiel> and wanted to get the others all done first, especially because of the one cpuflag we had that x264 didn't have yet
[00:02:39 CEST] <Gramner> ah the aesni one. that has been pushed to x264
[00:03:34 CEST] <BBB> Gramner: fixed
[00:04:24 CEST] <BBB> it's really just intended as a slightly updated version of https://wiki.videolan.org/X264_asm_intro/, because that was getting seriously old
[00:06:03 CEST] <Gramner> yeah, that was very much outdated
[00:35:19 CEST] <J_Darnley> Gramner: I don't remember why not.
[00:35:59 CEST] <J_Darnley> Perhaps I thought it a little pointless to port them when we didn't even support an assembler that could use them
[00:36:43 CEST] <J_Darnley> nevcairiel: ah yes
[00:36:48 CEST] <J_Darnley> I think I need to check which ones I pushed
[00:48:07 CEST] <JEEB> BBB: cheers for releasing the updated entry <3
[02:06:52 CEST] <BBB> Gramner: First thing I found out about x86inc.asm is that they put vzeroupper after the register pops in the epilog, which I am not sure is legal on Windows x64. <- is that true?
[02:07:56 CEST] <Gramner> I saw that comment and I have no idea what he's talking about. vzeroupper zeroes the upper parts of vector registers so it has nothing to do with how you pop registers
[02:10:17 CEST] <Gramner> maybe some sort of exception handling stack unrolling stuff that we don't care about?
[02:11:28 CEST] <iive> maybe he's trying to say that they should not be zeroed at all?
[02:11:45 CEST] <Gramner> windows has some complicated exception handling stack debugging stuff that's way too bothersome to care about in asm
[02:12:40 CEST] <Gramner> only the lower xmm parts of caller-saved xmm regs needs to be saved, anything above the first 128 bits are volatile
[02:14:22 CEST] <Gramner> callee-saved*
[02:15:37 CEST] <jamrial_> vzeroupper is also kinda slow. would be cool to have a way to prevent it being issued (other than adding INIT_XMM before RET) on functions where you know you're clearing the high 128 bits of every register
[02:16:11 CEST] <jamrial_> like those that end with HADDD, meaning the last bunch of instructions use xmm and implicitly clear said high bits
[02:16:22 CEST] <Gramner> clearing registers through other means than vzeroupper or vzeroall doesn't work. the cpu state still marks them as dirty in that case
[02:16:31 CEST] <jamrial_> i see
[02:16:46 CEST] <Gramner> avx-512 solves it though. at least on x64
[02:16:50 CEST] <jamrial_> ah well, whatever :p
[02:17:18 CEST] <Gramner> since you can freely use regs 16-31 without caring about legacy xmm state (since there are no legacy registers for those)
[02:18:15 CEST] <jamrial_> do vzeroupper and vzeroall work on them if issued, though?
[02:18:50 CEST] <jamrial_> i guess if there's an EVEX encoding of both they should...
[02:19:20 CEST] <Gramner> iirc no
[02:19:24 CEST] <atomnuker> at the start they shouldn't, the CPU zeroes them out for you for free apparently
[02:19:37 CEST] <atomnuker> (if you issue pxor or something else)
[02:20:58 CEST] <iive> vzeroupper is 2 cycles on sandy bridge; strange, it is 4 micro-ops
[02:20:59 CEST] <Gramner> vzeroupper is like 4/1 or something. it can be an issue in really short functions I guess, in longer ones it should be hidden by OoOE pretty well
[02:21:17 CEST] <Gramner> probably µarch-dependent
[02:21:23 CEST] <atomnuker> *at the start
[02:21:28 CEST] <atomnuker> (not in general)
[02:22:20 CEST] <jamrial_> i think i saw one arch in agner's pdf that mentioned one uop per half of a register. so 16 for vzeroupper and 32 for vzeroall
[02:22:54 CEST] <Gramner> vzeroall is super slow iirc. not really sure what the use case of it is
[02:23:32 CEST] <iive> jamrial_: on atom generations?
[02:23:35 CEST] <Gramner> I can't really think of a scenario that requires you to zero all regs completely and is common enough to warrant its own instruction
[02:23:49 CEST] <jamrial_> iive: no, looks like bulldozer
[02:23:56 CEST] <iive> that's amd
[02:24:31 CEST] <iive> you can count on them doing the wrong thing in simd. e.g. maskmov
[02:25:01 CEST] <iive> jamrial_: how is it on ryzen?
[02:25:03 CEST] <jamrial_> yeah, looks like vzeroupper is fine on intel but slow on amd, whereas vzeroall is slow on everything
[02:25:56 CEST] <jamrial_> iive: looks like the same as bulldozer
[02:29:49 CEST] <iive> maybe we should separate assembly by cpu family... because there are really major differences between amd and intel.
[02:31:01 CEST] <atomnuker> not worth it, they're not that dissimilar
[02:31:18 CEST] <iive> jamrial_: btw, do you have a bulldozer?
[02:31:19 CEST] <atomnuker> besides, then we'd get into the whole "intel-optimized ecosystem"
[02:32:20 CEST] <iive> the difference in cycles is quite major.
[02:33:05 CEST] <Gramner> bulldozer is the new p4. nobody really cares about it imho
[02:33:41 CEST] <iive> what are game consoles running on?
[02:34:02 CEST] <atomnuker> iive: they're still fast enough to not drop every single bit of asm code written
[02:34:42 CEST] <iive> atomnuker: you understood me wrong. You need a specially crafted version for amd to run at optimal speed.
[02:34:51 CEST] <Gramner> ps4 and xbone use jaguar
[02:35:41 CEST] <iive> btw, which one was it where you make the code run faster by placing nops between x87 ops?
[02:35:44 CEST] <Gramner> crafting special asm implementations for each µarch is not really worth it. way too hard to maintain
[02:36:00 CEST] <atomnuker> iive: atom
[02:36:19 CEST] <iive> atomnuker: i'm sure it was amd...
[02:36:53 CEST] <atomnuker> no, it was atom
[02:39:06 CEST] <iive> yes, it is atom
[02:39:13 CEST] <iive> apologies to AMD
[04:17:17 CEST] <cone-071> ffmpeg 03Michael Niedermayer 07master:d0ba0be35530: fate: add sub-srt-badsyntax test
[12:10:22 CEST] <wm4> rcombs: someone thinks you're MIA: http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213643.html
[12:19:57 CEST] <ubitux> MIA? not sure it's matching the urban dictionary definition here...
[12:30:53 CEST] <nevcairiel> AWOL maybe =p
[12:31:31 CEST] <durandal_1707> what prediction does ffv1 use?
[12:45:13 CEST] <durandal_1707> atomnuker: i think finite state entropy, fse for short, could be the base for an ffv2 codec
[12:45:43 CEST] <durandal_1707> fse is faster than arithmetic coding
[12:49:41 CEST] <nevcairiel> what about the ANS thing that av1 now uses?
[12:50:13 CEST] <nevcairiel> it's supposed to compress as well as arith with the speed of huffman
[12:52:00 CEST] <JEEB> durandal_1707: there are already ffv2 patches - http://akuvian.org/src/x264/
[13:14:21 CEST] <durandal_1707> nevcairiel: fse is ans
[13:48:47 CEST] <iive> BtbN: got some free time?
[13:50:54 CEST] <BtbN> yeah, just tell me which patch and which tests
[14:01:27 CEST] <iive> just a second
[14:02:08 CEST] <iive> https://patchwork.ffmpeg.org/patch/4270/
[14:11:06 CEST] <BtbN> All the variations again, or just some specific ones that were changed?
[14:13:56 CEST] <durandal_1707> JEEB: how does that compare to ffv1?
[14:15:25 CEST] <JEEB> no idea
[14:15:49 CEST] <JEEB> pengvado might have some answers if he appears out of the mist
[14:17:35 CEST] <BtbN> iive, https://bpaste.net/show/e0f70aedcff3 that's with master + patch, otherwise unmodified
[14:22:24 CEST] <iive> if this is with the same sample, then i think it is good news :D
[14:23:12 CEST] <BtbN> pretty much the exact same commandline
[14:24:22 CEST] <iive> yes, it is the same sample, it's in the log.
[14:25:10 CEST] <iive> try the defines in the .asm file. just the first 2.
[14:30:47 CEST] <BtbN> https://bpaste.net/show/a20324b995c9 CONST_IN_X64_REG_IS_FASTER 0
[14:33:25 CEST] <iive> nice, it is not dramatically faster anymore :) actually almost same speed.
[14:34:20 CEST] <BtbN> https://bpaste.net/show/3d753cd86f43 STALL_WRITE_FORWARDING 1
[14:37:49 CEST] <iive> that one is 7 cycles faster.
[14:38:37 CEST] <iive> i don't think it is worth having a special version just for ryzen.
[14:39:19 CEST] <iive> BtbN: may i ask you to do one more test. restore to the defaults and then in the .asm file, find the cmpss instruction.
[14:39:49 CEST] <iive> -        cmpless     xmm0, xmm3
[14:39:49 CEST] <iive> +        cmpps       xmm0, xmm0, xmm3, 1
[14:42:50 CEST] <jdarnley> BBB: I should get you to plug upipe as another project which uses x86inc
[14:43:30 CEST] <BtbN> https://bpaste.net/show/7dc86cc1a053
[14:44:06 CEST] <BBB> what's upipe?
[14:44:14 CEST] <BBB> is that your low-latency thing for high-bitrate video?
[14:45:06 CEST] <jdarnley> We use it for that yes.
[14:45:17 CEST] <jdarnley> https://github.com/cmassiot/upipe/tree/master/x86
[14:48:31 CEST] <mifritscher1> hi
[14:50:47 CEST] <durandal_1707> hi 
[14:54:58 CEST] <iive> BtbN: this one seems slower.
[14:58:23 CEST] <iive> BtbN: revert to default and run avx1 version. (cpuflags -avx2) I would guess it is still ~100 cycles faster. just want to be sure.
[14:58:40 CEST] <iive> i am probably not going to enable the avx2 version in the final code.
[15:03:57 CEST] <BtbN> https://bpaste.net/show/225d6e1113cd
[15:05:27 CEST] <iive> wow, it is a lot faster than the v3 version. Nice.
[15:07:03 CEST] <iive> it used to be around 1970 c , now it is around 1933 c .
[15:07:40 CEST] <iive> well, that's all the benchmarks i can think of.
[15:15:01 CEST] <iive> thank you very much :)
[15:39:58 CEST] <iive> atomnuker: i wouldn't mind if you ran a few benchmarks, just to be sure everything is ok. and the cmpless->cmpps thing.
[16:56:23 CEST] <atomnuker> durandal_1707: ugh, it's ANS based
[16:56:32 CEST] <atomnuker> no, the daala one is better
[16:56:47 CEST] <atomnuker> it can be SIMDd because it's multisymbol
[16:57:00 CEST] <atomnuker> and neither are any better in terms of compression
[16:57:13 CEST] <atomnuker> they're both extremely close to the entropy limit
[16:59:32 CEST] <atomnuker> also ANS encoding can be quite expensive
[16:59:59 CEST] <atomnuker> it has to reverse all bits in a second pass in the entire packet
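
The last-in, first-out behaviour described above can be seen in a toy rANS coder. The sketch below is only an illustration, not code from FFmpeg, Daala or AV1: the three-symbol alphabet, the frequency table and the names rans_push, rans_pop and rans_roundtrip_is_lifo are all made up for the example, and there is no renormalization, so it only handles short messages whose state fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Toy alphabet (hypothetical values): 3 symbols, frequencies sum to TOTAL. */
enum { NSYM = 3, TOTAL = 8 };
static const uint32_t freq[NSYM] = { 5, 2, 1 };
static const uint32_t cum[NSYM]  = { 0, 5, 7 };  /* cumulative frequencies */

/* Push one symbol onto the state (rANS encode step, no renormalization).
 * The state grows by up to a factor of TOTAL per symbol, so keep
 * messages under roughly 20 symbols to stay within 64 bits. */
static uint64_t rans_push(uint64_t x, int s)
{
    assert(s >= 0 && s < NSYM);
    return (x / freq[s]) * TOTAL + cum[s] + (x % freq[s]);
}

/* Pop the most recently pushed symbol and restore the previous state. */
static int rans_pop(uint64_t *x)
{
    uint32_t slot = (uint32_t)(*x % TOTAL);
    int s = 0;

    /* Find the symbol whose cumulative range contains the slot. */
    while (slot >= cum[s] + freq[s])
        s++;
    *x = freq[s] * (*x / TOTAL) + slot - cum[s];
    return s;
}

/* Encode msg front to back, then decode n symbols.
 * Returns 1 if the decoder yields msg in reverse order and the
 * state ends up back at its starting value, 0 otherwise. */
static int rans_roundtrip_is_lifo(const int *msg, int n)
{
    const uint64_t start = TOTAL;
    uint64_t x = start;

    for (int i = 0; i < n; i++)
        x = rans_push(x, msg[i]);
    for (int i = n - 1; i >= 0; i--)
        if (rans_pop(&x) != msg[i])
            return 0;
    return x == start;
}
```

Pushing symbol 1 and then symbol 2 and popping twice yields 2 first and 1 second. That is why an ANS encoder generally has to buffer its input and process it backwards, or reverse its output, so that the decoder sees symbols in forward order.
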
[18:06:58 CEST] <durandal_1707> atomnuker: so is there a lib of this daala stuff with encoder and decoder?
[18:13:54 CEST] <atomnuker> durandal_1707: https://github.com/atomnuker/FFmpeg-FFA1/blob/master/libavcodec/daala_entropy.c%
[18:14:01 CEST] <atomnuker> https://github.com/atomnuker/FFmpeg-FFA1/blob/master/libavcodec/daala_entropy.c
[18:14:44 CEST] <atomnuker> encoding and decoding, mostly rewritten ant optimized daala entropy coding functions
[18:14:48 CEST] <atomnuker> *and
[18:17:08 CEST] <atomnuker> ff_daalaent_encode_generic/ff_daalaent_decode_generic is what's SIMDable and what's going to be called the most
[18:17:58 CEST] <atomnuker> it's what does adaptation on a per-call basis
[19:42:40 CEST] <JEEB> huh
[19:43:07 CEST] <JEEB> with an mpeg-ts stream I have on hand avformat_find_stream_info actually ends up having memory "still reachable" according to valgrind
[19:43:13 CEST] <JEEB> with an mp4 file it doesn't happen
[19:43:36 CEST] <JEEB> I used `valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --trace-children=yes`
[19:44:14 CEST] <JEEB> http://up-cat.net/p/a4ba862e
[19:44:40 CEST] <JEEB> of course since it's a one-off it's not too bad
[19:54:00 CEST] <nevcairiel> The lock manager leaking is kinda normal
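
For context, valgrind's "still reachable" category covers memory that was never freed but is still pointed to when the process exits, which is what a one-time global allocation looks like. The sketch below is a generic illustration and assumes nothing about how FFmpeg's lock manager is actually implemented; global_table and get_global_table are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical process-wide table: allocated on first use, never freed. */
static int *global_table;

/* Returns the shared table, allocating it on the first call.
 * Toy code: not thread-safe, and it aborts on allocation failure. */
static int *get_global_table(void)
{
    if (!global_table) {
        global_table = calloc(256, sizeof(*global_table));
        assert(global_table != NULL);
    }
    return global_table;
}
```

Because the static pointer still refers to the block at exit, a run with `--leak-check=full --show-leak-kinds=all` would list it as still reachable rather than as definitely lost. The allocation happens once, so the amount does not grow with repeated calls.
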
[19:55:05 CEST] <JEEB> ok
[19:57:22 CEST] <wm4> yeah
[19:57:30 CEST] <wm4> also the lock manager shouldn't exist
[19:57:36 CEST] <wm4> but that's still a far goal
[19:57:53 CEST] <nevcairiel> we already have the solution to that, just needs a lot of work :p
[19:58:08 CEST] <kierank> Does it protect more than static tables?
[19:58:53 CEST] <nevcairiel> no
[19:59:15 CEST] <nevcairiel> i believe it's not even used if the codec is marked as threadsafe
[19:59:22 CEST] <nevcairiel> which might explain why JEEB may not see it with some file
[20:00:08 CEST] <JEEB> yea
[20:00:17 CEST] <JEEB> the mp4s and mkvs with H.264 and AAC don't get it
[20:00:32 CEST] <JEEB> the MPEG-TS with MPEG-2 Video, AAC and H.264 all mixed gets it
[20:00:50 CEST] <nevcairiel> it's probably mpeg2 then
[20:01:08 CEST] <wm4> hm I should propose a patch to deprecate the codec register functions
[20:01:15 CEST] <wm4> and call them automatically instead
[20:27:00 CEST] <JEEB> oh, so input in lavf has close, but output doesn't? interesting
[20:27:14 CEST] <JEEB> just expected those two to be symmetrical
[20:33:13 CEST] <JEEB> oh, right. misread the name of the function. that's why
[21:58:09 CEST] <faLUCE> Hello. Which is the correct form for this instruction?  1)  av_opt_set(mMuxerContext->priv_data, "mpegts_flags", "pat_pmt_at_frames", 1);    OR 2) av_opt_set(mMuxerContext->priv_data, "pat_pmt_at_frames", "1", 0);
[22:18:03 CEST] <wm4> I don't think priv_data is a public field, but haven't checked
[22:21:39 CEST] <TD-Linux> atomnuker, fwiw ANS can be multisymbol too (rANS)
[22:30:46 CEST] <atomnuker> rans's downside gets particularly bad with the huge packet sizes of lossless video
[22:35:06 CEST] <JEEB> were refcounted frames limited to decoding?
[22:36:49 CEST] <atomnuker> no, pretty sure you can give encoders refcounted frames
[22:42:38 CEST] <JEEB> well I'm just having an init() function and I wondered if I wanted to put the av_opt_set_int of refcounted_frames only to a decoder or not
[22:43:01 CEST] <JEEB> of course they are usable with encoders as well :)
[22:43:12 CEST] <JEEB> although wait, that doesn't make sense
[22:43:31 CEST] <JEEB> as in, of course they don't need the flag since encoders don't output AVFrames :P
[22:51:13 CEST] <nevcairiel> encoders might potentially store multiple input frames
[22:51:21 CEST] <nevcairiel> although not sure if any make use of that with refcounting
[22:51:41 CEST] <nevcairiel> the option wouldn't make a difference for them, though
[22:52:59 CEST] <JEEB> yeah
[22:53:10 CEST] <JEEB> I've been looking at the screen a bit too much recently I guess :P
[00:00:00 CEST] --- Sun Jul 16 2017