[Ffmpeg-devel-irc] ffmpeg-devel.log.20170620

Wed Jun 21 03:05:02 EEST 2017

[00:02:29 CEST] <nevcairiel> Gramner: now that the intel nda is officially over, do you have some numbers how much avx512 helps for x264? i would guess some single digit percentages at best?
[00:03:37 CEST] <hanna> `master` fails building: make: *** No rule to make target 'libavcodec/x86/simple_idct.c', needed by 'libavcodec/x86/simple_idct.o'.  Stop.
[00:04:11 CEST] <durandal_1707> make clean
[00:04:17 CEST] <iive> hanna: make distclean
[00:04:19 CEST] <durandal_1707> make distclean
[00:04:20 CEST] <Gramner> 5-10% faster overall vs avx2 on preset veryfast on a cpu with a single 512-bit ALU. note: on my early ES with low frequencies etc., don't know how well that correlates to retail cpu:s
[00:04:27 CEST] <durandal_1707> configure
[00:04:42 CEST] <Gramner> also it's not finished. still stuff left to optimize
[00:05:11 CEST] <hanna> Ah, that fixes it (I ran `git clean -fdx`)
[00:05:23 CEST] <J_Darnley> BBB: noted and tyvm.  How many beers do we owe you yet?
[00:05:41 CEST] <J_Darnley> Enough to die from alcohol poisoning I bet.
[00:06:32 CEST] <wm4> so, why does nobody fix that build failures like this can happen?
[00:06:44 CEST] <BBB> it's ok, I hear france has much better medical care than me
[00:06:47 CEST] <BBB> wm4: what happened?
[00:06:54 CEST] <nevcairiel> wm4: because make is dumb
[00:06:58 CEST] <wm4> BBB: see hanna's comment
[00:07:06 CEST] <BBB> hanna: rm -f libavcodec/x86/simple_idct.d
[00:07:14 CEST] <BBB> make is unfortunately kind of stupid, yes
[00:07:15 CEST] <wm4> nevcairiel: never happens to me with my own makefile for my own project (also has "auto" deps)
[00:07:31 CEST] <nevcairiel> if its in a .d file, make wont ignore it
[00:07:50 CEST] <jamrial> wm4: do you often replace c files with asm files of the same name? because that's what generated that error :p
[00:08:02 CEST] <J_Darnley> My bad
[00:08:04 CEST] <BBB> it doesn't have to be the same name
[00:08:05 CEST] <wm4> jamrial: it also happens if you remove files
[00:08:07 CEST] <J_Darnley> and yes
[00:08:08 CEST] <jkqxz> You're basically meant to clean after any update with the system we have.  It's just that it usually doesn't break, because people rarely remove files.
[00:08:09 CEST] <BBB> if it's a different name, it still happens
[00:08:20 CEST] <cone-329> ffmpeg Michael Niedermayer master:ae6f6d4e34b8: avcodec/x86/mpegvideo: Use intra scantable in dct_unquantize_h263_intra_mmx()
[00:09:00 CEST] <BBB> michaelni: any idea why it used the inter scantable in h263_intra_mmx/c?
[00:09:14 CEST] <BBB> also how about sending patches
[00:09:40 CEST] <BBB> (note it does the same thing for C code also)
[00:09:51 CEST] <J_Darnley> Hm.  Does that obsolete one of ours?
[00:10:04 CEST] <BBB> line 228 in mpegvideo.c:
[00:10:06 CEST] <BBB>         nCoeffs= s->inter_scantable.raster_end[ s->block_last_index[n] ];
[00:10:19 CEST] <BBB> note inter_scantable use in function dct_unquantize_h263_intra_c()
[00:10:22 CEST] <BBB> michaelni: ^^
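The quoted line reads `raster_end[block_last_index]` from a scantable, so the choice of inter versus intra table decides how many coefficients get dequantized. A minimal sketch of that lookup follows, assuming the standard 8x8 zigzag order and ignoring the IDCT permutation; `build_raster_end` and `last_raster_index` are hypothetical names for illustration, not libavcodec functions.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 8x8 zigzag scan: scan position -> raster index. */
static const uint8_t zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* raster_end[i] = largest raster index touched by scan positions 0..i. */
static void build_raster_end(const uint8_t scan[64], uint8_t raster_end[64])
{
    int end = -1;
    for (int i = 0; i < 64; i++) {
        if (scan[i] > end)
            end = scan[i];
        raster_end[i] = (uint8_t)end;
    }
}

/* Hypothetical helper: the upper bound a dequantize loop running in raster
 * order would use when the last coded coefficient sits at scan position
 * block_last_index. A different scan table gives a different bound. */
static int last_raster_index(const uint8_t scan[64], int block_last_index)
{
    uint8_t raster_end[64];
    assert(block_last_index >= 0 && block_last_index < 64);
    build_raster_end(scan, raster_end);
    return raster_end[block_last_index];
}
```

With the zigzag table, a block whose last coefficient is at scan position 3 needs raster indices up to 16. Looking the same position up in a table built from a different scan order could return a smaller bound and leave coded coefficients untouched, which is presumably why the choice of table matters here.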
[00:10:33 CEST] <TMM> crap, I accidentally merged two commits
[00:10:36 CEST] <jamrial> curious fate wasn't affected. guess this code isn't tested?
[00:10:36 CEST] <nevcairiel> Gramner: i'll definitely give it a try on a retail cpu, if you're interested in numbers. presumably the 10-core has 2 AVX512 ports as well,  instead of just one like the 6 and 8 cores
[00:10:55 CEST] <michaelni> BBB, i noticed, i changed it and i forgot to save the file so it ended up not in the commit
[00:11:02 CEST] <BBB> michaelni: ah, ok
[00:11:13 CEST] <BBB> should send patch for review :-p
[00:11:18 CEST] <BBB> anyway
[00:11:23 CEST] <jamrial> BBB: git blame blames d50635cd247
[00:11:28 CEST] <J_Darnley> TMM: check `git reflog` then `git reset`
[00:13:28 CEST] <BBB> michaelni: I think the confusing thing for me is: why is mpeg1/2 using the intra scantable for quantize_inter as well as intra, and why is (was) h263 using the inter scantable for quantize_intra as well as inter?
[00:13:42 CEST] <michaelni> BBB either it was a copy / paste mistake that it was inter or it was an attempt to reduce data cache use but as its wrong it doesn't really matter
[00:13:46 CEST] <BBB> michaelni: does that mean anything? or are they all typos?
[00:13:59 CEST] <BBB> ok, that clears things up
[00:14:05 CEST] <BBB> is the mpeg1 thing also buggy then?
[00:14:25 CEST] <BBB> e.g.         int j= s->intra_scantable.permutated[i]; in dct_unquantize_mpeg1_inter_c():92
[00:14:31 CEST] <michaelni> for mpeg1 inter=intra scantable so it should be fine
[00:14:36 CEST] <BBB> aha, ok
[00:14:43 CEST] <BBB> mpeg2 also?
[00:17:19 CEST] <Gramner> nevcairiel: yeah, feel free to test. I'll be replacing my ES with a retail version as well later
[00:18:34 CEST] <michaelni> BBB, yes mpeg2 also has intra == inter, it can be zigzag or alt scan depending on a bit in picture coding ext
[00:18:44 CEST] <BBB> okay
[00:19:11 CEST] <BBB> tnx
[00:21:34 CEST] <BBB> J_Darnley: so I'm guessing that (after the C function is updated), that patch can be dropped
[00:21:42 CEST] <TMM> J_Darnley, thanks
[00:22:01 CEST] <J_Darnley> Mm.  I will check though before discarding
[00:24:07 CEST] <BBB> J_Darnley: good idea, tnx
[00:43:57 CEST] <wm4> opinions? https://haasn.xyz/files/ffmpeg-threads/img.png
[00:44:59 CEST] <iive> scan order should not depend on inter/intra, however normal/alt is used for progressive/interlaced  pictures
[00:45:04 CEST] <iive> however
[00:45:18 CEST] <iive> quant tables does differ for inter/intra.
[00:45:37 CEST] <Compn> speedup defined as ? 
[00:46:48 CEST] <nevcairiel> wm4: that graph needs more context, are there actually 32 cpu threads, or is it just flooding the cpu with more work
[00:47:01 CEST] <wm4> hanna: please provide context lol
[00:47:12 CEST] <hanna> 16 core CPU with hyperthreading
[00:47:26 CEST] <wm4> nevcairiel: these are libavcodec threads
[00:47:31 CEST] <hanna> speedup is defined as frames per second, normalized to single thread being the baseline
[00:47:33 CEST] <wm4> AVCodecContext.threads or something
[00:47:36 CEST] <hanna> ffmpeg -threads
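The metric described above (frames per second normalized to the single-thread run), and the "1.0 is realtime" alternative hanna mentions right after, can be written down as a small sketch. The function names and any numbers fed to them are illustrative assumptions, not taken from hanna's measurement scripts.

```c
#include <assert.h>

/* Speedup of an N-thread run relative to the single-thread baseline:
 * 1.0 means "as fast as one thread". */
static double speedup(double fps_n_threads, double fps_one_thread)
{
    assert(fps_one_thread > 0.0);
    return fps_n_threads / fps_one_thread;
}

/* Alternative normalization: 1.0 means the clip decodes exactly in realtime.
 * clip_fps is the nominal frame rate of the clip itself. */
static double realtime_factor(double decode_fps, double clip_fps)
{
    assert(clip_fps > 0.0);
    return decode_fps / clip_fps;
}
```

The first form makes clips with very different absolute decode speeds comparable on one graph, while the second shows how much headroom each clip has over playback speed.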
[00:47:49 CEST] <TMM> I really don't know how to git it seems
[00:47:58 CEST] <hanna> hmm maybe I should have normalized it so that 1.0 is realtime
[00:48:02 CEST] <hanna> that could have been more interesting in retrospect
[00:48:12 CEST] <nevcairiel> wm4: i assumed that, but for the graph to make sense you still need to know how many threads the cpu actually can run at the same time
[00:48:29 CEST] <wm4> right
[00:48:30 CEST] <hanna> but since some of these clips inherently decode way faster than others (for some reason) it would also obscure the trend
[00:48:48 CEST] <nevcairiel> the pink one is rather obscure with its "levels"
[00:48:53 CEST] <hanna> I wonder what's up with the pink one yeah
[00:49:12 CEST] <nevcairiel> perhaps enough to get a new reference frame into the decoder earlier
[00:49:16 CEST] <hanna> I didn't go to great lengths to make sure the system had no other load at the time, so the numbers jittering occasionally makes sense
[00:49:23 CEST] <nevcairiel> more threads = more "look ahead"
[00:49:25 CEST] <hanna> but that seems too regular and weird to be just jitter
[00:49:52 CEST] <nevcairiel> if it can work on more I frames in parallel (which are the most expensive), you can probably get more speed that way
[00:50:02 CEST] <hanna> that would make sense
[00:50:04 CEST] <nevcairiel> but yeah HEVC decode doesnt scale all that well
[00:50:23 CEST] <hanna> I bet if you look up the I frame distribution in that pink clip it would match the number of threads needed per step
[00:50:41 CEST] <nevcairiel> at one point someone was suggesting a combined frame/slice decode mode to speed up costly intra frames further
[00:50:48 CEST] <hanna> I wonder if it wouldn't make more sense to load an I-frame per thread and have each thread do the P/B frames on its own
[00:51:06 CEST] <hanna> or maybe those pink steps are P frames and not I frames
[00:51:29 CEST] <hanna> either way, parallelize the ones that are the hardest and generate the ones that are easy on-demand
[00:51:59 CEST] <nevcairiel> the threading is way too rigid for that
[00:52:11 CEST] <hanna> maybe with some sort of hierarchical thing: each contiguous group of B-frames gets a single thread
[00:52:14 CEST] <iive> hanna: at least with x264, the threads used to encode the video translated quite nicely to the threads used to decode the video
[00:52:23 CEST] <atomnuker> Gramner: are there many places where avx512 can help x264?
[00:52:31 CEST] <atomnuker> with blocks that small...
[00:52:43 CEST] <nevcairiel> something like the idea  above where you use slice thread to decode intra frames faster may help scaling on such clips
[00:52:49 CEST] <iive> because the encoding was done in parallel and thus, less dependencies between the threads
[00:53:07 CEST] <nevcairiel> especially if a clip is encoded with wavefront parallel processing
[00:53:07 CEST] <hanna> it's possible the jump in the orange test was also such a phenomenon
[00:53:14 CEST] <Gramner> a decent amount, yes. obviously a lot less than more modern video formats though
[00:53:42 CEST] <iive> hanna: also, in the graph it seems that the speed saturates somewhere around 16 cores, that is also the number of physical cores.
[00:53:54 CEST] <nevcairiel> thats no surprise
[00:55:12 CEST] <wm4> so I assume libavcodec frame threading is not advanced enough to decode multiple GOPs in parallel given enough worker threads?
[00:55:31 CEST] <hanna> iive: I'm not sure if I'm seeing that
[00:55:34 CEST] <hanna> for the blue clip yes
[00:55:38 CEST] <hanna> for the orange it keeps going up
[00:55:42 CEST] <hanna> for the green it saturates way earlier
[00:55:43 CEST] <nevcairiel> it'll decode as many frames as you have threads
[00:55:45 CEST] <hanna> for the pink it also keeps going up
[00:56:02 CEST] <nevcairiel> but of course it doesn't try to be smart about it
[00:56:12 CEST] <nevcairiel> it just decodes the next X frames
[00:56:51 CEST] <hanna> it's actually technically two 8 core CPUs with NUMA in between them, I would expect that to maybe introduce some sort of extra overhead after 8 cores
[00:57:07 CEST] <wm4> I'm wondering whether it'd make sense to make things async (with packet and frame queues) to reuse the worker threads faster or so
[00:57:34 CEST] <hanna> wm4: the most important speedup would be to have lavc decode directly to GPU-mapped buffers
[00:57:38 CEST] <hanna> instead of doing the extra memcpy()
[00:57:47 CEST] <iive> hanna: it does keep going up, but at much smaller rate. basically +2 speed for 16 more threads.
[00:57:51 CEST] <hanna> you could steal atomnuker's code from his vulkan branch for this
[00:57:52 CEST] <wm4> hanna: that's possible with the API
[00:57:55 CEST] <hanna> that one helps a _lot_
[00:58:07 CEST] <nevcairiel> doubtful that has a huge impact, since if an early thread is still busy any following frames will likely depend on it
[00:58:12 CEST] <hanna> and would almost surely make your frame queue thing irrelevant
[00:58:12 CEST] <wm4> but we're stuck with OpenGL, so...
[00:58:18 CEST] <hanna> wm4: OpenGL can do this too
[00:58:21 CEST] <hanna> that's what PBOs are for, right?
[00:58:33 CEST] <iive> hanna: are you sure you are not going to hit PCIE limit?
[00:58:43 CEST] <wm4> hanna: the decoder needs to be able to read from the frames even while the renderer uses them
[00:58:57 CEST] <wm4> classic PBOs AFAIK can't do this, although newer GL things can
[00:58:58 CEST] <iive> the problem is that frames are used as reference and need to be read back.
[00:59:11 CEST] <wm4> (and then you have to do synchronization manually)
[00:59:42 CEST] <wm4> atomnuker: when you tried with vulkan, how much did marking get_buffer2 as thread-safe help?
[00:59:45 CEST] <nevcairiel> in my experience the upload to the gpu doesnt really limit that many things, you can spin that off into a different thread and dont block the decoder
[01:00:07 CEST] <iive> also, gpu can use dma to upload
[01:00:19 CEST] <hanna> iive: suppose each frame needs to be written to once and read back once, that gives us an approximate bandwidth consumption for 4K 60 Hz 10-bit 4:2:0 content of 7.5 Gbps
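The bandwidth estimate can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: 3840x2160 luma, 4:2:0 chroma (two planes at half width and half height), and samples counted at their nominal bit depth with no padding. Under those assumptions the 7.5 Gbps figure corresponds to one pass over each frame; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits per second needed to touch every sample of a 4:2:0 stream once,
 * counting bit_depth bits per sample (no padding to 16 bits). */
static uint64_t yuv420_bits_per_second(uint32_t width, uint32_t height,
                                       uint32_t bit_depth, uint32_t fps)
{
    assert(width % 2 == 0 && height % 2 == 0);
    uint64_t luma   = (uint64_t)width * height;
    uint64_t chroma = 2 * ((uint64_t)(width / 2) * (height / 2));
    return (luma + chroma) * bit_depth * fps;
}
```

For 3840x2160 at 60 Hz and 10 bits this gives 7,464,960,000 bit/s, about 7.5 Gbps. If the 10-bit samples are stored in 16-bit words, as is common for software decoder output, the same pass costs about 11.9 Gbps, and a write plus a read-back doubles either figure.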
[01:00:28 CEST] <hanna> iive: actually
[01:00:48 CEST] <iive> hanna: with 16 ref frames for h264... you can do a lot of reading.
[01:00:57 CEST] <hanna> iive: I don't think this goes through PCIe at all
[01:01:16 CEST] <hanna> GPU mapped buffers are still in system RAM, but the GPU DMA engine can access them
[01:01:28 CEST] <wm4> hanna: could also be that we should try to pipeline upload better
[01:01:39 CEST] <hanna> what happens under the hood is a vkCmdCopyBufferToImage call (or w/e it's called) which copies out from the VkBuffer (on system memory) into the VkImage (on GPU memory)
[01:01:41 CEST] <hanna> using the DMA engine
[01:01:51 CEST] <hanna> but this is free CPU-wise (no memcpy)
[01:02:04 CEST] <iive> hanna: hum, I thought that the dma is not limited to specific range
[01:02:16 CEST] <iive> anymore
[01:02:19 CEST] <atomnuker> wm4: significantly with 8k30 and 4k60
[01:02:24 CEST] <hanna> whereas what happens in the vo_opengl example currently is: 1. decode to buffer malloc()'d by lavc, 2. memcpy() into buffer malloc'd by OpenGL driver, 3. DMA engine copies out from this driver-allocated buffer
[01:03:04 CEST] <wm4> hanna: we also try to use the texture immediately after issuing the copy, which might not be ideal
[01:03:12 CEST] <wm4> but who knows how many other pipeline stalls there are
[01:03:22 CEST] <hanna> most OpenGL drivers optimize client-side pixel transfers by copying the data to internal memory anyway.  from the opengl wiki page on PBOs
[01:03:25 CEST] <nevcairiel> unless your system is super low powered, the single memcpy shouldn't really hurt that many things,  spin it into its own thread to not block the decoder, and it should be fine
[01:03:39 CEST] <atomnuker> though if you can map a buffer with a device memory vkcmdcopybuffertoimage is free :)
[01:03:46 CEST] <wm4> hanna: yeah, still brought a speedup in some situations though (which is ????)
[01:03:53 CEST] <atomnuker> even better on intel where gpu memory == system memory
[01:04:06 CEST] <nevcairiel> for the upload itself, NVIDIA actually has dedicated copy engines for OpenGL PBOs, and you can use fences to get signaled when its done uploading to avoid a stall in the render pipeline
[01:04:17 CEST] <hanna> nevcairiel: the memcpy() is in the VO thread which can block for longer than the duration of a vsync when the machine needs to access memory from the other CPU iirc
[01:04:27 CEST] <hanna> and also for random reasons otherwise
[01:04:36 CEST] <hanna> forcing all allocations to come from the same NUMA node helps but doesn't completely eliminate it
[01:04:42 CEST] <nevcairiel> then move it out of that thread, why does it have to be in there? =p
[01:04:47 CEST] <hanna> it also takes a non-trivial amount of time overall
[01:04:47 CEST] <wm4> hanna: so I think we should try uploading a few frames ahead, with dedicated PBOs, which should be pretty simple with the infrastructure for interpolation
[01:04:58 CEST] <hanna> nevcairiel: that's actually what I did in my vo_vulkan test, I had a thread dedicated to the memcpy()ing
[01:05:02 CEST] <hanna> but it's annoying for many reasons
[01:05:13 CEST] <hanna> wm4: possible
[01:05:19 CEST] <hanna> wm4: I mean we essentially already do this for interpolation
[01:05:29 CEST] <nevcairiel> uploading PBOs is actually one of the only things that you can properly multi-thread even in OpenGL .. and GL otherwise hates any sort of threading
[01:05:32 CEST] <atomnuker> wm4: btw the vo.h changes allow VOs to flag for threadsafe buffer using VO_CAP_GET_BUFFER_THREADSAFE
[01:05:33 CEST] <wm4> hanna: interpolation also uses the texture immediately
[01:05:35 CEST] <hanna> although interpolation uploads and then immediately accesses them
[01:05:37 CEST] <hanna> yeah
[01:05:41 CEST] <wm4> hanna: we just need to add an "offset"
[01:06:30 CEST] <hanna> >Most of what you gain is the ability to load data directly into the PBO itself, which means that OpenGL won't need to copy it. You may even be able to stream data directly from disk into a mapped buffer.
[01:06:34 CEST] <hanna> this is literally what we would need to do
[01:06:44 CEST] <hanna> wm4: I'm pretty sure we just need to copy/paste the vo_vulkan code from atomnuker's branch
[01:06:50 CEST] <hanna> 1. allocate PBOs, 2. give these to vd_lavc, let it decode into them
[01:07:16 CEST] <hanna> this avoids the memcpy() and everybody's happy
[01:07:25 CEST] <wm4> hanna: feel free to try
[01:07:26 CEST] <hanna> there's also some CUDA shit we could do
[01:07:34 CEST] <hanna> I hate OpenGL :p
[01:07:47 CEST] <wm4> well yeah maybe it's easier to use cuda surfaces
[01:08:01 CEST] <hanna> CUDA lets you basically take an existing memory region (not allocated by the driver) and make it GPU-visible
[01:08:04 CEST] <hanna> so the DMA engine can then copy out of it
[01:08:04 CEST] <wm4> although these requires weird on-GPU memcpys
[01:08:14 CEST] <hanna> but iirc OpenGL itself does not inherently let you do this
[01:08:23 CEST] <wm4> (or maybe this case is too hardcoded for ffmpeg cuvid)
[01:08:23 CEST] <hanna> you have to let the driver allocate the buffer for you
[01:08:26 CEST] <nevcairiel> you need to keep the PBOs mapped for way too long, OGL may not like that
[01:08:36 CEST] <wm4> nevcairiel: yes, that needs newer stuff
[01:08:45 CEST] <TMM> goddamnit, whenever I rebase edit one of the commits I end up with a merge conflict and when I resolve it git merges the two commits
[01:08:58 CEST] <wm4> and the "do it correctly or the driver will abort()" kind of manual synchronization
[01:09:21 CEST] <hanna> https://en.wikipedia.org/wiki/CUDA_Pinned_memory this thing iirc
[01:09:22 CEST] <wm4> TMM: you're doing something wrong
[01:09:30 CEST] <TMM> wm4, I have no doubt :)
[01:09:47 CEST] <wm4> hanna: why does it have a wikipedia page
[01:10:05 CEST] <nevcairiel> with classic PBOs the problem is that it would only flush it to the GPU when you unmap it, and you can only unmap it when the frame is removed from the DPB ... which would be way too late. Not sure what these more modern extensions really do
[01:10:09 CEST] <hanna> Non-locked / non-pinned memory does not reside only in main memory (e.g. it can be in swap), so the driver needs to access every single page of the non-locked memory, copy it into pinned buffer and pass it to the Direct Memory Access (DMA) (synchronously page-by-page copy).
[01:10:17 CEST] <hanna> ^ this is the situation with OpenGL as well
[01:10:24 CEST] <hanna> the opengl driver copies it into pinned memory
[01:10:40 CEST] <hanna> which is why we can't just take vd_lavc's char* and somehow give it to the GPU, it's not pinned
[01:10:42 CEST] <wm4> nevcairiel: the modern extensions (and maybe standard in one of the 4.x versions) let you keep it mapped, somehow
[01:10:54 CEST] <hanna> and may be paged, moved around, compressed, swapped etc. by the kernel
[01:10:56 CEST] <hanna> at any time
[01:11:52 CEST] <hanna>  To allocate page-locked memory on the host in CUDA language one could use cudaHostAlloc.[3]
[01:12:12 CEST] <hanna> >Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().
[01:12:36 CEST] <hanna> but then how to import it as an OpenGL texture? lol
[01:13:57 CEST] <hanna> https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st seems there is an interop API
[01:14:36 CEST] <nevcairiel> cuda gl interop exists, but its a bit iffy
[01:15:33 CEST] <hanna> https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDA__GL_g52c3a36c4c92611b6fcf0662b2f74e40.html
[01:15:51 CEST] <TMM> oh, I think it's because I committed the patches not rebased them
[01:16:01 CEST] <hanna> and then maybe you can use the cuda memcpy thing to copy from the cudaHostAlloc'd buffer to this registered image
[01:16:11 CEST] <hanna> TMM: `rebase -i` is your friend
[01:16:54 CEST] <nevcairiel> cudaHostAlloc docs warn about using it to allocate too much memory, but who knows how much is "too much" :D
[01:17:13 CEST] <rcombs> this reminds me, is it generally possible to convert/remap a buffer between "regular" (WB) and "video" (WC) memory on Intel?
[01:28:16 CEST] <iive> rcombs: in theory these are controlled by registers or page table attributes
[01:28:34 CEST] <iive> i mean, mtrr or pta
[01:40:19 CEST] <hanna> nevcairiel: probably if the kernel can no longer find contiguous memory regions big enough to fit your allocations
[01:40:30 CEST] <hanna> since pinning means it has to be contiguous in physical memory, right?
[01:41:14 CEST] <philipl_> hanna: mpv already uses the cuda GL interop to transfer cuvid decoded frames in vo_opengl
[01:41:45 CEST] <hanna> philipl: oh neat, so in theory we could piggy-back on top of this as a new hwdec type that uses libavcodec but allocates frames using cudaHostAlloc instead?
[01:42:35 CEST] <hanna> also why is there commented-out code in hw_cuda.c
[01:43:06 CEST] <hanna> oh, that entire file is for the old CUDA API
[01:43:11 CEST] <hanna> whatever that means
[01:44:23 CEST] <philipl> Look at the hwdec implementation to see how it's used. It's all very simple in that context.
[01:44:24 CEST] <hanna> Oh, the code I'm looking for is hwdec_cuda.c
[01:44:29 CEST] <philipl> Yeah
[01:44:52 CEST] <philipl> So, I don't know what you're actually trying to do. You want to optimize memcpy for software decoding?
[01:45:15 CEST] <philipl> If it's in the context of some vo_vulkan with cuvid, nvidia have yet to explain the right way to do gpu side buffer sharing.
[01:45:38 CEST] <philipl> There's no vulkan/cuda interop and vk_external_memory seems incomplete without a cuda side api
[01:45:43 CEST] <hanna> philipl: No, the context has nothing to do with vulkan and nothing to do with cuvid
[01:45:57 CEST] <hanna> philipl: I want to eliminate the memcpy() from vo_opengl
[01:46:11 CEST] <philipl> Ok.
[01:46:36 CEST] <hanna> the idea is to make vd_lavc use cudaHostAlloc() instead of malloc() and then replace memcpy() by cuMemcpy2D()
[01:46:51 CEST] <hanna> (which uses the GPU's DMA engine instead of the host's RAM bandwidth and CPU time)
[01:47:14 CEST] <hanna> well it still uses RAM bandwidth but less
[01:47:26 CEST] <hanna> 1x read instead of 1x read + 1x write + 1x read
[01:48:58 CEST] <hanna> philipl: in theory it should be possible to re-use this code with minor modifications; it wouldn't really be a `hwdec` though, more of a `cuda_swdec`
[01:49:13 CEST] <hanna> maybe --hwdec=cuda-memcpy
[04:55:38 CEST] <philipl_> hanna`: neat. I'd be interested to see how much difference it makes.
[10:05:07 CEST] <cone-165> ffmpeg wm4 master:f1df7cc10c62: ffmpeg: remove misleading and incorrect warning messages
[10:24:20 CEST] <ubitux> why do we have an AVProgram.program_num and an AVProgram.id?
[10:32:27 CEST] <wm4> doesn't only the .ts demuxer use this
[10:34:07 CEST] <nevcairiel> basically, yes
[10:34:22 CEST] <nevcairiel> hls also uses it for variant streams, but it doesnt set any of those fancy fields
[11:13:02 CEST] <wm4> hm did merges completely stop?
[11:18:36 CEST] <ubitux> someone needs to merge the dash stuff
[11:18:47 CEST] <ubitux> and i really wasn't motivated to do it
[11:19:43 CEST] <wm4> libav really made dash changes we don't have?
[11:21:23 CEST] <ubitux> yes, afaik
[11:21:26 CEST] <ubitux> and the other way around
[11:21:30 CEST] <ubitux> it's clashing
[11:21:45 CEST] <ubitux> feel free to look at this, because i won't
[11:27:35 CEST] <wm4> yeah those changes look pretty wild
[11:36:48 CEST] <durandal_1707> can i push random libav patches? not as merge, not as cherrypick.?
[11:37:50 CEST] <nevcairiel> they kinda are either one or the other, are they not? :p
[11:42:26 CEST] <wm4> I'd say that is ok, as long as the merges are trivial enough and pass fate
[11:42:42 CEST] <wm4> but mark them as merges in the commit msg
[11:44:00 CEST] <iive> patches should go through review, either in ffmpeg or in libav.
[11:45:02 CEST] <ubitux> durandal_1707: please reference the original commit hash from Libav
[11:45:06 CEST] <ubitux> it helps to noop the merge
[11:45:17 CEST] <ubitux> and document the differences if appropriate
[11:45:19 CEST] <nevcairiel> (so basically, do a cherry-pick)
[11:45:22 CEST] <ubitux> yeah.
[11:45:29 CEST] <ubitux> cherry-pick -x
[11:45:58 CEST] <ubitux> btw, we're again at 400+ commits to merge
[11:59:44 CEST] <cone-165> ffmpeg Diego Biurrun master:155f071bad5a: build: Add missing idctdsp dependency for clearvideo
[12:13:24 CEST] <wm4> ubitux: any specific format we should use when referencing libav commit hashes?
[12:13:56 CEST] <ubitux> cherry-pick -x is fine, i usually just --grep the hash in the log
[12:14:11 CEST] <ubitux> as long as it's present in the log history, it's fine
[12:14:16 CEST] <ubitux> preferably not shortened
[13:00:01 CEST] <J_Darnley> BBB: did you discover anything with the mov file?
[13:16:00 CEST] <J_Darnley> Did anyone have any objections against me pushing the first 2 patches of my set?  The two for avcodec/x86/mpegvideoenc_template.c?
[13:27:42 CEST] <BBB> J_Darnley: if it's approved, push it
[13:28:10 CEST] <BBB> J_Darnley: haven't looked at mov file yet, just woke up, need to have bath and get to work, then I'll look
[13:28:21 CEST] <J_Darnley> That's fine
[13:29:33 CEST] <cone-165> ffmpeg James Darnley master:fa30a0a54854: avcodec/x86/mpegenc: check IDCT permutation type is a valid value
[13:29:34 CEST] <cone-165> ffmpeg James Darnley master:e3db94302c79: avcodec/x86/mpegenc: support transpose permuation type
[13:35:43 CEST] <BBB> J_Darnley: can you also push some of the clean-up patches to the 10-bit assembly template file?
[13:35:50 CEST] <BBB> or is there a patch missing approval in that series?
[13:36:05 CEST] <J_Darnley> I'll check
[13:39:14 CEST] <J_Darnley> You've signed off on 2 of 4 so I can push those
[13:40:00 CEST] <TMM> It seems that line length, comment style and other formatting style things in all the MVE related files are kind of all over the place
[13:40:09 CEST] <TMM> should I submit a patch that normalizes all of that?
[13:40:32 CEST] <TMM> some of the old comments are also super outdated, even compared to after my changes
[13:40:43 CEST] <TMM>  err compared to *before* my changes
[13:42:24 CEST] <cone-165> ffmpeg James Darnley master:8781330d80e3: avcodec/x86: cleanup simple_idct10
[13:42:25 CEST] <cone-165> ffmpeg James Darnley master:d2597fb0c1c8: avcodec/x86: modify simple_idct10 macros to add an action paramter
[13:57:15 CEST] <BBB> I just reviewed 8/9
[13:57:18 CEST] <BBB> 6/7 already pushed
[13:57:25 CEST] <BBB> 1/2 also pushed
[13:57:40 CEST] <BBB> 5 not needed
[13:57:44 CEST] <BBB> 3-4 and 10-11 still todo?
[13:57:57 CEST] <BBB> 3-4 is mdec...
[13:58:10 CEST] <BBB> 10 awaits my mov thing
[13:58:15 CEST] <BBB> 11 is the simple change
[13:58:16 CEST] <BBB> ok
[13:58:36 CEST] <BBB> you said you'd do the changes to mdec to work, right?
[13:59:59 CEST] <J_Darnley> Yes
[14:00:50 CEST] <BBB> J_Darnley: I changed my opinion on 3, just push as-is
[14:01:06 CEST] <BBB> 4 needs some modifications as suggested by michaelni, there's example code elsewhere of how to do that
[14:01:33 CEST] <J_Darnley> I'm sure I'll manage
[14:02:00 CEST] <BBB> e.g. mpegvideo_enc.c:1013-1033
[14:02:04 CEST] <BBB> ok :)
[14:02:28 CEST] <J_Darnley> I've got other things to do so I will just do the "push as is" ones for now
[14:05:04 CEST] <BBB> I notice you're also tired of this patchset :-p
[14:05:52 CEST] <J_Darnley> Well yes, partly.  But I also feel like I'm not working on the things kierank wants me to work on.
[14:06:55 CEST] <cone-165> ffmpeg James Darnley master:9d11fedd1129: avcodec/mdec: override IDCT choice before initing DSP structs
[14:09:07 CEST] <kierank> BBB: yeah, there's stuff we need to do at job
[14:21:22 CEST] <BBB> :-p
[14:21:38 CEST] <BBB> I'll test if the idct simple override in mdec is still needed
[14:21:41 CEST] <BBB> I don't think it is
[14:21:51 CEST] <BBB> I think the reason it was there is because of the zigzag bug which also affected the mmx simd
[14:22:00 CEST] <BBB> with that fixed (4/11), that patch can go
[14:22:17 CEST] <BBB> s/patch/code/
[14:22:19 CEST] <BBB> anyway
[14:22:22 CEST] <BBB> I'll do that part
[14:22:31 CEST] <BBB> I think it's mostly done anyway
[14:38:13 CEST] <kierank> BBB: sorry :)
[14:38:14 CEST] <kierank> :(
[14:38:24 CEST] <BBB> I get it
[14:47:44 CEST] <BBB> patch for mdec sent
[14:48:20 CEST] <BBB> I probably need to rebase it on top of j_darnley's pushed patch, but then his patchset is pushed all the way until 9/11
[15:04:37 CEST] <BBB> J_Darnley: that one was fairly simple
[15:05:56 CEST] <J_Darnley> ah you did that in the end
[15:06:42 CEST] <BBB> I meant the .mov file michaelni says changed
[15:07:05 CEST] <J_Darnley> oh my mistake
[16:35:40 CEST] <J_Darnley> Does anyone know how to fast-forward a local branch I'm not on to the current HEAD?
[17:15:38 CEST] Action: J_Darnley curses git
[17:16:07 CEST] <tdjones> You want the other branch to have the same head as the branch you're on?
[17:17:06 CEST] <iive> i think he is detached from the current HEAD of the branch
[17:48:30 CEST] <durandal_1707> who the fuck designed activate in lavfi?
[18:26:28 CEST] <jamrial> durandal_1707: a reason why the ticket is "invalid" would be helpful there
[18:40:09 CEST] <TMM> evening all
[18:44:46 CEST] <cone-165> ffmpeg James Darnley master:8221c7170317: avcodec/x86: allow future 8-bit simple idct to use slightly different coefficients
[18:55:15 CEST] <atomnuker> why does VBROADCASTSD not work with 2 registers and only accepts 1 register and 1 address
[18:55:25 CEST] <atomnuker> the non-macro version supports 2 registers
[18:56:56 CEST] <jamrial> atomnuker: that macro was written before avx2 was a thing i guess
[18:56:59 CEST] <atomnuker> also what's the alternative to movlhps? using INIT_YMM instead of xmm results in "invalid combination of opcode and operands" when using it to splat 2 floats to the entire register
[18:57:47 CEST] <J_Darnley> vbroadcastq?
[18:57:52 CEST] <jamrial> with avx, you need to splat using xmm then vinsertf128 to splat to the upper half of the reg
[18:58:06 CEST] <atomnuker> I'm only using 1/2 the full reg here
[18:58:14 CEST] <jamrial> with avx2 you have vbroadcastsd
[18:59:17 CEST] <atomnuker> vbroadcastsd still makes nasm print the same error
[18:59:38 CEST] <jamrial> using what?
[18:59:43 CEST] <atomnuker> 2 registers
[18:59:51 CEST] <jamrial> xmm source and ymm dest?
[18:59:58 CEST] <jamrial> otherwise it wont work
[18:59:59 CEST] <atomnuker> no, both ymm
[19:00:02 CEST] <atomnuker> ah, ok
[19:00:03 CEST] <jamrial> that wont work :p
[19:00:19 CEST] <atomnuker> wait, I can combine xmm and ymm regs?
[19:00:27 CEST] <atomnuker> I thought once I INIT_XMM/YMM, that was it
[19:00:46 CEST] <J_Darnley> xmN ymN
[19:00:46 CEST] <jamrial> that just makes the m* alias expand into either xmm or ymm
[19:01:08 CEST] <jamrial> use the xm* alias to force xmm regs when you use INIT_YMM
[19:02:49 CEST] <jamrial> INIT_XMM -> m0 == xmm0, xm0 == xmm0
[19:02:53 CEST] <jamrial> INIT_YMM -> m0 == ymm0, xm0 == xmm0
[19:04:40 CEST] <iive> you can override register with prefixing it xmm%1 , there are redefines xmmymm into xmm.
[19:05:57 CEST] <iive> atomnuker: i have vbroadcastss emulation in my patch, you can easy change it to *sd
[19:34:18 CEST] <BtbN> hm, I can't think of a clean way to make ffmpeg.c automatically use the cuvid/qsv variants of decoders 
[19:40:50 CEST] <durandal_1707> nice way to troll us: write convoluted code nobody understands
[20:17:20 CEST] <BBB> J_Darnley: Ill push
[20:17:45 CEST] <BBB> J_Darnley: just waiting for michaelni to wake up and see if he objects
[20:18:32 CEST] <J_Darnley> noted
[22:15:44 CEST] <cone-973> ffmpeg John Rummell master:966a0a814d4e: avcodec/decode: Update decode_simple_internal() to get the side data correctly.
[22:56:54 CEST] <atomnuker> what's the standard way of moving 2 xmm vectors into 1 ymm vector (1 high and 1 low)?
[23:01:34 CEST] <Gramner> vinserti128
[23:03:03 CEST] <nevcairiel> the xmm regs are basically half of the actual ymm regs, arent they? so mova the low part into xm0, and vinserti128 the high part into ym0?
[23:03:34 CEST] <atomnuker> vinserti128 m4, m4, xm5 -> invalid combination of opcode and operands
[23:04:00 CEST] <nevcairiel> it needs an imm8 as the 4th argument
[23:04:11 CEST] <nevcairiel> to control it
[23:04:26 CEST] <atomnuker> oh
[23:04:35 CEST] <atomnuker> btw, mova m4, xmm5 -> invalid combination of opcode and operands
[23:05:33 CEST] <nevcairiel> afaik, 0 for low, 1 for high, in the imm8
[23:06:33 CEST] <nevcairiel> to fill a full ymm, you would do like: mova xm4, xm5; vinserti128 m4, m4, xm6, 1;
[23:06:45 CEST] <nevcairiel> (or just take over xm5 if you dont need it anymore)
[23:07:13 CEST] <atomnuker> oh, ok, you can switch between ymm and xmm without losing data then
[23:07:35 CEST] <nevcairiel> well, you can go upwards i think, down requires zeroing the upper parts
[23:07:53 CEST] <nevcairiel> or something
[23:08:28 CEST] <jamrial> <@atomnuker> btw, mova m4, xmm5 -> invalid combination of opcode and operands
[23:08:29 CEST] <Gramner> vinserti128 dst, src_low(ymm), src_high(xmm), 1
[23:08:29 CEST] <jamrial> mova ymm, xmm is of course invalid
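The operand order Gramner gives can be sketched as follows, again assuming FFmpeg's x86inc/x86util macros. The function name `join_halves` and its arguments are hypothetical. The float form `vinsertf128` is used because the data are floats and it only needs AVX, whereas `vinserti128` requires AVX2.

```nasm
; Sketch only: assumes FFmpeg's x86inc/x86util macros; "join_halves" is a hypothetical name.
%include "libavutil/x86/x86util.asm"

SECTION .text

INIT_YMM avx
; void join_halves(float *dst, const float *lo, const float *hi)
cglobal join_halves, 3, 3, 2, dst, lo, hi
    movu         xm0, [loq]         ; low 4 floats; the VEX-coded write zeroes the upper half of ymm0
    movu         xm1, [hiq]         ; 4 floats destined for the upper half
    vinsertf128   m0, m0, xm1, 1    ; dst, src_low (ymm), src_high (xmm), imm8 = 1 selects the upper 128 bits
    movu      [dstq], m0            ; store all 8 floats
    RET
```

With imm8 = 0 the same instruction would replace the lower 128 bits instead and keep the upper half of the ymm source.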
[23:13:52 CEST] <TMM> with regular registers I don't think that going from al/ah to eax/rax ever destroys data
[23:15:27 CEST] <nevcairiel> well you just have to hope the datatype you have in there thought of that :) but thats not a problem with simd regs anyway, since you increase the number of elements, not the size of the data
[23:20:15 CEST] <atomnuker> having 8 floats in 1 register is huge, how can anyone deal with this many values, this should be outlawed
[23:20:29 CEST] <J_Darnley> :)
[23:20:29 CEST] <atomnuker> btw is there any way to not resolve arguments passed to macros?
[23:20:30 CEST] <BBB> you poor fellow
[23:20:37 CEST] <BBB> we integer people use 32 values per register
[23:20:43 CEST] <BBB> muhahahaha
[23:20:47 CEST] <BBB> :-p
[23:20:59 CEST] <TMM> Pff, I write only code using short ints
[23:21:01 CEST] <atomnuker> e.g. FFT5 m4 when INIT_YMM is on resolves the m4 to ymm4 and replaces it in the macro
[23:21:03 CEST] <nevcairiel> i have a cpu on the way that fits 16 floats, get with the times
[23:21:05 CEST] <TMM> I do 64 values per register
[23:21:28 CEST] <durandal_170> write asm for ffr
[23:21:35 CEST] <atomnuker> ffr?
[23:21:42 CEST] <BBB> atomnuker: if you want it to stay m so you can do x%d or y%d, use just 4 as argument
[23:21:57 CEST] <BBB> atomnuker: see e.g. how transposeNxN do that
[23:22:00 CEST] <atomnuker> yeah, but I use just "4" just before it as an offset
[23:22:10 CEST] <atomnuker> and it looked kinda bad, but w/e, it works
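A small sketch of what BBB suggests: pass the bare register number and let the macro build `m%1` or `xm%1` itself, so the same index can name either the ymm or the xmm view. It assumes FFmpeg's x86inc/x86util macros. The macro `JOIN_HI` and the function `join_idx` are hypothetical names.

```nasm
; Sketch only: assumes FFmpeg's x86inc/x86util macros; JOIN_HI and join_idx are hypothetical names.
%include "libavutil/x86/x86util.asm"

; %1 = index of the destination register, %2 = index of the register holding the new high half
%macro JOIN_HI 2
    vinsertf128 m%1, m%1, xm%2, 1   ; under INIT_YMM: m%1 is the ymm view, xm%2 the xmm view
%endmacro

SECTION .text

INIT_YMM avx
; void join_idx(float *dst, const float *lo, const float *hi)
cglobal join_idx, 3, 3, 2, dst, lo, hi
    movu      xm0, [loq]            ; low 4 floats
    movu      xm1, [hiq]            ; high 4 floats
    JOIN_HI     0, 1                ; bare indices, not m0/xm1
    movu   [dstq], m0
    RET
```

Had the macro been invoked as `JOIN_HI m0, m1`, the arguments would already have been expanded to `ymm0` and `ymm1` before the macro could pick the xmm view.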
[23:22:18 CEST] <BBB> it's a little shitty yes
[23:22:27 CEST] <BBB> assembly is never super-pretty :(
[23:22:41 CEST] <BBB> especially if you only have 8 floats instead of 32 uint8_ts <evil>
[23:22:44 CEST] <BBB> :-p
[23:22:47 CEST] <atomnuker> it was before I decided I could use huge registers
[23:22:49 CEST] <rcombs> atomnuker: pretty much anything _writing_ to an xmm will zero the upper 128 bits if V-prefixed and leave them unmodified if not
[23:23:13 CEST] <durandal_170> fft
[23:23:15 CEST] <rcombs> while reading is nondestructive
[23:23:20 CEST] <atomnuker> now I can't think about it looking ugly, I have 8 floats in 1 register
[23:23:42 CEST] <atomnuker> durandal_170: I plan to some day, fft64 doesn't have simd and it's generated by lesser ffts
[23:23:57 CEST] <atomnuker> (this is for opus/aac's mdct15 which uses a 15-point fft)
[23:24:14 CEST] <atomnuker> 15 complex values -> 30 floats
[23:24:37 CEST] <rcombs> (also, mixing VEX-prefixed vector instructions with non-VEX-prefixed ones is a very bad idea, but x86inc makes it easy not to do that)
[23:24:46 CEST] <rcombs> oh this reminds me
[23:24:57 CEST] <nevcairiel> if you're using floats, shouldnt it be vinsertf128 instead of the "i" variant to avoid the float/int context thing?
[23:25:10 CEST] <rcombs> does nasm have the same issue as yasm, where if your code is macro-ized, it's impossible to get any useful line number information out of diagnostics
[23:25:20 CEST] <jamrial> yes, and also because the integer ones are avx2, not avx
[23:25:51 CEST] <rcombs> even at compile-time, if you fucked up operands yasm would tell you there was an error on the line the macro was invoked on, rather than the line in the macro that was wrong
[23:25:54 CEST] <Gramner> yes, f for floats
[23:26:00 CEST] <atomnuker> rcombs: nope, it works fine for me
[23:26:07 CEST] <rcombs> GOOD
[23:26:09 CEST] <atomnuker> prints the error inside the macro on the correct line
[23:26:16 CEST] <rcombs> EXCELLENT
[23:26:38 CEST] <BBB> what's nasm?
[23:26:46 CEST] <BBB> are we on that again?
[23:26:50 CEST] <BBB> is it in macports?
[23:26:59 CEST] <nevcairiel> yasm is dead
[23:27:08 CEST] <nevcairiel> nasm supports new things like avx512
[23:27:13 CEST] <BBB> nasm @2.13.01 (lang)
[23:27:15 CEST] <rcombs> >macports
[23:27:16 CEST] <BBB> is that good enough?
[23:27:20 CEST] <Gramner> yes
[23:27:21 CEST] <rcombs> is that actually a thing people still use
[23:27:23 CEST] <nevcairiel> yes thats the most recent version
[23:27:28 CEST] <BBB> rcombs: what should I use then?
[23:27:29 CEST] <rcombs> isn't everyone on brew now
[23:27:40 CEST] <rcombs> BBB: brew sucks about 95% less than port and fink
[23:27:42 CEST] <BBB> I can try brew, I don't know anything dude :D
[23:27:57 CEST] <BBB> I use stuff because it worked 100 years ago and I'm too lazy to try anything else
[23:28:32 CEST] <rcombs> oh, also
[23:28:43 CEST] <rcombs> does nasm write debug information in Mach-O
[23:28:59 CEST] <rcombs> and function bounds, and such
[23:29:10 CEST] <BBB> does nasm have the same CLI invocation as yasm?
[23:29:21 CEST] <Gramner> it has mach-o dwarf debug support
[23:29:22 CEST] <BBB> nope
[23:29:40 CEST] <rcombs> nice
[23:29:43 CEST] <nevcairiel> its similar-ish, but ffmpeg build system should support both
[23:30:05 CEST] <Gramner> but you can make command lines that work with both nasm and yasm
[23:33:29 CEST] <jamrial> does anyone know what's this DBG stuff in the build system that creates .dbg.asm files and how do i enable/test it?
[23:34:45 CEST] <iive> jamrial: it is like gcc -S, runs only the pre-processor, and then strips the debug information from the result file
[23:35:07 CEST] <jamrial> how do i enable it? a configure option?
[23:35:15 CEST] <iive> so in case of error, you get a meaningful line for the error, instead of the address of the big macro
[23:35:32 CEST] <iive> jamrial: make variable
[23:37:37 CEST] <jamrial> iive: which one? DBG doesn't seem to work
[23:37:46 CEST] <iive> let me rephrase: if compilation fails, yasm and old nasm would usually give the line where the error happened, which in our case would be the single line invoking the macro that contains the implemented function.
[23:37:53 CEST] <iive> make DBG=1 ?
[23:38:12 CEST] <Compn> iive : is it a nasm/yasm flag ? 
[23:38:22 CEST] <iive> Compn: no
[23:38:59 CEST] <rcombs> sounds like that's fixed in current nasa?
[23:39:01 CEST] <rcombs> *nasm
[23:39:12 CEST] <rcombs> (going by what atomnuker said)
[23:39:19 CEST] <rcombs> I used to sometimes do that manually
[23:39:35 CEST] <jamrial> iive: ah, i see, i need to call "make whatever.dbg.asm DBG=1", not "make whatever.o DBG=1" for it
[23:39:39 CEST] <jamrial> thanks
[23:39:50 CEST] <nevcairiel> are you sure you even need the DBG=1 then
[23:40:13 CEST] <nevcairiel> you can do like whatever.S/i as well to get gcc to give you various intermediates
[23:40:15 CEST] <iive> it calls yasm -e , this just expands all macros and writes the result in a temp file. then strips that file from debugging information, because yasm actually keeps like data in that file
[23:40:45 CEST] <jamrial> nevcairiel: apparently not because the command is hardcoded to use DEFAULT_YASMD instead of YASMD, which is what DBG sets
[23:40:46 CEST] <iive> then compiles the stripped file, so if some instruction fails compilation, you get the line number in the temp stripped file.
[23:41:51 CEST] <iive> s/like data/ line data
[23:50:40 CEST] <wm4> can't wait until nasm dies and yasm undies again
[23:51:06 CEST] <durandal_170>   w h y ?
[23:51:36 CEST] <jamrial> yasm-ng
[23:51:51 CEST] <BBB> nasm doesn't seem to deal with -I and %includes in subpaths correctly
[23:51:52 CEST] <iive> zasm!
[23:51:57 CEST] <iive> for zmm registers
[23:52:06 CEST] <BBB> I have -I .. and %include path/to/bla.asm
[23:52:12 CEST] <BBB> with bla.asm being in ../path/to/
[23:52:14 CEST] <BBB> and it can't find it
[23:52:17 CEST] <BBB> yasm finds it correctly
[23:52:23 CEST] <BBB> so that's the end of me using nasm for the day
[23:52:33 CEST] <durandal_170> write patch
[23:52:39 CEST] <BBB> why
[23:52:42 CEST] <BBB> yasm works correctly ;)
[23:52:49 CEST] <BBB> I'm not activist about which assembler I use
[23:52:52 CEST] <BBB> it's a tool, not a goal
[23:52:52 CEST] <iive> write bugreport
[23:52:55 CEST] <atomnuker> avx512
[23:52:56 CEST] <BBB> I don't care about yasm vs nasm
[23:53:00 CEST] <BBB> people tell me nasm is amazing
[23:53:02 CEST] <BBB> I test it, it fails
[23:53:04 CEST] <BBB> I complain
[23:53:07 CEST] <BBB> and move on with my life
[23:53:08 CEST] <BBB> ;)
[23:53:19 CEST] <durandal_170> but yasm is d e a d
[00:00:00 CEST] --- Wed Jun 21 2017