[Ffmpeg-devel-irc] ffmpeg-devel.log.20191209

burek burek at teamnet.rs
Tue Dec 10 03:05:04 EET 2019


[03:06:12 CET] <linjie> For CBR, bitrate and maxrate should be set and have the same value
[03:06:55 CET] <cehoyos> Sounds neither necessary nor sufficient
[06:30:03 CET] <linjie> Is there any test in fate to check whether the result of C and asm matches?
[06:35:11 CET] <rcombs> linjie: checkasm
[07:01:41 CET] <linjie> thx, verified with checkasm --test=hevc_add_res
[11:51:29 CET] <durandal_1707> https://pastebin.com/QMjRhye6 ---> what is the purpose of these multiplications prior to idct8 (and the multiplication with sqrt(2))?
[12:58:57 CET] <durandal_1707> Lynne: i tried a bunch of stuff and 32 blocks decode only with blocky artifacts
[14:14:35 CET] <durandal_1707> does the order matter if i do 8x4 idct, so that idct8 comes before idct4?
[14:27:02 CET] <BBB> durandal_1707: in mathematical sense it doesn't matter, but the spec typically requires it to be done in a particular order to be normatively consistent
[14:27:52 CET] <durandal_1707> does it make output different?
[14:27:56 CET] <BBB> yes
[14:27:59 CET] <BBB> but not by a lot
[14:28:22 CET] <BBB> if you have visually significant artifacts, that's probably not the reason
[14:28:30 CET] <BBB> if you have rounding errors (off-by-ones), this may be why
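A toy illustration of the point about pass order. The 2-tap filter below is a hypothetical stand-in, not an IDCT and not anything from the codec being discussed; it was picked only because it rounds between passes, the way fixed-point transforms do. All function names are made up for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy separable pass with a rounding shift. This is NOT an IDCT; it only
 * mimics the intermediate rounding of a fixed-point transform. Inputs are
 * required to be non-negative so that >> is well defined. */
static void toy_pass(int *x, int stride)
{
    int a = x[0], b = x[stride];

    assert(a >= 0 && b >= 0);
    x[0]      = (3 * a + b + 2) >> 2;
    x[stride] = (a + 3 * b + 2) >> 2;
}

/* 2x2 block stored row-major: rows have stride 1, columns have stride 2. */
static void rows_then_cols(int blk[4])
{
    toy_pass(blk, 1);
    toy_pass(blk + 2, 1);
    toy_pass(blk, 2);
    toy_pass(blk + 1, 2);
}

static void cols_then_rows(int blk[4])
{
    toy_pass(blk, 2);
    toy_pass(blk + 1, 2);
    toy_pass(blk, 1);
    toy_pass(blk + 2, 1);
}

/* Largest absolute per-sample difference between two blocks. */
static int max_abs_diff(const int *a, const int *b, int n)
{
    int worst = 0;

    for (int i = 0; i < n; i++) {
        int d = abs(a[i] - b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

For the input block {0, 6, 0, 0}, rows-first gives {2, 4, 1, 1} and columns-first gives {1, 4, 1, 2}: the two orders disagree by exactly one in two positions, which is the off-by-one class of error rather than a visible artifact.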
[14:30:16 CET] <durandal_1707> i can't get the 8x4 idct to work; the codec uses an 8x4 idct with the scans of two 4x4 blocks joined together, so the table goes like 0, 4, 8, 12, 1, 5...
[14:31:05 CET] <kurosu> that's interlaced? isn't that the dv one ?
[14:31:22 CET] <durandal_1707> 0,  1, 2, 3,   4, 5, 6, 7,
[14:31:22 CET] <durandal_1707> 8,  9,10,11,  12,13,14,15,
[14:31:22 CET] <durandal_1707> 16,17,18,19,  20,21,22,23,
[14:31:23 CET] <durandal_1707> 24,25,26,27,  28,29,30,31,
[14:31:48 CET] <BBB> I can count to thirty-one
[14:31:53 CET] <BBB> :D
[14:32:50 CET] <durandal_1707> yes, but do the zigzag of the 4x4 left table and of the right table, interleave them, and you get this 8x4 table
[14:34:45 CET] <durandal_1707> kurosu: i do not think this is interlaced, the two 4x4 blocks are combined and the idct is done on the result
[14:36:08 CET] <kurosu> in any case, maybe look up ff_simple_idct248_put
[14:36:57 CET] <durandal_1707>  0,   4,   8,  12
[14:36:57 CET] <durandal_1707>  1,   5,   2,   6
[14:36:57 CET] <durandal_1707>  9,  13,  16,  20
[14:36:57 CET] <durandal_1707> 24,  28,  17,  21
[14:36:57 CET] <durandal_1707> 10,  14,   3,   7
[14:36:59 CET] <durandal_1707> 11,  15,  18,  22
[14:37:02 CET] <durandal_1707> 25,  29,  26,  30
[14:37:04 CET] <durandal_1707> 19,  23,  27,  31
[14:37:14 CET] <kurosu> ah
[14:37:30 CET] <kurosu> I skipped your pasted array indicating the order
[14:37:33 CET] <kurosu> indeed unrelated
[14:44:02 CET] <durandal_1707> https://0x0.st/zUA5.png
[14:44:14 CET] <durandal_1707> it's blocky and has no detail
[14:48:10 CET] <durandal_1707> https://pastebin.com/gZeGTETs
[14:48:19 CET] <durandal_1707> this is idct84_12bit
[15:04:42 CET] <BBB> so, I don't know how you typically do this sort of thing, but I typically try to have a reference function to compare it to (this can be asm from a binary)
[15:04:49 CET] <BBB> so that you can run it and compare variables and states
[15:05:02 CET] <BBB> and then just run it and compare stuff
[15:05:08 CET] <BBB> not over a frame, but just over one block
[15:05:12 CET] <BBB> like checkasm
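A minimal sketch of the single-block comparison described above. The function-pointer type and both transforms are hypothetical stand-ins (a real setup would call the reference routine and the reimplementation instead); this is not checkasm's own API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCK 64

typedef void (*block_fn)(int16_t *block, size_t n);

/* Hypothetical reference: stands in for the known-good routine. */
static void ref_transform(int16_t *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i] = (int16_t)(block[i] * 2);
}

/* Hypothetical reimplementation with a deliberate error at index 5. */
static void buggy_transform(int16_t *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i] = (int16_t)(block[i] * 2 + (i == 5));
}

/* Runs both functions on copies of the same input block and returns the
 * index of the first differing output sample, or -1 if they agree. */
static int first_mismatch(block_fn ref, block_fn impl,
                          const int16_t *input, size_t n)
{
    int16_t a[MAX_BLOCK], b[MAX_BLOCK];

    assert(n <= MAX_BLOCK);
    memcpy(a, input, n * sizeof(*a));
    memcpy(b, input, n * sizeof(*b));
    ref(a, n);
    impl(b, n);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return (int)i;
    return -1;
}
```

Working on one block at a time keeps the failing index small enough to inspect by hand, which is the reason given above for not comparing whole frames.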
[15:06:34 CET] <Lynne> this is decompiled code, and small tables like this zigzag are usually inlined and not in rodata, it's probably impractical
[15:07:41 CET] <durandal_1707> well, it's full of heavily optimized simd that multiplies coefficients before calling idct
[15:09:17 CET] <Lynne> from what I can see from the picture there are no higher level coefficients, everything is a simple gradient
[15:15:21 CET] <durandal_1707> yeah, AC has a bunch of zeros
[15:22:57 CET] <durandal_1707> hmm, i think i know why there are artifacts: our idct does a different permutation with zigzag_direct
[15:23:38 CET] <durandal_1707> i just need to figure out the right table for 32 elements, one that will give permuted output of 32 elements
[15:24:42 CET] <durandal_1707> why is this permuted at all?
[15:38:53 CET] <BBB> because top/left coefs are more likely to be non-zero than higher-frequency ones
[15:39:14 CET] <BBB> so it aligns likelihood (statistical) of being non-zero with coding order
[15:39:22 CET] <BBB> which makes it cheaper from a bits-spent perspective
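A small sketch of why the scan order saves bits. It assumes a coder that stops after the last non-zero coefficient, which is an assumption for illustration and not something stated about this codec; the 4x4 zigzag table and function name are likewise illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* A conventional 4x4 zigzag: scan index -> raster position. */
static const uint8_t scan_zigzag_4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Plain raster order, for comparison. */
static const uint8_t scan_raster_4x4[16] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};

/* Number of coefficients up to and including the last non-zero one when
 * the block is visited in the given scan order. */
static int coded_length(const int16_t block[16], const uint8_t scan[16])
{
    int len = 0;

    for (int i = 0; i < 16; i++) {
        assert(scan[i] < 16);
        if (block[scan[i]] != 0)
            len = i + 1;
    }
    return len;
}
```

For a block whose only non-zero coefficients sit at raster positions 0, 1, 4 and 8 (the top-left corner), raster order has to cover 9 coefficients while the zigzag covers 4.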
[15:39:55 CET] <durandal_1707> so how do i permute the above table (0, 4..) so that i get element indices less than 32?
[15:41:29 CET] <BBB> I don't understand the question, they are all under 32 already
[15:41:41 CET] <BBB> or do you mean how to invert that table?
[15:41:58 CET] <BBB> I don't see why a decoder should need that?
[15:42:21 CET] <durandal_1707> decoder uses direct zigzag for 8x8
[15:42:36 CET] <durandal_1707> ffmpeg uses permuted one of this i think
[15:42:54 CET] <BBB> ffmpeg uses permuted one for what?
[15:43:05 CET] <durandal_1707> for dct 8x8
[15:43:07 CET] <BBB> no
[15:43:10 CET] <BBB> idct is not permuted
[15:43:13 CET] <BBB> idct is in regular order
[15:43:16 CET] <BBB> coefficient coding is permuted
[15:43:53 CET] <BBB> so you "un-permute" it during coefficient decoding
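A minimal sketch of "un-permuting" while decoding coefficients: each decoded value is written straight to its raster position through the scan table, so the idct sees a block in regular order. The table is a conventional 4x4 zigzag and the function name is hypothetical; neither is taken from the codec under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A conventional 4x4 zigzag: coding index -> raster position. */
static const uint8_t zigzag_4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Places the first n_coded coefficients, given in coding order, at their
 * raster positions; everything not coded stays zero. */
static void place_coeffs(int16_t block[16], const int16_t *coded,
                         int n_coded, const uint8_t scan[16])
{
    assert(n_coded >= 0 && n_coded <= 16);
    memset(block, 0, 16 * sizeof(*block));
    for (int i = 0; i < n_coded; i++)
        block[scan[i]] = coded[i];
}
```

With this in the coefficient loop, no separate reordering step is needed before the transform.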
[15:44:07 CET] <durandal_1707> if i use ff_init_scantable(s->idsp.idct_permutation, &s->scan[1], hh_scan); I get index elements > 31
[15:44:20 CET] <durandal_1707> and that does not work as it reads only 32 of them
[15:44:55 CET] <BBB> what codec is this? just don't use ff_init_scantable :)
[15:45:08 CET] <BBB> vp9 and vp8 and av1 don't use it
[15:45:43 CET] <BBB> init_scantable only does 8x8
[15:46:11 CET] <Lynne> ^^^
[15:47:00 CET] <durandal_1707> without permutation i get invalid results for Y plane
[15:47:01 CET] <BBB> vc1 uses 8x4, and doesn't use init_scantable
[15:47:12 CET] <BBB> show coef decoding code
[15:47:22 CET] <BBB> you can inline the permutation in the coef code
[15:47:55 CET] <BBB> I can explain the point of init_scantable, but if you're going to use your own custom idct implementation, it's not relevant
[15:48:08 CET] <BBB> init_scantable is for simd implementations that use a different permutation than the C
[15:48:14 CET] <BBB> typically a partial transpose or full transpose
[15:48:23 CET] <BBB> so the idct transposes once instead of twice, which saves a lot of cycles
[15:48:36 CET] <BBB> but if you have a custom idct, none of that is relevant
[15:48:40 CET] <BBB> so then you just don't use it
[15:49:12 CET] <BBB> the idea was that you can keep "old" asm for neon while implementing "new" (depending-on-transpose) asm for x86
[15:49:14 CET] <durandal_1707> i use ffmpeg idct 12bit
[15:49:27 CET] <BBB> no you don't
[15:49:29 CET] <BBB> ffmpeg has no 4x8
[15:49:49 CET] <durandal_1707> well, for Y plane, I use 8x8
[15:50:12 CET] <durandal_1707> and that one looks to need it, or table is transposed
[15:50:44 CET] <durandal_1707> for pseudo U/V i wrote 8x4 hack
[15:52:01 CET] <BBB> so, you can't use init_scantable for that, you'll have to make a custom permutation table that is not 8x8 (which init_scantable assumes)
[15:53:17 CET] <BBB> or if the assumption is that the idct is C anyway, just tell it the permutation is identity
[15:53:23 CET] <BBB> and then it does nothing
[15:53:34 CET] <BBB> (except overwriting/reading your array by 32, but that's probably ok)
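A sketch of the idea behind composing a scan table with an idct permutation, as discussed above. It is an illustration with hypothetical function names, not FFmpeg's actual code, and which permutation was active in the build under discussion is not stated in the log; the full 8x8 transpose is used here only because it is one of the cases mentioned.

```c
#include <assert.h>
#include <stdint.h>

/* Permutation for an idct that expects a transposed 8x8 block:
 * raster index (row * 8 + col) maps to (col * 8 + row). */
static void make_transpose_perm(uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        perm[i] = (uint8_t)(((i & 7) << 3) | (i >> 3));
}

/* Identity permutation: composing with it leaves the scan unchanged. */
static void make_identity_perm(uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        perm[i] = (uint8_t)i;
}

/* The decoder then writes coefficient i to permuted[i] = perm[scan[i]]. */
static void permute_scan(uint8_t *permuted, const uint8_t *scan, int n,
                         const uint8_t perm[64])
{
    for (int i = 0; i < n; i++) {
        assert(scan[i] < 64);
        permuted[i] = perm[scan[i]];
    }
}
```

Applied to the start of the 8x4 table from earlier (0, 4, 8, 12), the 8x8 transpose yields 0, 32, 1, 33. That would be one possible explanation for indices above 31 appearing in a 32-entry table, while the identity permutation returns the table untouched.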
[15:58:21 CET] <durandal_1707> can't i derive the correct table from the one already posted above?
[15:59:58 CET] <BBB> it's likely the one you just gave: 0,4,8,12,1,5,...
[16:00:26 CET] <BBB> I mean, it depends on what code we're talking about
[16:00:41 CET] <BBB> if you're reading two sets of 4x4 coefs and do a quasi-4x8 uv idct
[16:02:02 CET] <BBB> then you need to write 0,8,1,2,9,16,24,17,10,3,11,18,25,26,19,27 for U
[16:02:20 CET] <BBB> and x+4 in V
[16:02:29 CET] <BBB> and that gives you the correct thing for a quasi-4x8 combined u+v idct
[16:02:33 CET] <BBB> but it's a little weird
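A short sketch tying the two tables together: taking the 16-entry U list quoted above, using x+4 for V, and alternating U and V entries reproduces the 32-entry table pasted earlier in the log. That the coefficients alternate U, V, U, V in coding order is an assumption read off that table, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Positions for the U coefficients inside an 8-wide block, as quoted. */
static const uint8_t u_scan[16] = {
    0, 8, 1, 2, 9, 16, 24, 17, 10, 3, 11, 18, 25, 26, 19, 27
};

/* Builds the combined table: even entries are U, odd entries are V,
 * where each V position is the matching U position shifted right by 4. */
static void build_interleaved_scan(uint8_t out[32])
{
    for (int i = 0; i < 16; i++) {
        assert(u_scan[i] + 4 < 32);
        out[2 * i]     = u_scan[i];
        out[2 * i + 1] = (uint8_t)(u_scan[i] + 4);
    }
}
```

The result starts 0, 4, 8, 12, 1, 5, 2, 6 and ends 19, 23, 27, 31, matching the pasted table entry for entry.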
[16:30:35 CET] <BBB> thinking about it more, you can't use a 4x8 as quasi-idct as is, you need to do two neighbour-"interleaved" idct4s, not one idct8
[16:30:38 CET] <BBB> it's not the same thing
[16:30:43 CET] <BBB> unless the bitstream is weird
[16:30:49 CET] <BBB> I really don't know what you're trying to do :D
[16:31:28 CET] <durandal_1707> BBB: they do indeed do weird things; the float path uses an 8x8 idct for two sets of 32 coefficients
[16:32:24 CET] <Lynne> BBB: that sounds too weird, there's just no reason to group 2 4x4 idcts
[16:32:48 CET] <Lynne> but if they do then the zigzags make sense
[16:36:00 CET] <durandal_1707> BBB: feel free to look if you have time: https://pastebin.com/QMjRhye6
[16:36:36 CET] <BBB> ooo float code
[16:37:14 CET] <j-b> /j ffmpeg-meeting
[16:37:17 CET] <BBB> is this audio or video? you still haven't told me what this is supposed to do
[16:37:34 CET] <durandal_1707> BBB: Blackmagic RAW (.braw)
[16:37:42 CET] <durandal_1707> video
[16:38:33 CET] <BBB> auVar26 = vpermq_avx2(auVar26,0xd8);
[16:38:38 CET] <BBB> I can see someone loving his lanes there
[16:38:47 CET] <BBB> some hacks are omni-present
[16:39:17 CET] <durandal_1707> i have integer only variant too
[16:40:30 CET] <durandal_1707> https://pastebin.com/ejEGuWe6
[16:41:20 CET] <durandal_1707> i do not get why integer variant does simd _before_ extracting of coefficients
[16:59:45 CET] <Illya> durandal_1707: remember to join meeting
[17:00:25 CET] <j-b> https://meet.google.com/eqm-icti-vpr
[18:25:01 CET] <durandal_1707> BBB: any ideas?
[18:28:08 CET] <BBB> not yet
[18:28:31 CET] <BBB> I was participating in the meeting so didn't look at it yet
[20:10:40 CET] <williamto> Hi guys, today I drilled down to the `ff_hscale_8_to_15_neon` function for aarch64 system and performed data prefetching for load instructions
[20:11:06 CET] <williamto> https://imgur.com/a/jJPTh9r
[20:11:55 CET] <williamto> at first the overhead decreased, but after re-building and running the same command again with profiling, the overhead increased. Does anybody know why it's having this behaviour?
[20:12:24 CET] <williamto> shouldn't prefetching the data into cache make load instructions *slightly* faster in general?
[20:34:36 CET] <williamto> has there been any discussion before about implementing data prefetching for load instructions?
[20:36:20 CET] <nevcairiel> most modern cpus do automatic prefetching when you access data in predictable patterns
[20:36:58 CET] <nevcairiel> there is no magic bullet to improve memory load bottlenecks
[20:37:01 CET] <kurosu> williamto: drop it, you've hit a bottleneck
[20:37:05 CET] <kurosu> yeah, exactly
[20:38:01 CET] <kurosu> that, and whatever benefit you get in local micro-benchmarks often disappears once inside a program where different parts compete for the memory bandwidth
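A C-level analogue of the experiment described above, as a sketch only: the original change was made in aarch64 NEON assembly, whereas this uses the GCC/Clang `__builtin_prefetch` builtin on a plain linear loop. The prefetch distance is an arbitrary assumption and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary look-ahead in elements; an assumption, not a tuned value. */
#define PREFETCH_DISTANCE 64

static int64_t sum_plain(const int32_t *v, size_t n)
{
    int64_t s = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

static int64_t sum_prefetch(const int32_t *v, size_t n)
{
    int64_t s = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only: read access (0), keep in all cache levels (3). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&v[i + PREFETCH_DISTANCE], 0, 3);
#endif
        s += v[i];
    }
    return s;
}
```

Both functions return the same sum; any timing difference depends on the machine. On a sequential access pattern like this one the hardware prefetcher is usually already ahead of the loop, so the explicit hint tends to add instructions without removing stalls, in line with the replies above.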
[20:42:53 CET] <williamto> oh ok, I understand now.
[20:44:25 CET] <williamto> Thank you guys for helping me!
[22:17:32 CET] <burek123> test
[22:17:41 CET] <burek123> oh great :)  hi all :)  long time no see :)
[22:17:55 CET] <burek123> can anyone please +v my nick on #ffmpeg so i can ask a question :)
[22:18:12 CET] <JEEB> I don't think you need that for a question? as long as you are registered
[22:18:28 CET] <burek123> oh, right, i forgot about it, thanks man :)
[22:18:29 CET] <JEEB> since at one point publicly known channels just got a buzzload of spam
[22:18:43 CET] <JEEB> and that was the simplest way apparently to limit that
[22:18:52 CET] <burek123> sure, makes sense
[00:00:00 CET] --- Tue Dec 10 2019

