[Ffmpeg-devel-irc] ffmpeg-devel.log.20140524

Sun May 25 02:05:02 CEST 2014

[02:10] <cone-205> ffmpeg.git James Almer master:7538ad224835: x86/hevc_deblock: improve chroma functions register allocation
[02:15] <cone-205> ffmpeg.git Lou Logan master:9eaa8c22bc40: MAINTAINERS: remove myself as website maintainer
[03:43] <cone-205> ffmpeg.git Christophe Gisquet master:f0aca50e0b21: x86: hpeldsp: implement SSE2 versions
[03:52] <cone-205> ffmpeg.git Christophe Gisquet master:9722a6a3f35c: x86: hpeldsp: implement SSE2 put_pixels16_xy2
[05:09] <cone-205> ffmpeg.git Billy Shambrook master:308188be3412: Add metadata injection to blackdetect
[11:10] <cone-248> ffmpeg.git Clément Bœsch master:cba92a222615: avformat/vobsub: do not create empty streams.
[13:33] <cone-248> ffmpeg.git Nidhi Makhijani master:8692e6284f51: rdt: check malloc calls
[13:33] <cone-248> ffmpeg.git Michael Niedermayer master:726316240bcc: Merge commit '8692e6284f5169257a537c8fc25addf32fc67c87'
[13:59] <ubitux> why is there a call to emms_c() at the beginning of ff_faandct()?
[14:00] <ubitux> ah mmh well it makes sense somehow
[15:52] <ubitux> why do we have reference/simple (i)dct only for 8x8 blocks?
[15:55] <ubitux> we have dozens of dct implementation ± accurate for 8x8 but i can't see any for 4x4, 16x16, etc, except in codec specific code
[15:56] <kierank> probably because those are defined to be bitexact
[16:02] <ubitux> AFAICS, we have: jfdctint which defines some fdct not that much accurate based on aan, jfdctfst.c is the aan one, so not yet bitexact (but more accurate than the previous one), then we have a float version of the aan (faandct.c), then we have the idct in simple_idct*, then we have the reference in dct*
[16:02] <ubitux> and then, there are all the codec specific ones
[16:02] <ubitux> did i miss any?
[16:03] <ubitux> ah, and all of those are 8x8
[16:04] <ubitux> i don't know which of those have an asm version btw
[16:45] <ubitux> ah there is jrevdct.c for some more matching idct
[16:46] <Daemon404> there should be matching asm
[16:46] <Daemon404> theyre from ijg, so you could swap out jpeg-turbo asm
[16:46] <Daemon404> no?
[16:50] <ubitux> no idea.
[16:53] <ubitux> btw, does anyone know some algorithm names for making a sliding window 2d dct faster?
[16:53] <ubitux> like i mean, there is just one column of the input changing after each iteration (of course shifted in one direction)
[16:53] <ubitux> nothing can be re-used between these dct?
[17:00] <cone-248> ffmpeg.git Christophe Gisquet master:81aa0f4604f9: x86: hpeldsp: implement SSSE3 version of _xy2
[18:59] <kurosu_> michaelni: I have some questions about mpegvideo, in particular decoding and mpeg2
[18:59] <kurosu_> my intent is to do like all cool decoders, take into account the last coeff index
[19:00] <kurosu_> (so as to apply a DC only, but one may want to implement a pure vertical or horizontal version)
[19:00] <kurosu_> in the idct
[19:01] <kurosu_> in profiling idct is now 25% of the decoding time and I guess most of the time is spent on vlc decoding then
[19:01] <kurosu_> however, that last coefficient count is wrong
[19:02] <iive> vlc is in the dequant functions.
[19:02] <nevcairiel> as if you could turn mpeg2 into a cool decoder
[19:02] <nevcairiel> psh
[19:03] <kurosu_> yeah, there's dxva/vdpau/whatever for that
[19:03] <kurosu_> but I specialize in useless stuff
[19:04] <michaelni> kurosu_, with mpeg2 be careful as the dequantization can flip the LSB of coeff 63 even when coeff 63 is 0
[19:04] <kurosu_> michaelni: already noticed that
[19:04] <kurosu_> the dequant functions are not taking this into account to update the index
[19:04] <kurosu_> although the alternate scan forces 63 but at least the index is higher than the real one
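The LSB flip michaelni describes is MPEG-2's mismatch control: after inverse quantization the 64 coefficients are summed, and if the sum is even the least significant bit of coefficient 63 is toggled. A minimal sketch of why the last index has to follow it, assuming a block stored in plain raster order with no IDCT permutation; `mismatch_control` and its `last_index` parameter are hypothetical names for illustration, not FFmpeg's code:

```c
#include <stdint.h>

/* Hypothetical helper: apply MPEG-2 mismatch control to a dequantized
 * 8x8 block and return a last-coefficient index that stays valid.
 * ASSUMPTION: block[] is in raster order, so coefficient 63 is F[7][7]. */
static int mismatch_control(int16_t block[64], int last_index)
{
    int sum = 0;

    for (int i = 0; i < 64; i++)
        sum += block[i];

    if ((sum & 1) == 0) {
        /* Even sum: toggle the LSB of coefficient 63 (odd -> -1, even -> +1). */
        block[63] ^= 1;
        /* Conservative: report 63 even if the toggle turned it back to 0. */
        last_index = 63;
    }
    return last_index;
}
```

With this, a block that the bitstream coded as DC-only can still end up with a nonzero coefficient 63, so a DC-only shortcut keyed on the parsed index alone would not be bit-exact.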
[19:05] <iive> if I remember correctly, the last coefficient is actually in the linear sense, so after the zig-zag when it is put in 2d matrix it might be somewhere else completely.
[19:05] <kurosu_> so my question is, do you know if there could be other issue
[19:06] <kurosu_> on the other hand, one would expect idx=1 to mean "dc" but the index is often completely different
[19:07] <michaelni> know? no
[19:07] <iive> idx=0 is dc
[19:07] <michaelni> iive, yes
[19:07] <kurosu_> err, yeah off-by-one error
[19:07] <kurosu_> anyway, my comment still applies
[19:08] <iive> afaik, idct routines check if all coefficients in a row/col are zero, and skip processing it if that is the case
[19:09] <kurosu_> maybe the c implementation, I haven't checked the inline asm
[19:09] <iive> also, the matter is a little bit more complicated, as to help parallelization, some mmx variants use strange permutations.
[19:10] <kurosu_> ok, important to know
[19:10] <kurosu_> but I don't think it has an impact for dc-only
[19:11] <kurosu_> in the add/put dct functions, I recomputed the last index, and if 0, did an add offset and clamp
[19:11] <kurosu_> which makes me think that every codec is reimplementing that dc-only version, which is kind of inefficient
[19:12] <kurosu_> maybe I could start there instead: convert all those to use a common function, and then codec-specific versions only compute the average pixel offset from the dc coeff
[19:13] <kurosu_> and pass it to that function
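The split kurosu_ proposes can be sketched as one shared "add offset and clamp" routine plus a small codec-specific function that turns the DC coefficient into a pixel offset. All names here are hypothetical, and the `(dc + 4) >> 3` rounding is only an assumption for a transform whose DC-only output is dc/8; each real codec defines its own scaling:

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 0..255 pixel range. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical shared helper: add one offset to every pixel of an
 * 8x8 block and clamp the result. */
static void add_dc_clamped_8x8(uint8_t *dst, ptrdiff_t stride, int offset)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = clip_uint8(dst[x] + offset);
        dst += stride;
    }
}

/* Hypothetical codec-specific part. ASSUMPTION: DC-only output is dc/8,
 * rounded to nearest; relies on arithmetic right shift for negative dc. */
static int dc_to_offset(int dc)
{
    return (dc + 4) >> 3;
}
```

A codec would then call `add_dc_clamped_8x8(dst, stride, dc_to_offset(block[0]))`, and only the shared routine would need per-architecture assembly.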
[19:14] <iive> well, it depends...
[19:15] <iive> doning a check in the codec before calling function by a pointer might be faster than doing the check in the given function.
[19:18] <iive> some cpu can speculatively execute or fetch instructions in advance, even when there is function jump. however I'm not sure that would be done if the function address is not fixed, should be read from memory or is held in register.
[19:18] <iive> so, better benchmark it.
[19:18] <iive> if you want ideas....
[19:19] <iive> i've thought is donig dequantization in SIMD could be beneficial.
[19:20] <iive> the idea is that you need 3 linear arrays. in the first you write the coefficient. in the second you write the runlength, or coefficient position. in the third you write the reciprocal quantization matrix coefficients.
[19:21] <iive> then you can do SIMD on the 1st and 3rd arrays.
[19:21] <iive> ideally, the first stage of the idct could be done without having to unpack the coefficients.
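iive's three-array layout can be sketched in plain C: the multiply runs over two contiguous arrays, which is the part a compiler or hand-written SIMD can vectorize, and the scatter into the 8x8 block is a separate scalar pass. This is a simplified illustration with hypothetical names; it ignores rounding, saturation and the exact MPEG dequantization formulas:

```c
#include <stdint.h>

/* Hypothetical sketch of the three-array idea.
 *   level[i] : i-th nonzero coefficient as parsed from the bitstream
 *   pos[i]   : its position in the 8x8 block (0..63)
 *   qmul[i]  : dequantization multiplier for that position, gathered
 *              while parsing (the "third array")
 * ASSUMPTION: block[] was zeroed by the caller. */
static void dequant_packed(int16_t *level, const uint8_t *pos,
                           const int16_t *qmul, int n, int16_t block[64])
{
    /* Contiguous element-wise multiply: the SIMD-friendly part. */
    for (int i = 0; i < n; i++)
        level[i] = (int16_t)(level[i] * qmul[i]);

    /* Scatter the dequantized values to their block positions. */
    for (int i = 0; i < n; i++)
        block[pos[i]] = level[i];
}
```

Whether this beats dequantizing while parsing the VLCs would have to be benchmarked, since blocks with few nonzero coefficients leave little work for the vector loop.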
[19:24] <iive> guh, why i'm writing donig instead of doing....
[19:25] <clever> and doning
[19:25] <iive> yeh... 
[19:27] <iive> another idea.
[19:27] <kurosu_> actually, dequantization is ideally done when either parsing vlcs or doing idct, but the former is simpler
[19:29] <iive> in h264 col and row operations are the same, so they just do a permutation and repeat the function.
[19:30] <iive> older mpeg idct doesn't do that.
[19:31] <iive> so, different route is to do the horizontal stage in parallel for a number of different blocks.
[19:31] <kurosu_> my biggest issue now is more that all these add clamped functions / add_dc across vp3/vp8/rv40/vc1 x86 asm are the same concept
[19:31] <kurosu_> let's go for the easiest and practical path maybe ?
[19:32] <iive> i'm kind of afraid that the easiest paths might have already been tried :}
[19:39] <iive> but it doesn't heart to try
[19:39] <iive> hurt
[19:39] <iive> gah...
[19:39] <kurosu_> the biggest roadblock being the other archs where I'd need to add equivalent assembly
[19:42] <iive> you can always do C wrapper that does the dc check and fallbacks to the old existing asm.
[19:43] <iive> should be better than falling back to pure C, but there is always the danger of being too good and nobody bothering writing the asm ;)
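The wrapper iive suggests could look roughly like this: a C function checks the last index, handles the DC-only case itself, and otherwise forwards to whatever full IDCT the architecture already has. The function-pointer type, the names, the `(dc + 4) >> 3` scaling and the counting stub are all hypothetical stand-ins, not FFmpeg's actual interfaces:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for an existing (possibly asm) idct_add. */
typedef void (*idct_add_fn)(uint8_t *dst, ptrdiff_t stride, int16_t *block);

/* Stand-in for the existing asm routine; it only counts calls. */
static int full_calls;
static void full_idct_add_stub(uint8_t *dst, ptrdiff_t stride, int16_t *block)
{
    (void)dst; (void)stride; (void)block;
    full_calls++;
}

/* Hypothetical wrapper: DC-only fast path in C, otherwise fall back. */
static void idct_add_dc_check(uint8_t *dst, ptrdiff_t stride, int16_t *block,
                              int last_index, idct_add_fn full_idct_add)
{
    if (last_index == 0) {
        /* ASSUMPTION: DC-only output is dc/8, rounded to nearest. */
        int offset = (block[0] + 4) >> 3;
        for (int y = 0; y < 8; y++, dst += stride)
            for (int x = 0; x < 8; x++) {
                int v = dst[x] + offset;
                dst[x] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
            }
        return;
    }
    full_idct_add(dst, stride, block);
}
```

As iive notes earlier, doing the check here rather than in the caller adds an indirect call on the slow path, so the placement of the check is worth benchmarking.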
[21:43] <cone-248> ffmpeg.git Michael Niedermayer master:9a7d332b923b: avcodec/aacenc: dont use global quality if its negative
[21:43] <cone-248> ffmpeg.git Michael Niedermayer master:ddeb58b90c41: avcodec/asvenc: dont use a negative global_quality
[22:17] <cone-248> ffmpeg.git James Almer master:61eea421b23f: x86/dsputilenc: port sum_abs_dctelem functions to yasm
[23:41] <BBB> saves 11 cycles?
[23:41] <BBB> he probably means 1.1 (i.e. 1) right?
[00:00] --- Sun May 25 2014