[FFmpeg-devel-irc] IRC log for 2010-06-17

irc at mansr.com irc at mansr.com
Fri Jun 18 02:00:18 CEST 2010


[00:28:49] <BBB> and we're back!
[00:28:54] <BBB> Dark_Shikari: ping :-p
[00:32:51] <Dark_Shikari> BBB: ok
[00:33:12] <Dark_Shikari> we've demonstrated how macros can allow us to template a function
[00:33:19] <Dark_Shikari> now we will demonstrate how macros allow us to simplify a function
[00:33:26] <BBB> ok
[00:33:42] <Dark_Shikari> line 144, dct-a.asm
[00:33:58] <BBB> add4x4_idct
[00:34:10] <Dark_Shikari> Isn't that function simple?
[00:34:15] <Dark_Shikari> make a zero
[00:34:17] <Dark_Shikari> load our dct coeffs
[00:34:20] <Dark_Shikari> IDCT_1D
[00:34:21] <Dark_Shikari> transpose
[00:34:25] <Dark_Shikari> add rounding factor
[00:34:26] <Dark_Shikari> IDCT_1D
[00:34:28] <Dark_Shikari> STORE_DIFF
[00:34:58] <BBB> there's an unused label skip_prologue
[00:35:10] <BBB> I'm sure these macros do a lot of weird stuff :)
[00:35:19] <Dark_Shikari> skip_prologue is used elsewhere
[00:35:32] <Dark_Shikari> it lets you call that function without the init part
[00:35:35] <Dark_Shikari> this is used in all the idcts
[00:35:41] <Dark_Shikari> so suppose you have an 8x8 idct that does 4 4x4 idcts
[00:35:42] <BBB> oh ok
[00:35:49] <Dark_Shikari> you call "add4x4_idct_mmx.skip_prologue"
[00:36:05] <Dark_Shikari> thus you skip the initialization
[00:36:10] <Dark_Shikari> whether it be push push push, xor, or whatever
[00:36:18] <Dark_Shikari> and so you call that 4 times.
[00:36:19] <BBB> hmm...
[00:36:20] <BBB> interesting
[00:36:31] <Dark_Shikari> so as you can see here
[00:36:35] <Dark_Shikari> we've wrapped up the complexity in these macros
[00:36:38] <Dark_Shikari> some of them internally do SWAPs
[00:36:40] <Dark_Shikari> we don't care
[00:36:41] <Dark_Shikari> it handles it for us
[00:36:49] <Dark_Shikari> if we had to track the results of the swaps mentally, it would be hell
[00:36:59] <Dark_Shikari> and that's what it is for everyone else writing asm and not using x264asm.
[00:37:37] <Dark_Shikari> now, for the hardest and last bit of what I'll show you.
[00:37:45] <Dark_Shikari> line 263, sad-a.asm
[00:38:03] <BBB> call
[00:39:41] <BBB> is this what breaks up a NxN into 4 N/2xN/2 IDCTs?
[00:39:55] <Dark_Shikari> no
[00:40:04] <Dark_Shikari> er, are you in sad-a.asm?
[00:40:13] <BBB> oops, no
[00:40:14] <BBB> sorry
[00:40:17] <Dark_Shikari> But yes you're right
[00:40:18] <Dark_Shikari> That's what that does.
[00:40:24] <Dark_Shikari> in dct-a.asm :)
[00:40:30] <BBB> yeah, wrong file
[00:40:39] <BBB> intra_sad_x3_4x4
[00:40:45] <Dark_Shikari> so, you know about the 4x4 DC prediction function.
[00:40:47] <BBB> 3 function args, 3 registers
[00:40:47] <Dark_Shikari> We just did that earlier.
[00:40:50] <Dark_Shikari> Right?
[00:40:51] <BBB> yes
[00:41:02] <Dark_Shikari> well there are two other "simple" modes, H and V
[00:41:08] <Dark_Shikari> N A B C D
[00:41:10] <Dark_Shikari> E E E E E
[00:41:11] <Dark_Shikari> F F F F F
[00:41:13] <Dark_Shikari> G G G G G
[00:41:14] <Dark_Shikari> H H H H H
[00:41:17] <Dark_Shikari> that's V prediction
[00:41:21] <Dark_Shikari> set the Xs equal to the left side.
[00:41:27] <BBB> ok
[00:41:28] <Dark_Shikari> er, oops, that's H prediction, obviously
[00:41:30] <Dark_Shikari> horizontal
[00:41:33] <Dark_Shikari> V prediction is:
[00:41:35] <Dark_Shikari> N A B C D
[00:41:39] <Dark_Shikari> E A B C D
[00:41:41] <Dark_Shikari> F A B C D
[00:41:43] <Dark_Shikari> G A B C D
[00:41:45] <Dark_Shikari> H A B C D
[00:41:53] <Dark_Shikari> As you can see, both are very simple.
[00:42:02] <BBB> yeah, this is libavcodec's h264pred.c
[00:42:06] <Dark_Shikari> x264 has this function to perform a "merged SAD" on the three modes.
[00:42:13] <Dark_Shikari> That is, predict each one, SAD against source pixels
[00:42:16] <Dark_Shikari> and return the three SADs.
[00:42:24] <Dark_Shikari> Of course, it doesn't need to actually store the prediction, which is part of the gain here.
[00:42:39] <Dark_Shikari> So this function will calculate three SADs.
[00:42:53] <Dark_Shikari> The purpose of going through this function will be to get you to understand shuffles.
[00:43:06] <Dark_Shikari> a "shuffle" is any operation which does no arithmetic and only serves to reorder bytes.
[00:43:22] <Dark_Shikari> This may include arbitrary shuffles, interleaves, etc.
[00:43:39] <BBB> ok
[00:43:47] <Dark_Shikari> the first one to consider is punpck
[00:43:55] <Dark_Shikari> punpck(l|h) (bw|wd|dq|qdq)
[00:44:04] <BBB> (potential example: audio channel interleaving)
[00:44:06] <Dark_Shikari> this takes the (low|high) half of each of the input registers and interleaves by them
[00:44:12] <Dark_Shikari> s/by//
[00:44:16] <Dark_Shikari> it interleaves by:
[00:44:22] <Dark_Shikari> bw: bytes
[00:44:24] <Dark_Shikari> wd: words
[00:44:27] <Dark_Shikari> dq: doublewords
[00:44:30] <Dark_Shikari> qdq: quadwords
[00:44:35] <Dark_Shikari> bw == "bytes to words"
[00:44:36] <Dark_Shikari> i.e.
[00:44:50] <Dark_Shikari> punpcklbw ABCDEFGH, IJKLMNOP = AIBJCKDL
[00:44:53] <Dark_Shikari> got it?
[00:45:42] <BBB> I think so
[00:46:01] <Dark_Shikari> on those two inputs
[00:46:04] <Dark_Shikari> what does punpckhwd do?
[00:46:47] <BBB> EFMNGHOP?
[00:47:00] <Dark_Shikari> correct
[00:47:04] <Dark_Shikari> note qdq only applies for xmmregs
[00:47:10] <Dark_Shikari> it wouldn't make sense with 64-bit regs.
[00:47:19] <BBB> ok
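
The punpck behaviour described above can be modelled in a few lines of Python. This is only a sketch: the punpck helper is a hypothetical name, registers are written as strings with the lowest byte first (the same convention as the ABCDEFGH example), and element sizes are given in bytes (1 for bw, 2 for wd, 4 for dq).

```python
def punpck(a, b, half, size):
    """Model punpck{l,h}{bw,wd,dq} on two equally sized registers.

    a, b : register contents, lowest byte first
    half : "l" for the low half, "h" for the high half
    size : element size in bytes (1 = bw, 2 = wd, 4 = dq)
    """
    assert len(a) == len(b) and len(a) % (2 * size) == 0
    mid = len(a) // 2
    start = 0 if half == "l" else mid
    out = []
    for i in range(start, start + mid, size):
        out.append(a[i:i + size])  # element from the first operand
        out.append(b[i:i + size])  # matching element from the second operand
    return "".join(out)


print(punpck("ABCDEFGH", "IJKLMNOP", "l", 1))  # punpcklbw -> AIBJCKDL
print(punpck("ABCDEFGH", "IJKLMNOP", "h", 2))  # punpckhwd -> EFMNGHOP
```

The two printed results match the example and the quiz answer given in the log.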
[00:47:27] <BBB> so it's 2 mm or 2 xmm regs
[00:47:30] <BBB> it's never 1 mm and 1 xmm
[00:48:18] <Dark_Shikari> there are only two ops that work on mm and xmm
[00:48:23] <Dark_Shikari> movq2dq
[00:48:26] <Dark_Shikari> and movdq2q
[00:48:28] <Dark_Shikari> you can guess what those do ;)
[00:49:00] <BBB> :)
[00:49:47] <Dark_Shikari> so, the other shuffles:
[00:49:51] <Dark_Shikari> mmx shuffles:
[00:49:56] <Dark_Shikari> pshufw dst, src, mask
[00:50:01] <Dark_Shikari> the "mask" determines which words to put where
[00:50:16] <Dark_Shikari> each 2 bit chunk is the index from the source to use for that word in the destination
[00:50:29] <Dark_Shikari> e.g. "2" means to use src[2]
[00:50:38] <Dark_Shikari> sse2 shuffles:
[00:50:46] <Dark_Shikari> pshufd dst, src, mask (same as mmx, but for 32-bit)
[00:51:00] <Dark_Shikari> pshuflw dst, src, mask (only shuffles low half, copies top half)
[00:51:02] <Dark_Shikari> pshufhw (you can guess)
[00:51:09] <Dark_Shikari> ssse3 shuffles:
[00:51:17] <Dark_Shikari> pshufb dst, mask (dst is shuffled in place)
[00:51:23] <Dark_Shikari> where mask is a 128-bit reg, and each byte contains an index
[00:51:29] <Dark_Shikari> i.e. completely arbitrary shuffle of the whole reg.
[00:51:35] <Dark_Shikari> You can see why this is awesome.
[00:51:45] <BBB> any byte can go anywhere in the dest
[00:52:04] <Dark_Shikari> Yup
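
A small Python model may help make the mask encodings concrete. It is a sketch under these assumptions: registers are lists with the lowest element first, and the function names simply mirror the instruction names. The pshufw model also covers pshufd if the list holds four dwords instead of four words. In the real pshufb, a mask byte with its top bit set writes zero to that destination byte, which the model reproduces.

```python
def pshufw(src, imm8):
    """Model pshufw: destination word i is src[bits 2i+1..2i of imm8]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def pshufb(reg, mask):
    """Model the 128-bit pshufb: each mask byte selects a byte of reg.

    A mask byte with bit 7 set produces 0 instead of a selected byte.
    """
    return [0 if m & 0x80 else reg[m & 0x0F] for m in mask]


words = [10, 11, 12, 13]
print(pshufw(words, 0x00))  # [10, 10, 10, 10]: splat word 0
print(pshufw(words, 0x1B))  # [13, 12, 11, 10]: reverse the words

reg = list(range(100, 116))
reverse_mask = list(range(15, -1, -1))
print(pshufb(reg, reverse_mask) == reg[::-1])  # True: arbitrary byte reorder
```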
[00:52:27] <Dark_Shikari> now that you get the shuffles, let's go do this function.
[00:52:34] <Dark_Shikari> in this function, we have a few goals.
[00:52:40] <Dark_Shikari> 1) Get the source pixels into two mmx registers
[00:52:50] <Dark_Shikari> we have 16 source pixels, so we want to get them into two mmx registers (8 bytes each)
[00:52:57] <Dark_Shikari> with this, it takes only two SADs to calculate the total SAD.
[00:53:15] <Dark_Shikari> 2) calculate the V prediction, put it into two mmx registers accordingly, and SAD.
[00:53:21] <Dark_Shikari> 3) calculate the H prediction, put it into two mmx registers, and SAD
[00:53:30] <Dark_Shikari> 4) calculate the DC prediction, splat it across an mmx register, SAD.
[00:53:32] <Dark_Shikari> 5) store the results.
[00:53:33] <Dark_Shikari> got it?
[00:53:41] <Dark_Shikari> this means we will need a total of 6 SADs.
[00:54:03] <BBB> I got it
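
For reference, here is a scalar Python sketch of what the merged function computes, ignoring how the asm arranges registers. The assumptions are: the top edge (A B C D) and left edge (E F G H) are passed in as separate lists rather than read from fdec, the source block is a flat list of 16 pixels in raster order, the results are returned in V, H, DC order, and the DC value uses the (sum + 4) >> 3 rounding discussed later in the log. The Python function name merely echoes the asm function.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def intra_sad_x3_4x4(src, top, left):
    """Return (sad_v, sad_h, sad_dc) for a 4x4 block.

    src  : 16 source pixels in raster order
    top  : [A, B, C, D], the pixels above the block
    left : [E, F, G, H], the pixels to the left of the block
    """
    pred_v = list(top) * 4                           # every row is A B C D
    pred_h = [p for p in left for _ in range(4)]     # row y is left[y] repeated
    dc = (sum(top) + sum(left) + 4) >> 3             # rounded mean of 8 edge pixels
    pred_dc = [dc] * 16
    return sad(src, pred_v), sad(src, pred_h), sad(src, pred_dc)


top = [10, 20, 30, 40]
left = [50, 60, 70, 80]
src = top * 4  # a block that the V prediction matches exactly
print(intra_sad_x3_4x4(src, top, left))  # (0, 640, 320)
```

Note that the predictions are built here only to keep the sketch readable; as the log says, the real function never stores them.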
[00:54:49] <Dark_Shikari> k, so, let's go through the function.
[00:54:53] <Dark_Shikari> First we zero mm7.  We'll use this later.
[00:55:12] <BBB> in the function prototype, what is fenc and what is fdec?
[00:55:21] <Dark_Shikari> FENC = source pixels
[00:55:22] <BBB> res is storage of results of V,H,DC prediction SAD
[00:55:27] <Dark_Shikari> FDEC = reconstructed pixels
[00:55:36] <Dark_Shikari> thus, fdec contains the edge pixels for prediction
[00:55:42] <Dark_Shikari> fenc contains the source pixels we're going to compare against
[00:55:43] <Dark_Shikari> the SAD
[00:55:53] <BBB> ok
[00:55:57] <Dark_Shikari> so, by line 270, the following is the case:
[00:56:15] <Dark_Shikari> if our source pixels are numbered 0 to 15 in raster order
[00:56:21] <Dark_Shikari> mm1 contains 0...7
[00:56:23] <Dark_Shikari> mm2 contains 8...15
[00:56:32] <Dark_Shikari> mm0 contains ABCDABCD (from the chart before)
[00:56:38] <Dark_Shikari> Do you see why?
[00:56:54] <Dark_Shikari> btw, if at any point you don't know why a particular decision was made (even if it works), ask.
[00:58:17] <BBB> movd is a dword move, right?
[00:58:22] <BBB> so why does mm1 contain 8 bytes?
[00:58:38] <BBB> oh, the punpckldq
[00:58:39] <BBB> I see
[00:58:40] <Dark_Shikari> punpckldq mm1, [r0+FENC_STRIDE*1]
[00:58:45] <BBB> why don't you move 8 bytes at once?
[00:58:53] <Dark_Shikari> You can't.
[00:58:55] <BBB> movqu or so?
[00:58:56] <Dark_Shikari> the source is an array of stride 32
[00:59:03] <Dark_Shikari> er, actually, stride FENC_STRIDE
[00:59:06] <Dark_Shikari> and of width 4
[00:59:13] <Dark_Shikari> we're loading a 4x4 block of pixels, that is
[00:59:18] <Dark_Shikari> you can't move 8 at once if they're not adjacent.
[00:59:21] <BBB> ah, of course, stride!=width
[00:59:22] <BBB> ok
[00:59:31] <BBB> got it then
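
The strided load can be pictured with a short Python sketch. The value of FENC_STRIDE below is an assumption made for illustration (the log only names the constant), and load_4x4 is a hypothetical helper: each 4-byte slice stands for one movd or the memory operand of one punpckldq.

```python
FENC_STRIDE = 16  # assumed value for this sketch


def load_4x4(buf, offset, stride):
    """Pack a 4x4 block from a strided buffer into two 8-byte values.

    Rows are 4 bytes wide but `stride` bytes apart, so each row needs
    its own 4-byte load; two rows are then joined per 8-byte register.
    """
    rows = [buf[offset + y * stride: offset + y * stride + 4] for y in range(4)]
    mm1 = rows[0] + rows[1]  # movd row 0, punpckldq with row 1 -> pixels 0..7
    mm2 = rows[2] + rows[3]  # movd row 2, punpckldq with row 3 -> pixels 8..15
    return mm1, mm2


buf = bytes(range(64))
mm1, mm2 = load_4x4(buf, 0, FENC_STRIDE)
print(list(mm1))  # [0, 1, 2, 3, 16, 17, 18, 19]
print(list(mm2))  # [32, 33, 34, 35, 48, 49, 50, 51]
```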
[01:01:01] <Dark_Shikari> so, next
[01:01:18] <Dark_Shikari> we back up mm0, the vertical prediction pixels, in mm6
[01:01:22] <Dark_Shikari> Because we're going to need these later.
[01:01:37] <Dark_Shikari> Then, because we don't want to overwrite the source pixels (we need those later too), we movq mm3, mm1
[01:01:43] <Dark_Shikari> then we do our two SADs for the vertical prediction
[01:01:45] <Dark_Shikari> add the results
[01:01:49] <Dark_Shikari> and move it out to [r2]
[01:01:55] <Dark_Shikari> And we're 1/3 done!
[01:01:57] <Dark_Shikari> got it?
[01:02:01] <BBB> yes
[01:03:17] <Dark_Shikari> ok, now the next two parts are interleaved
[01:03:21] <Dark_Shikari> so it may be slightly harder to follow
[01:03:31] <Dark_Shikari> now, we need EFGH
[01:03:33] <Dark_Shikari> But we have a problem.
[01:03:44] <Dark_Shikari> Each one is on a separate line.
[01:03:48] <Dark_Shikari> We can only load one byte at a time!  This sucks.
[01:04:10] <Dark_Shikari> Furthermore, in addition to EFGH, we need EEEEFFFFGGGGHHHH
[01:04:14] <Dark_Shikari> this is the H prediction we want to SAD against.
[01:04:23] <BBB> right
[01:04:30] <Dark_Shikari> So, now comes the swarm of punpck.
[01:05:01] <Dark_Shikari> Note... there is no SIMD load smaller than movd.
[01:05:07] <Dark_Shikari> so, in order to avoid crossing cacheline needlessly (this doesn't increase the number of unpacks necessary to get what we want), we load [src-4]
[01:05:10] <Dark_Shikari> not [src-1]
[01:05:26] <Dark_Shikari> so mm3 = _ _ _ E
[01:05:30] <Dark_Shikari> mm0 = _ _ _ F
[01:05:32] <BBB> so this loads BCDE, NNNF, NNNG etc
[01:05:44] <Dark_Shikari> no, NNNE
[01:05:52] <Dark_Shikari> after punpcklbw, we have:
[01:05:55] <Dark_Shikari> _ _ _ _ _ _ E F
[01:06:01] <Dark_Shikari> and _ _ _ _ _ _ G H
[01:06:21] <BBB> yes
[01:06:44] <Dark_Shikari> then we do it again with
[01:06:48] <Dark_Shikari> punpckhwd mm5, mm4
[01:06:54] <Dark_Shikari> giving us _ _ _ _ E F G H
[01:07:08] <Dark_Shikari> And -- oh wait -- that mm6 we saved comes back
[01:07:14] <Dark_Shikari> E F G H A B C D
[01:07:20] <Dark_Shikari> and that mm7 we zeroed comes back
[01:07:31] <Dark_Shikari> psadbw with mm7.... bam.  A+B+C+D+E+F+G+H.
[01:08:04] <Dark_Shikari> got it so far?
[01:08:09] <BBB> yes
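
The unpack chain just described can be replayed in Python to check that it really ends in E F G H A B C D. This is a sketch, not the actual x264 instruction sequence: register allocation is ignored, the helper names are hypothetical, and each left-edge row is modelled as the four bytes read by a [src-4] load, so the wanted pixel is the last of the four.

```python
def punpck(a, b, half, size):
    """Interleave `size`-byte elements from the low or high half of a and b."""
    mid = len(a) // 2
    start = 0 if half == "l" else mid
    out = []
    for i in range(start, start + mid, size):
        out += a[i:i + size] + b[i:i + size]
    return out


def movd(four_bytes):
    """movd into an mmx register: 4 bytes loaded, upper 4 bytes zeroed."""
    return list(four_bytes) + [0, 0, 0, 0]


def psadbw(a, b):
    """Sum of absolute byte differences; against zero it is a plain byte sum."""
    return sum(abs(x - y) for x, y in zip(a, b))


def gather_edges(top, left_rows):
    mm6 = list(top) + list(top)          # A B C D A B C D, saved earlier
    r = [movd(row) for row in left_rows]
    ef = punpck(r[0], r[1], "l", 1)      # punpcklbw: _ _ _ _ _ _ E F
    gh = punpck(r[2], r[3], "l", 1)      # punpcklbw: _ _ _ _ _ _ G H
    efgh = punpck(ef, gh, "h", 2)        # punpckhwd: _ _ _ _ E F G H
    return punpck(efgh, mm6, "h", 4)     # punpckhdq: E F G H A B C D


top = [1, 2, 3, 4]                                   # A B C D
left_rows = [[9, 9, 9, 5], [9, 9, 9, 6],             # ..., E / ..., F
             [9, 9, 9, 7], [9, 9, 9, 8]]             # ..., G / ..., H
edges = gather_edges(top, left_rows)
print(edges)                    # [5, 6, 7, 8, 1, 2, 3, 4]
print(psadbw(edges, [0] * 8))   # 36, i.e. A+B+C+D+E+F+G+H
```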
[01:08:24] <Dark_Shikari> by line 290, we have EEEEFFFFGGGGHHHH
[01:08:29] <Dark_Shikari> for the H prediction.
[01:08:33] <Dark_Shikari> see how that works?
[01:08:38] <Dark_Shikari> follow the punpcks.
[01:08:52] <BBB> yes
[01:09:10] <BBB> because you always do hxtoy, with x one bigger than in the previous call
[01:09:10] <Dark_Shikari> now, we need to do + 4 and >> 3
[01:09:14] <Dark_Shikari> Yeah
[01:09:22] <Dark_Shikari> But now, we want to do +4
[01:09:27] <Dark_Shikari> but crap, you can't have immediate constants in simd.
[01:09:34] <Dark_Shikari> And we'd rather not do a load from a memory constant.
[01:09:40] <Dark_Shikari> But we have a zero reg lying around.
[01:09:51] <Dark_Shikari> So there's another trick
[01:10:00] <Dark_Shikari> (A+4)>>3 is the same as ((A>>2)+1)>>1
[01:10:09] <Dark_Shikari> so we do psraw, then pavgw
[01:10:42] <BBB> ah, smart
[01:10:48] <Dark_Shikari> same number of instructions
[01:10:52] <Dark_Shikari> as add+shift
[01:10:55] <Dark_Shikari> but no constant needed.
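
The identity behind this trick is easy to check exhaustively. The sketch below assumes the input is the non-negative sum of eight byte-sized edge pixels (0 to 2040) and models pavgw as (a + b + 1) >> 1, which is how the instruction rounds; averaging with the zero register therefore supplies the "+1, >>1" step. The function names are hypothetical.

```python
def pavgw(a, b):
    """Model pavgw on one word lane: rounded-up average."""
    return (a + b + 1) >> 1


def dc_direct(s):
    """The textbook form, which needs the constant 4."""
    return (s + 4) >> 3


def dc_trick(s):
    """Shift right by 2, then pavgw against a zero register."""
    return pavgw(s >> 2, 0)


MAX_SUM = 8 * 255  # largest possible sum of eight bytes
mismatches = [s for s in range(MAX_SUM + 1) if dc_direct(s) != dc_trick(s)]
print(mismatches)  # []
```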
[01:11:03] <Dark_Shikari> Now, with a quick punpck and pshufw with 0 (a splat)
[01:11:08] <Dark_Shikari> we have DC DC DC DC DC DC DC DC
[01:11:26] <Dark_Shikari> With 4 quick SADs, we have both our DC and H scores
[01:11:35] <Dark_Shikari> we add those up and store them
[01:11:42] <Dark_Shikari> and then we return.
[01:12:08] <Dark_Shikari> And we're done with the veritable storm of punpcks.
[01:12:48] <BBB> scary stuff to read through
[01:12:56] <BBB> but when you explain it makes some sense :)
[01:13:15] <Dark_Shikari> The first functions you write will be something like pixel_avg2
[01:13:15] * BBB foresees a lot of reading asm
[01:13:37] <Dark_Shikari> By the way, for an example of a function very similar to what you will be writing
[01:13:54] <Dark_Shikari> The hpel interpolation in mc-a2.asm is much like what you will be doing
[01:14:07] <Dark_Shikari> e.g. x[-2]*coeff1 + x[-1]*coeff2 + ... x[3]*coeff6
[01:14:11] <Dark_Shikari> + round >> shift
[01:14:23] <BBB> right
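
As a concrete instance of that "taps, round, shift" shape, here is a scalar Python sketch of a horizontal 6-tap half-pel filter. It uses the H.264 luma taps (1, -5, 20, 20, -5, 1), a rounding term of 16 and a shift of 5; other codecs use different coefficients, and hpel_h is a hypothetical name, not the routine in mc-a2.asm.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 half-pel luma filter, sums to 32


def hpel_h(row, x):
    """Half-pel sample between row[x] and row[x + 1].

    Reads row[x - 2] .. row[x + 3], so the caller must supply padding.
    """
    acc = sum(c * row[x - 2 + k] for k, c in enumerate(TAPS))
    return max(0, min(255, (acc + 16) >> 5))  # round, shift, clip to 8 bits


print(hpel_h([100] * 6, 2))                   # 100: flat input stays flat
print(hpel_h([0, 0, 0, 255, 255, 255], 2))    # 128: midpoint of a step edge
```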
[01:14:46] <Dark_Shikari> what CPU do you have?
[01:15:01] <BBB> intel core duo
[01:15:13] <BBB> I don't know exactly what extensions it has, should be quite a lot
[01:15:45] <Dark_Shikari> er, core duo?
[01:15:46] <Dark_Shikari> or core 2?
[01:15:58] <BBB> core duo
[01:16:24] <Dark_Shikari> that's bad.
[01:16:26] <Dark_Shikari> get something better, fast.
[01:16:43] <Dark_Shikari> Core Duo == nothing above SSE3 (no SSSE3), and even SSE2 is far slower than MMX
[01:16:52] <Dark_Shikari> it microcodes every single SSE2 op by converting it to MMX ops
[01:17:46] <BBB> haha :)
[01:17:51] <BBB> I'm getting a new one in 1-2 months
[01:17:57] <BBB> latest mac whatever
[01:18:01] <BBB> but takes another 1-2 months
[01:18:19] <Dark_Shikari> find a better one on ssh
[01:18:28] <BBB> I can start with mmx, no?
[01:18:39] <Dark_Shikari> true
[01:18:40] <Dark_Shikari> but still
[01:18:51] <Dark_Shikari> btw, FYI, the greatest instruction ever for motion compensation
[01:18:58] <BBB> \o/
[01:19:07] <Dark_Shikari> pmaddubsw
[01:19:08] <BBB> that's exactly what I'll start with then
[01:19:19] <Dark_Shikari> For inputs ABCD ... , EFGH ...
[01:19:22] <Dark_Shikari> output:
[01:19:40] <Dark_Shikari> (A*E + B*F), (C*G + D*H), ...
[01:19:45] <Dark_Shikari> ABCD are uint8_t
[01:19:47] <Dark_Shikari> EFGH are int8_t
[01:20:05] <Dark_Shikari> so first you interleave two sets of 8 bytes
[01:20:12] <Dark_Shikari> then you multiply by two interleaved sets of MC coefficients
[01:20:18] <Dark_Shikari> so for h264, for example, the coeffs are 1 -5 20 20 -5 1
[01:20:22] <Dark_Shikari> so you can do
[01:20:34] <Dark_Shikari> movq, [src-1]
[01:20:36] <Dark_Shikari> er
[01:20:39] <Dark_Shikari> movq xmm0, [src-1]
[01:20:48] <Dark_Shikari> movq xmm1, [src]
[01:20:51] <Dark_Shikari> punpcklbw xmm0, xmm1
[01:21:07] <Dark_Shikari> pmaddubsw xmm0, {-5,20,-5,20...|
[01:21:14] <Dark_Shikari> s/|/}/
[01:21:24] <BBB> what about alignment?
[01:21:28] <Dark_Shikari> movq requires no alignment
[01:21:34] <BBB> pmaddubsw?
[01:21:41] <Dark_Shikari> {} means "a constant"
[01:21:45] <Dark_Shikari> you can put that in a register
[01:21:49] <BBB> hmm....
[01:21:51] <BBB> ok
[01:21:57] <Dark_Shikari> of course this is an ssse3 instruction.
[01:22:02] <BBB> and save the register for reuse in each row
[01:22:05] <Dark_Shikari> yeah
[01:22:12] <BBB> I guess I don't have ssse3, do I?
[01:22:16] <Dark_Shikari> nope
[01:22:26] <Dark_Shikari> here's a way of doing that in SSE2:
[01:22:30] <Dark_Shikari> movq xmm0, [src-1]
[01:22:32] <Dark_Shikari> movq xmm1, [src]
[01:22:32] <BBB> which is ssse3? punpcklbw?
[01:22:37] <Dark_Shikari> pmaddubsw
[01:22:49] <Dark_Shikari> punpcklbw xmm0, ZERO (some zero register)
[01:22:51] <Dark_Shikari> punpcklbw xmm1, ZERO
[01:23:03] <Dark_Shikari> pmullw xmm0, pw_5
[01:23:07] <Dark_Shikari> pmullw xmm1, pw_20
[01:23:14] <Dark_Shikari> where pw_5 means a constant with repeated 5, of size word
[01:24:06] <Dark_Shikari> oh, and at the end you'd have to do psubw xmm1, xmm0
[01:24:17] <Dark_Shikari> so with sse2, it's 2 loads, 2 unpacks, 2 multiplies, one subtract
[01:24:25] <Dark_Shikari> with ssse3 it's 2 loads, 1 unpack, 1 multiply
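
The two instruction sequences can be compared with a Python model. This is a sketch with hypothetical function names: it covers only the -5/20 pair of taps from the example, treats registers as lists with the lowest element first, and models pmaddubsw as unsigned byte times signed byte, adjacent products added with signed 16-bit saturation.

```python
def sat16(v):
    """Clamp to the signed 16-bit range, as pmaddubsw does on its sums."""
    return max(-32768, min(32767, v))


def pmaddubsw(unsigned_bytes, signed_bytes):
    """Multiply byte pairs and add adjacent products into words."""
    return [sat16(unsigned_bytes[i] * signed_bytes[i] +
                  unsigned_bytes[i + 1] * signed_bytes[i + 1])
            for i in range(0, len(unsigned_bytes), 2)]


def ssse3_path(row, x):
    a = row[x - 1:x + 7]                                # movq xmm0, [src-1]
    b = row[x:x + 8]                                    # movq xmm1, [src]
    interleaved = [v for pair in zip(a, b) for v in pair]  # punpcklbw xmm0, xmm1
    return pmaddubsw(interleaved, [-5, 20] * 8)         # one multiply-add


def sse2_path(row, x):
    a = row[x - 1:x + 7]                                # movq + punpcklbw with zero
    b = row[x:x + 8]                                    # movq + punpcklbw with zero
    a5 = [5 * v for v in a]                             # pmullw xmm0, pw_5
    b20 = [20 * v for v in b]                           # pmullw xmm1, pw_20
    return [q - p for p, q in zip(a5, b20)]             # psubw xmm1, xmm0


row = list(range(0, 160, 10))   # 16 sample pixels: 0, 10, ..., 150
print(ssse3_path(row, 1))       # [200, 350, 500, 650, 800, 950, 1100, 1250]
print(sse2_path(row, 1) == ssse3_path(row, 1))  # True
```

With byte inputs the largest magnitude here is 20 * 255 = 5100, so neither the saturation in pmaddubsw nor the 16-bit wraparound of pmullw comes into play.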
[01:25:31] <BBB> I guess I can do sse2, if mmx is too limiting
[01:25:48] <Dark_Shikari> of course, when I talk about mmx
[01:25:50] <Dark_Shikari> I mean "mmxext"
[01:25:53] <Dark_Shikari> aka mmx + isse
[01:26:02] <Dark_Shikari> a few instructions we've talked about are isse-only
[01:26:04] <Dark_Shikari> for example, pshufw
[01:26:08] <Dark_Shikari> psadbw
[01:26:23] <Dark_Shikari> sse2 is mmx + mmxext in 128-bit registers.  no new integer instructions are in sse2.
[01:26:36] <Dark_Shikari> (other than the ones which are natural generalizations of mmx to 128-bit)
[01:26:41] <Dark_Shikari> no functionally new stuff.
[01:26:49] <Dark_Shikari> so, equally, mmxext is sse2 in 64-bit instead of 128-bit.
[01:27:25] <BBB> and then ssse3 is new stuff
[01:27:38] <BBB> which my poor mac doesn't have because it's 5 yrs old :(
[01:28:21] <Dark_Shikari> more
[01:28:23] <Dark_Shikari> core 2 came out in 2005
[01:28:42] <BBB> yeah, I bought mine in spring, summer was when core2 came out I think
[01:29:18] <BBB> ok, I'm gonna look tonight
[01:29:32] <BBB> I'll have many silly questions tomorrow
[01:30:07] <Dark_Shikari> no problem
[01:31:18] <BBB> now I'll go entertain the wife a bit ;)
[01:31:24] <BBB> thanks for the tutorial!
[01:31:32] <Dark_Shikari> welcome
[01:31:42] <BBB> you should record this on your blog, it's actually really useful, others could learn from this too
[01:31:49] <BBB> not sure how to do the interactivity :)
[01:35:15] <lu_zero> janneg: pong
[06:21:28] <KotH> moin boys
[06:22:14] * Gottaname|Mobili is lost in the ffserver
[06:22:33] * Gottaname|Mobili is replacing the .conf file of ffserver with... mysql
[06:23:12] <Gottaname|Mobili> =3
[06:33:29] * Gottaname|Mobili rolls hyc around
[07:25:14] <av500> gm
[07:48:27] <wbs> superdump: do you have time to give this guy a follow-up on the discussion on libvorbis channel mapping in encoding? http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2010-June/089909.html
[07:52:22] <superdump> wbs: i'll just check the 7 and 8 channel orders in a minute
[09:32:34] <mru> moroning
[09:33:19] <av500> that bad?
[09:34:42] <mru> morons are everywhere
[09:34:58] <mru> some of them even mormons
[09:40:32] <CIA-98> ffmpeg: mstorsjo * r23636 /trunk/libavformat/http.h: Add the necessary includes, add an extra empty line for cosmetics
[09:56:08] <lu_zero> mru: it's a wonderful day!
[09:56:39] <mru> are you sure?
[09:56:47] * mru is deep in benchmarking hell
[09:57:31] <av500> how does it perform against heaven?
[09:57:36] * elenril thinks wonderful days are a lie
[09:58:03] * av500 ignores elenril unless he backs it up with a trope
[09:59:51] <lu_zero> vlc pukes on feng streams
[10:02:20] <elenril> http://tvtropes.org/pmwiki/pmwiki.php/Main/ItGotWorse
[10:02:46] <mru> Dark_Shikari: http://www.macrumors.com/2010/06/17/onlives-gaming-on-demand-service-demoed-on-an-ipad/
[10:09:45] <ohsix> hi; media guys, what can you expect from something doing ac3 -> 5.1 -> ac3 again (then out spdif)
[10:10:02] <kshishkov> some loss of quality
[10:10:13] <ohsix> and how does it change with the source bit rate
[10:10:25] <av500> 5.1 is a codec?
[10:10:39] * mru thought 5.1 was a number
[10:10:50] <mru> he compresses all the audio into a single number
[10:10:50] <kshishkov> av500: yes, but 7.1 is a cooler codec
[10:10:54] <ohsix> are normal people going to notice if the 5,1 -> ac3 is high bitrate?
[10:11:03] <mru> ah, two numbers
[10:11:12] <ohsix> 5,1 to suggest its being decoded to discrete streams
[10:11:13] <kshishkov> mru: so it's stereo
[10:11:35] <av500> ok, so ac3 -> pcm -> ac3
[10:11:42] <ohsix> yea
[10:11:54] <mru> why?
[10:12:19] <kshishkov> maybe to edit and then output to spdif
[10:12:49] <mru> then he should say so
[10:12:54] <ohsix> yea, or if some other stream needs to mix into some channels
[10:13:08] <mru> so the source being ac3 is irrelevant
[10:13:15] <ohsix> but its not important; just asking how to characterize it
[10:13:22] <mru> you're asking how good the ac3 encoder is
[10:13:46] <mru> I can tell you it's better than the vorbis encoder
[10:13:50] <ohsix> an ac3 encoder; if we're talking ffmpeg it suffices
[10:14:02] <CIA-98> ffmpeg: michael * r23637 /trunk/libavfilter/vsrc_buffer.h: add #include so make checkheaders passes
[10:14:20] <mru> it's possible to create an arbitrarily sucky ac3 encoder
[10:14:34] * kshishkov waves in friendly way
[10:14:35] <ohsix> i'm more concerned with the transition in general, and the output bitrate can be as high as possible
[10:14:59] <av500> transition?
[10:15:07] <av500> it will be ac3 ->pcm in any case
[10:15:11] <ohsix> ac3 -> pcm -> ac3
[10:15:14] <mru> ac3 is capable of encoding very good quality
[10:15:16] <av500> unless you look at the md5sums only
[10:15:53] <ohsix> right, not looking for the same bitstream; but what it means for the audio, i'm not at all familiar with ac3
[10:16:07] <av500> ohsix: make the final encode at max bitrate
[10:16:08] <mru> ac3 is a lossy mdct-based encoder
[10:16:34] <mru> it runs up to 640kbps
[10:17:07] <ohsix> just wondering if it will be acceptable to do at all; it'll be a worst case situation, if it shits itself past a reencode then its not even a worst case :P
[10:17:28] <av500> should not
[10:17:32] <ohsix> cool
[10:17:54] <av500> and of course, just try it
[10:18:10] <kshishkov> av500: that'd spoil all the fun
[10:18:16] <ohsix> people are debating how to do ac3 decently in pulse without essentially bypassing it entirely (and not just using pasuspender)
[10:18:24] <mru> lol
[10:18:39] <mru> of course _they_ can't do it properly
[10:19:11] <ohsix> alsa can encode ac3 if it needs to; so pulse can mix in other streams if it has to, and have alsa encode it for output
[10:19:26] <mru> don't trust either of those
[10:19:38] <ohsix> heh
[10:20:03] <av500> pasuspender? lol
[10:20:09] <ohsix> theres no reason it couldn't accept ac3 and decode it; then at least it would interoperate and still act like pulse
[10:20:27] <mru> act like pulse == fail randomly
[10:20:33] <av500> ohsix: the api allows to send non-pcm to pa?
[10:20:40] <ohsix> av500: ya, it'll get it off the alsa devices as long as it runs
[10:20:49] <ohsix> not yet
[10:21:40] * av500 thinks the existence of pasuspender proves that stuff like pa is a fail
[10:21:51] <ohsix> personally i think if people want to use ac3 then they want to bypass pulse already; but people are proposing mutilating it to pass ac3 and do nothing pulse-y while it does so
[10:21:56] <mru> what av500 said
[10:22:04] <ohsix> heh?
[10:22:36] <mru> you don't see xsuspender, do you?
[10:22:44] * elenril eats popcorn
[10:22:46] <mru> that's because X works
[10:22:56] <mru> despite freedesktop and x.org doing their best to break it
[10:22:56] <ohsix> i don't suspend x to have other things use my gpu
[10:23:08] * av500 needs governmentsuspender
[10:23:28] * mru waits for the 5th of November
[10:23:31] <ohsix> and that analogy only works if you propose all existing alsa clients become pulse native clients
[10:23:44] <mru> the root of the problem is alsa
[10:23:48] <av500> +1
[10:23:56] <ohsix> eh?
[10:24:04] <mru> if alsa were decent, all the ugly hacks on top would be unnecessary
[10:24:09] <av500> yep
[10:24:09] <elenril> what is so wrong with alsa?
[10:24:14] <mru> all of it
[10:24:15] <ohsix> what ugly hacks
[10:24:17] <av500> it needs pa
[10:24:24] <mru> ohsix: pulseaudio for starters
[10:24:29] <av500> ohsix: if alsa is not wrong, why do you need pa?
[10:24:37] <ohsix> a lot of people abuse alsa & don't know anything about it, thats not alsas fault
[10:24:43] <av500> of course it is
[10:24:45] <mru> http://blogs.adobe.com/penguin.swf/linuxaudio.png
[10:24:50] <KotH> ohsix: have you ever seen a kernel api spec of alsa?
[10:25:02] <av500> ohsix: apis that are so easily abusable are wrong
[10:25:13] <mru> KotH: they _intentionally_ refuse to document it
[10:25:18] <KotH> exactly
[10:25:22] <elenril> what?
[10:25:26] <mru> I know you know, but maybe not the rest
[10:25:40] <ohsix> av500: you get every facet and toggle in the hardware, thats "wrong"
[10:25:45] <KotH> mru: i know you know i know
[10:25:47] <mru> but at least the alsa devs are reasonably friendly peoply
[10:25:47] <KotH> :)
[10:25:50] <mru> people
[10:26:01] <ohsix> you don't use the kernel api, you use asound :\
[10:26:25] * av500 uses libaudiomixer :)
[10:26:30] <KotH> ohsix: yes.. and when your kernel code doesnt match the lib you have, all hell breaks loose
[10:26:31] <ohsix> maintainers of drivers might need to; but users and client writers shouldn't need to know its even there
[10:26:53] <KotH> ohsix: and because you have no idea what changed in the kernel api, you cannot even know which lib version to use
[10:26:53] <ohsix> that sucks; but i have a distribution, thats their problem
[10:27:04] <ohsix> couldn't you bisect?
[10:27:09] <KotH> ohsix: i have a distribution too, but i need custom kernels
[10:27:25] <KotH> oh, yea.. bisecting kernel and libalsa everytime i update one of those
[10:27:29] <KotH> thanks.. but NO THANKS
[10:27:36] <mru> kernel policy is to maintain backwards compat as much as possible, except for alsa
[10:27:40] <ohsix> you can check out and build out of tree drivers that were released with your driver version
[10:27:46] <mru> it baffles me how that crap could be mainlined
[10:28:07] <KotH> mru: someone got either deceived or bribed
[10:28:12] <ohsix> kernel doesn't maintain the perf interface; they ship the tool
[10:28:17] <mru> KotH: or both
[10:28:19] <ohsix> its not just alsa
[10:28:30] <av500> most kernel apis are sane and documented
[10:28:51] <ohsix> but essentially alsa is out of tree until its checkpointed; then old drivers are in tree
[10:29:09] <av500> out of tree?
[10:29:13] <mru> kernel features that have exactly one userspace tool often change incompatibly
[10:29:20] <mru> oprofile and such
[10:29:39] <ohsix> its a red herring asking for something to be documented that you're supposed to use through a tool or other defined interface, driver maintainers are the only ones that should rightly care
[10:29:52] <mru> oprofile is still documented
[10:29:55] <ohsix> well, you can read all about oprofile
[10:30:04] <ohsix> and the kernel guys don't like it
[10:30:20] <mru> there are aspects of it I don't like either
[10:30:25] <ohsix> av500: ya, make M=
[10:30:37] <ohsix> i like perf a lot
[10:30:58] <KotH> ohsix: why is it a red herring?
[10:30:59] <av500> ohsix: I know what oot building is
[10:31:05] <KotH> ohsix: it is a kernel interface after all
[10:31:10] <av500> +1
[10:31:17] <KotH> ohsix: heck, even kernel internal interfaces are documented
[10:31:27] <ohsix> KotH: there isn't an abi rev anywhere in the interface? to the very least asound might say "kernel too old"?
[10:31:39] <av500> KotH: it just shows that the interface is not good if it needs to be changed so often
[10:31:46] <CIA-98> ffmpeg: lucabe * r23638 /trunk/libavformat/rtpenc.c:
[10:31:47] <CIA-98> ffmpeg: Simplify (no need to check for st->codec->extradata) and correct
[10:31:47] <CIA-98> ffmpeg: (extradata_size must be at least 5 bytes) the H.264 MP4 syntax check
[10:31:47] <CIA-98> ffmpeg: in rtpenc.c
[10:32:02] <KotH> ohsix: no, it fails.. somewhere...
[10:32:05] <ohsix> that isn't a universal truth when you're talking about the kernel
[10:32:13] <KotH> ohsix: beside, there is also the case that the kernel is _too_new_
[10:32:39] <ohsix> well if you ever get to digging into it again let me know
[10:32:58] <mru> and often upgrading alsalib breaks apps
[10:33:05] <KotH> ohsix: other example: why does the oss emulation of alsa work better than alsa native?
[10:33:15] <mru> that's hilarious
[10:33:34] <ohsix> does it?
[10:33:42] <KotH> yes
[10:33:42] <mru> frequently
[10:34:00] <ohsix> i'd have to look at what the alsa client is doing for a fair comparison
[10:34:04] <KotH> ohsix: on my laptop (a t42), alsa didnt work at all when i first installed it in 2004
[10:34:10] <KotH> ohsix: oss emu worked fine though
[10:34:18] <ohsix> nice
[10:34:45] <ohsix> it sounds more like a ctl/amp node problem than an abi problem
[10:34:59] <KotH> ohsix: when you listen to people in #mplayer, there is at least once a month someone complaining about stuttery sound or a/v sync issues which are magically solved by using the oss emu instead of alsa
[10:35:14] <ohsix> oss can work when the elements for an app to use a device correctly from an alsa perspective aren't present
[10:35:29] <av500> KotH: isnt OSS emu that "sane" subset of alsa :)
[10:35:48] <ohsix> well i'm not a superstitious person; i'd find out what their problem was
[10:36:08] <mru> ohsix: we've answered that already: alsa
[10:36:21] <ohsix> a lot of apps break when they write & alsa doesn't block them, since they don't do their own sample rate timing
[10:36:22] <av500> ohsix: "when the elements for an app to use a device correctly from an alsa perspective aren't present" explain
[10:36:56] <ohsix> mru: that means nothing without knowing their distro and how they may have mutilated their config files
[10:37:01] <mru> it's also highly frustrating that alsalib can behave very differently depending on the driver in use
[10:37:24] <mru> ohsix: the problem of alsa in general is alsa
[10:37:30] <av500> ohsix: but how do mutilated config files work in OSS emu?
[10:37:48] <lu_zero> and then you get pulse
[10:37:54] <elenril> mru: write a ffaudioframework
[10:38:05] <ohsix> av500: internal nodes and controls in something like ac97 or hda need to be adjusted for output to actually work, if your driver doesn't have the knob that gets the input anywhere near the output then it wont
[10:38:11] <lu_zero> (currently reaching segfault in libalsa)
[10:38:35] <pross-au> like linux needs yet another audio framework
[10:38:53] <ohsix> av500: they don't; you can look what oss mode does to drivers, its incumbent on software actually being properly written with alsa
[10:38:53] <mru> pross-au: it'll keep needing them until someone creates one that works
[10:38:54] <av500> elenril: at work we had to do that because pulse would not work on an arm9 3ys ago...
[10:39:06] <lu_zero> pross-au: currently pulse is just brain damaged about the idea of "not a system wide daemon because I'm narrow minded"
[10:39:19] <pross-au> mru: oss
[10:39:20] <elenril> av500: "pulse would not work" sounds normal
[10:39:31] <mru> pross-au: oss has its drawbacks
[10:39:34] <av500> ok, so it was not us :)
[10:39:39] <ohsix> the root problem is you can open an alsa device and play samples, and not get sound, thats not alsas fault
[10:39:40] <av500> oss4 ftw!
[10:39:49] <mru> it's hard to get accurate sync with oss
[10:39:52] <lu_zero> oss4 seems fragile...
[10:40:01] <pross-au> oh did not know that
[10:40:03] <av500> mru: worked fine for us in the past...
[10:40:19] <mru> maybe for you (your customers)
[10:40:29] <ohsix> lu_zero: its not narrow minded; you can't have users loading modules if it doesn't run as them, and it handles consolekit handoffs
[10:41:10] <mru> any mention of *kit automatically reduces your credibility in my eyes
[10:41:25] <ohsix> mru has a point about oss; a broke app will be broke, and oss can't always block the app to keep it delivering samples at somewhat the right rate
[10:41:44] <ohsix> heh, all it does is apply acls when the owners change
[10:41:47] <mru> I'm not talking about blocking vs nonblocking writes
[10:41:56] <ohsix> i'm not either
[10:42:11] <mru> sure sounds like it
[10:42:14] <ohsix> talking about bad apps that happen to work
[10:42:44] <pross-au> html5 audio
[10:42:50] <mru> with a sane api it would be trivial to simply open the device and play some samples
[10:42:53] <mru> with alsa it is not
[10:42:54] <ohsix> the pcm cursors and wakeups are completely arbitrary between devices, its not an oss or alsa problem
[10:43:21] <ohsix> well alsa isn't for that :P doesn't diminish what its there to do
[10:43:43] <ohsix> you can use libao or portaudio if thats what you are looking for
[10:43:50] <mru> blegh
[10:44:02] <mru> I want _fewer_ wrappers, not _more_
[10:44:11] <elenril> wrappers all the way down!
[10:44:17] <elenril> this is the only future
[10:44:36] <ohsix> you just need one; one that is a library, a system call interface is hard to virtualize
[10:44:46] <mru> eh what?
[10:45:00] <mru> I'd be happy to mmap the control registers :-)
[10:45:22] <ohsix> but saying you want fewer wrappers to also say that something should work how you want, not how it does is neither here nor there
[10:45:49] <mru> the base interface should expose the hardware as is
[10:46:02] <ohsix> thats what alsa does
[10:46:04] <mru> no
[10:46:18] <mru> alsa does all manner of mutilation between you and the hardware
[10:46:23] <mru> resampling etc
[10:46:27] <ohsix> yea, you open hw: you get it
[10:46:37] <ohsix> no configs or plugins involved
[10:46:39] <mru> still too many layers
[10:46:59] <ohsix> if you open hw:, there are none but the link to asound
[10:47:12] <mru> still too much cruft
[10:47:54] <ohsix> the point is you can use other labels and have your integrator give you a uniform interface; like a label surround51, or spdif that do what the user expects
[10:48:18] <mru> who do you think "your integrator" is?
[10:48:37] <mru> this "someone else will fix it" attitude is most disturbing
[10:48:40] <ohsix> the person writing the software you use and put it together
[10:48:49] <ohsix> i fix my problems; thanks
[10:48:55] <mru> look dude, I'M WRITING THE SOFTWARE
[10:49:03] <ohsix> but my grandmmother need not know or care
[10:49:16] <mru> I'm not talking about your grandmother
[10:49:51] <ohsix> and if i am? her integrator solved such things so she doesn't have to
[10:50:12] <mru> why not try to make the integrator's life easy?
[10:50:23] <mru> such as by making things work sanely in the first place
[10:50:37] <ohsix> because its not an easy job; and people need to know what they're doing anyways
[10:50:49] <ohsix> if they don't they shouldn't be doing it
[10:50:55] <mru> you're describing how it is, not how it should be
[10:51:04] <ohsix> it works quite sanely
[10:51:15] <kshishkov> open("/dev/dsp", O_WRONLY); ioctl(dspfd, SET_SAMPLING_RATE, &rate); ...
[10:51:28] <ohsix> and you are saying how its bad without proposing anything, whats better?
[10:52:05] <elenril> kshishkov: yeah, let's add moar ioctls
[10:52:05] <mru> even v4l2 is better in some ways
[10:52:12] <ohsix> kshishkov: opening /dev/dsp doesn't make proper audio software any more than -lasound2 does
[10:52:20] <mru> at least it lets you query the actual capabilities of the hardware
[10:52:56] <ohsix> kshishkov: but what you will find; in people using oss _and_ alsa, that they left the "proper" part out anyways
[10:53:19] <ohsix> and why not???// if it works on their machine :]
[10:53:45] <ohsix> mru: alsa does too
[10:54:06] <ohsix> open hw: and you'll be talking to the device (pcm and ctl)
[10:54:29] <ohsix> but opening hw: is bad juju
[10:54:52] <mru> ok, so how do I query the supported sampling rates?
[10:54:59] <mru> and the buffers size
[10:55:20] <ohsix> all you need care about is if the device you're talking to is reporting information you need in a reasonably correct manner; if it is hw:, default:, or pulse: or surround51
[10:55:42] <ohsix> well i'd point you at amixer
[10:56:26] <ohsix> i could find the documentation but thats more work than i care for at the moment, and amixer will display everything that can be read
[10:56:59] <ohsix> "amixer contents"
[10:57:06] <mru> amixer is _not_ the solution
[10:57:14] <mru> I want a C interface to ask for the info
[10:57:17] <mru> there isn't one
[10:57:37] <ohsix> then amixer must be magic
[10:58:13] <mru> where does amixer report all supported sampling rates for a device?
[10:58:24] <mru> and min/max buffer size
[10:58:28] <mru> interrupt rate?
[10:58:41] <mru> supported channels?
[10:58:47] <mru> sample formats?
[10:58:56] <mru> amixer controls the mixer settings
[10:59:00] <mru> totally different thing
[10:59:08] <ohsix> http://git.alsa-project.org/?p=alsa-utils.git;a=blob;f=amixer/amixer.c;h=c9ea57204f09a2021dc843cb3a1c113e0876e635;hb=HEAD#l679
[10:59:33] <mru> that's enumerating the mixer controls
[10:59:41] <mru> nothing to do with what I said
[10:59:42] <ohsix> oh, ok then i misunderstood; that will only show the knobs, moment
[10:59:54] <mru> I'm not talking about the goddamn mixer
[10:59:58] <mru> I don't care about the mixer
[11:00:41] <CIA-98> ffmpeg: maxim * r23639 /trunk/libavformat/ (Makefile oma.c): Add metadata support. Patch by Michael Karcher.
[11:01:13] <ohsix> before i look; is the "interrupt rate" something you really need to write proper audio software? there are 14mhz interval timers available on pc's since 2004
[11:01:26] <mru> yes
[11:01:38] <mru> what alsa calls "period"
[11:01:58] <ohsix> but you know the sample rate; you just want to set/know the watermark
[11:01:59] <mru> the sound hw generates an interrupt every N bytes in the buffer
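A rough illustration of the "period" mru describes: if the hardware interrupts once per N bytes of interleaved PCM, the time between interrupts follows from the frame size and the sample rate. This is only a sketch; the helper name `period_interval_us` is made up for illustration and is not an ALSA call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: microseconds between two period interrupts when the
 * hardware raises one every period_bytes bytes of interleaved PCM. */
static uint64_t period_interval_us(uint32_t period_bytes, uint32_t rate_hz,
                                   uint32_t channels, uint32_t bytes_per_sample)
{
    assert(rate_hz > 0 && channels > 0 && bytes_per_sample > 0);
    uint32_t frame_bytes = channels * bytes_per_sample;
    uint64_t frames = period_bytes / frame_bytes;  /* whole frames per period */
    return frames * 1000000ULL / rate_hz;          /* truncated toward zero */
}
```

For example, a 4096-byte period of 16-bit stereo at 48 kHz is 1024 frames, so roughly one interrupt every 21 ms.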
[11:02:06] <ohsix> right
[11:02:08] <mru> watermark is not the same
[11:02:22] <ohsix> but those interrupts are not reliable
[11:02:28] <mru> wtf?
[11:02:31] <mru> they are ESSENTIAL
[11:02:38] <ohsix> are ... you for real
[11:02:48] <mru> more real than you could ever imagine
[11:03:23] <lu_zero> (and scary)
[11:03:29] <ohsix> you know and are tracking the output rate; you know where the cursor is better than the device can, and you can handle it a lot sooner than waiting for a wakeup
[11:03:56] <mru> you obviously have no clue about how audio hardware works
[11:04:02] <ohsix> heh
[11:04:07] <mru> that interrupt _is_ your "cursor"
[11:04:10] <ohsix> i do
[11:04:43] <mru> and setting the right interval is important
[11:04:51] <ohsix> interrupts are subject to the OS and software environment; why would you use them when your os & app can wake up on an itimer at just the right time?
[11:05:06] <mru> because the itimer isn't controlled by the sound card, silly
[11:05:40] <mru> so once again, you've obviously never had to deal with things like a/v sync
[11:05:44] <ohsix> do you derive a real sample clock in your app or assume it follows your sample rate with some low error in ppm?
[11:05:56] <mru> eh?
[11:06:14] <mru> the DAC clock is all that matters
[11:06:17] <ohsix> i've done enough real time software to know how to reify disparate and inaccurate clocks :]
[11:06:19] <mru> even if it's off by 10%
[11:06:30] <mru> there is only one clock
[11:06:47] <ohsix> if its off by 10% and the app is playing 44.1khz in _real time_, what happens on an xrun?
[11:07:00] <mru> real time is whatever the dac clock says
[11:07:07] <ohsix> ehh
[11:07:23] <mru> since pc hardware doesn't let me control that clock
[11:07:29] <ohsix> real time is wall time since my video clock is nonintegral
[11:07:49] <mru> I repeat, you don't know what you're talking about
[11:07:57] <ohsix> and my 14mhz interval timers will wake me up on happy buffer time
[11:08:12] <mru> say hello to Mr Drift
[11:08:29] <ohsix> alright; suppose you have your dac clock and its off by 10%
[11:08:44] <mru> music will sound horrible, sure
[11:09:12] <ohsix> say you have another one and its off by 10% too; and you need to output with some degree of signal jitter to both the best you can
[11:09:21] <mru> what "another one"?
[11:09:46] <ohsix> another device, another time domain (but given the situation, an audio device is fine)
[11:09:55] <mru> then you have a problem
[11:10:07] <mru> if you're dealing with multiple audio devices, you really need some way to sync their clocks
[11:10:10] <ohsix> how do you make it not a problem?
[11:10:16] <ohsix> right!
[11:10:26] <mru> which pc hardware doesn't let you
[11:10:38] <mru> I'm talking about playing audio and video in sync
[11:10:39] <ohsix> and say you have a clock thats high resolution and has low ppm error rate
[11:10:45] <ohsix> i know
[11:10:55] <ohsix> i'm talking about real time software
[11:11:04] <mru> the sound will play at whatever rate it chooses, and I can't do a thing about it
[11:11:06] <ohsix> of which multimedia is a subset
[11:11:25] <mru> to maintain sync, I must use this as the master time and display video frames accordingly
[11:11:38] <mru> regardless of wallclock time
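The "audio as master clock" approach mru outlines can be sketched as follows: playback time is defined purely by how many frames the DAC has consumed, and a video frame is shown once that clock reaches its pts. The function names and the idea of passing in a `frames_played` counter are assumptions made for illustration, not anything taken from FFmpeg or ALSA.

```c
#include <assert.h>
#include <stdint.h>

/* Audio clock in microseconds, derived only from the number of frames the
 * DAC has consumed. Wall-clock time is deliberately not consulted. */
static int64_t audio_clock_us(int64_t frames_played, int32_t nominal_rate_hz)
{
    assert(nominal_rate_hz > 0 && frames_played >= 0);
    return frames_played * 1000000LL / nominal_rate_hz;
}

/* A video frame is due once the audio clock has reached its pts. */
static int frame_is_due(int64_t frame_pts_us, int64_t frames_played,
                        int32_t nominal_rate_hz)
{
    return audio_clock_us(frames_played, nominal_rate_hz) >= frame_pts_us;
}
```

If the DAC really runs 10% fast, the video simply follows it, which is the point being made: sync is preserved even though both streams are off against the wall clock.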
[11:11:43] <ohsix> well you can, unsynced sample clocks will only impact jitter to a degree that you can measure and minimize
[11:12:07] <mru> independent, freerunning clocks will drift apart sooner or later
[11:12:20] <ohsix> the output might differ but you can make it such that by your contrived wall time it only amounts to jitter
[11:12:27] <ohsix> ya they will
[11:12:57] <mru> so without a way to tie them together, you have an unsolvable problem
[11:13:05] <mru> and discussing it further is pointless
[11:13:14] <ohsix> thats why you track and smooth them so you know their real rate if you need to reason about them in terms of how much they drift with relation to eachother
[11:13:38] <mru> but that's utterly irrelevant to this discussion
[11:13:48] <ohsix> heh its not an unsolvable problem; its deciding how much jitter is acceptable and factoring it out
[11:13:48] <mru> which was about writing a video player using alsa
[11:14:08] <mru> jitter is manageable, uncontrollable drift is not
[11:14:14] <ohsix> its relevant; the video is its own clock domain; as is the audio
[11:14:35] <ohsix> yes but you have a higher resolution clock to resolve the drift
[11:14:35] <lu_zero> ohsix: ...
[11:15:04] <ohsix> there will be a period where you can duplicate frames to keep them from drifting too much (like jitter)
[11:15:05] <mru> with a 60Hz vsync you'll obviously have some jitter in the video display
[11:15:37] <ohsix> as a timebase vsync isn't very useful :]
[11:15:58] <mru> if the video clock runs faster than the audio clock, a frame will be displayed one vsync interval longer once in a while
[11:16:22] <mru> if it runs slow, you'll be one vsync short from time to time
[11:16:25] <mru> unavoidable
[11:16:35] <ohsix> indeed, most overlay engines optionally sync on vsync
[11:17:04] <mru> that's irrelevant
[11:17:24] <mru> what's relevant is that the 15ms glitch isn't visible
[11:17:25] <ohsix> but you aren't reifying the 60hz time domain with the 48khz one; you're clocking the wall time rate of 23.4543whatever fps with the sample clock rate
[11:17:45] <ohsix> it was as relevant as mentioning vsync in the first place; but lets not digress
[11:18:09] <mru> digress is all you do
[11:18:53] <ohsix> sometimes your video presentation rate will be off, but you can also schedule early frame delivery if you think one is going to cross a sync interval; then you will be a partial frame ahead instead of a whole frame behind
[11:19:18] <mru> the end result is exactly the same
[11:20:04] <ohsix> not really; you can't split a frame
[11:20:09] <mru> playing 24fps video on a 60Hz display has a bit of jitter
[11:20:13] <mru> there's no way around it
[11:20:31] <ohsix> but if you do it early you can minimize the error
[11:20:36] <mru> not necessary
[11:20:46] <mru> the error is at most 15ms
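The 24 fps on 60 Hz case can be worked through with integer arithmetic. The sketch below assumes a frame is shown on the first vsync at or after its ideal time, so the error is always less than one refresh interval (about 16.7 ms at 60 Hz, in the same ballpark as the 15 ms quoted above). Both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the first vsync at or after the ideal presentation time of frame n.
 * Frame n is due at n/fps seconds; vsync k happens at k/refresh_hz seconds. */
static int64_t vsync_for_frame(int64_t n, int32_t fps, int32_t refresh_hz)
{
    assert(n >= 0 && fps > 0 && refresh_hz > 0);
    return (n * refresh_hz + fps - 1) / fps;  /* ceil(n * refresh_hz / fps) */
}

/* Number of vsync intervals for which frame n stays on screen. */
static int64_t vsyncs_shown(int64_t n, int32_t fps, int32_t refresh_hz)
{
    return vsync_for_frame(n + 1, fps, refresh_hz)
         - vsync_for_frame(n, fps, refresh_hz);
}
```

With fps = 24 and refresh_hz = 60 the frames land on vsyncs 0, 3, 5, 8, 10 and so on, which gives the familiar alternating 3, 2, 3, 2 display pattern.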
[11:21:31] <mru> but that's a separate problem
[11:21:37] <mru> has nothing to do with the audio clock
[11:21:43] <ohsix> but you could schedule early, why not?
[11:22:10] <ohsix> i tried to make a larger point about clock domains and real time software ... :[
[11:22:13] <mru> sure, subtract a constant from all pts, job done
[11:22:36] <ohsix> theres a reason the magic juice is lacking from a lot of stuff on linux that makes sounds
[11:22:39] <mru> the discussion was about how alsa sucks
[11:22:54] <ohsix> eh
[11:23:01] <mru> minimising video jitter has nothing to do with alsa suckage
[11:23:45] <kierank> wow, you've been going for an hour
[11:24:54] <ohsix> i thought this was the discussion; 03:39 <@mru:#ffmpeg-devel> it's hard to get accurate sync with oss
[11:25:17] <ohsix> it certainly went on how oss wasn't alone in that regard
[11:25:52] <mru> to sync something against the audio clock, you need some way to read it
[11:25:56] <mru> oss doesn't offer that
[11:25:57] <ohsix> minimizing error, not jitter; like you said, 24hz != 60hz, you'll get jitter
[11:26:39] <ohsix> but you know your sample rate is 44.1khz, why do you need to know how fast the cursor is going (assume for a moment that you can actually do that)
[11:26:50] <mru> the alsa timer interface gives a reasonably accurate notification each time a "period" is crossed
[11:27:12] <ohsix> it is still unreliable; and bodged in many drivers
[11:28:15] <mru> which is one reason alsa sucks
[11:28:18] <ohsix> the important parameters you need to know, neither alsa nor oss will tell you; you need to measure the sample clock (for jitter and error)
[11:28:29] <mru> untrue
[11:28:44] <ohsix> alsa is reporting what the hardware tells it
[11:28:47] <mru> the sound clock is the reference, so there's no need to measure it
[11:28:59] <KotH> are you still at it?
[11:29:16] <ohsix> if the watermark is invalid or the interrupt is delivered late (by some period you'd need to know) that is the card
[11:29:30] <mru> watermark is irrelevant
[11:29:41] <KotH> have you written anything worth knowing? or shall i just skip as a discussion between the master of trolls and a noob?
[11:29:53] <mru> KotH: nothing noteworthy
[11:29:58] <ohsix> in some respects oss is better in that regard; its not providing information you shouldn't know and is unreliable anyways
[11:30:21] <mru> it's only unreliable in certain reference frames
[11:30:38] <mru> nothing is unreliable with _itself_ as reference
[11:30:59] <ohsix> yes; if you can be sure the card is doing it correctly (though this is unspecified!) you might be able to use it as a time base
[11:31:33] <mru> I've been using the sound card as time base for many years
[11:31:36] <mru> works like a charm
[11:31:40] <ohsix> and some cards synthesize those events from kernel buffers; they work differently as well
[11:31:50] <ohsix> until it doesn't :P
[11:32:03] <mru> you could say that about anything
[11:32:41] <ohsix> clocks have a quality measure (Q) and if you're going to pick one you need to either knock one into shape and know why a clock is bad, or derive a locked loop with respect to another clock that you know the Q on
[11:32:51] <KotH> ohsix: if you want to know how high precision+accuracy synchronisation of clocks is done, join the time-nuts mailinglist
[11:32:52] <mru> and once again, with a 60Hz display, pursuing jitter less than 15ms is pointless
[11:33:08] <ohsix> but in audio and video timeframes, 14mhz is enough
[11:33:14] <mru> KotH: this is not about synchronising clocks
[11:33:21] <ohsix> we're talking about reclocking audio, not video
[11:33:25] <mru> eh, no
[11:33:41] <KotH> ohsix: there are people who sync their sound cards to Cs frequency standards for high precision measurement of audio signals :)
[11:33:55] <ohsix> minimizing visual artifacts that are unpleasing to the eye is completely different than minimizing artifacts that are unpleasing to the ear
[11:33:59] <mru> KotH: I bet that requires some rather specialised sound cards
[11:34:04] <KotH> mru: nope
[11:34:20] <KotH> mru: just one where you can desolder the crystal
[11:34:21] <mru> I've never seen a sound card with external clock input
[11:34:26] <mru> oh, soldering...
[11:34:27] <ohsix> you can measure and tune any old pci sound card
[11:34:38] <ohsix> and add external syncs and stuff
[11:34:46] <mru> I have piles of old sound cards...
[11:34:56] <KotH> ohsix: measuring and tuning wont help if you want to go better than 10^-5
[11:34:57] <mru> and some old gps receivers
[11:35:05] <av500> 4) profit
[11:35:09] <KotH> ohsix: and these guys are doing 10^-10 measurements
[11:35:49] <mru> KotH: these the guys who carry atomic clocks to mountaintops just for fun?
[11:35:52] <ohsix> but anyways; an interval timer can wake up your app to write to your pcm just in time (which you can measure due to circumstance in your app preparing sound and the system status), you can literally be writing at the exact moment you cross the water mark, down to 1 or 2 samples; you cannot do that with timing information from the sound card
[11:35:55] <KotH> mru: exactly those
[11:36:00] <mru> ohsix: DRIFT!!!!
[11:36:02] <ohsix> ya i know
[11:36:19] <mru> if you do indeed know, you're sure as hell not acting on that knowledge
[11:36:34] <ohsix> mru: drift, constant or varying? varying to what degree? drift is real, yes; but you have one clock whose frequency is so much higher than the time domain you're interested in
[11:36:51] <mru> I'm not trying to get low latency
[11:36:55] <mru> that's a different problem
[11:37:06] <KotH> ohsix: a normal crystal has 3 types of drifts: temperature, age, movement
[11:37:14] <ohsix> ah well if you know how to say it like that then you already know what i'm explaining
[11:37:25] <ohsix> but i propose doing it "right" is always doing it real time
[11:37:27] <mru> KotH: the issue here is drift between the DAC clock and other clocks in the system
[11:37:35] <KotH> ohsix: and a few types of noise sources: intrinsic, circuitry, temperature, vibration,....
[11:37:38] <mru> such as the main system clock and the vsync clock
[11:38:03] <KotH> lol
[11:38:18] <ohsix> KotH: i know, but consider the domain you're working in; those are all big huge large changes that would indeed affect your output if you weren't tuning it in real time for drift as you measured it
[11:38:19] <mru> if you have a freerunning DAC and time your writes based on another clock, you'll overflow or underflow eventually
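To put a number on "eventually": if the writer is paced by a clock that differs from the DAC clock by some parts per million, the buffer fill level drifts by rate * ppm / 1e6 frames every second, and the available slack runs out after a predictable time. A back-of-the-envelope sketch, assuming a constant relative error; the function name is hypothetical.

```c
#include <assert.h>

/* Seconds until a free-running DAC has drifted slack_frames away from a
 * writer paced by a different clock, given their relative error in ppm. */
static double seconds_until_xrun(double slack_frames, double rate_hz,
                                 double drift_ppm)
{
    assert(slack_frames > 0 && rate_hz > 0 && drift_ppm > 0);
    double drift_frames_per_sec = rate_hz * drift_ppm / 1e6;
    return slack_frames / drift_frames_per_sec;
}
```

With 4800 frames of slack at 48 kHz and a 100 ppm mismatch, the drift is 4.8 frames per second, so the overflow or underflow arrives after about 1000 seconds.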
[11:38:36] <KotH> mru: so it's about perfectly syncing audio to video?
[11:38:38] <mru> a clock has no drift relative to itself
[11:38:55] <ohsix> not so, you can query for cursor position and know with some reliability where it has been in the past; it is different from having the same information wake you up to deliver samples
[11:38:56] <mru> and 10ms jitter in video is not a problem
[11:39:50] * KotH just drops in the piece of information that most clocks in a pc are in the quality ball park of an R-C oscillator
[11:39:51] <mru> I never mentioned any detail about how I'd be using those interrupts
[11:40:06] <mru> KotH: that's because many of them _are_ RC oscillators
[11:40:14] <KotH> mru: actually no
[11:40:36] <KotH> mru: most of them are low quality crystals with high jitter/noise circuitry
[11:40:38] <ohsix> the problem with cursors is waking up on them or using them for timing, since all cards vary in their misfunctionality in how they do it, however when you query the position you know that it is sometime in the past, and that while there may be jitter it will always be somewhat in the past, and you can filter the jitter, so you have a stable input into your locked loop
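One simple way to "filter the jitter" in polled cursor readings is a first-order low-pass (exponential smoothing) on the measured rate. This is only a sketch of that one idea, not a full locked loop, and the `rate_filter` type, its fields and the choice of smoothing factor are all assumptions for illustration.

```c
#include <assert.h>

/* Smoothed estimate of the device's real sample rate. alpha in (0, 1]
 * controls how quickly new measurements replace the old estimate. */
typedef struct {
    double rate_hz;
    double alpha;
} rate_filter;

/* Feed one measurement: the cursor advanced dframes frames during dt_sec
 * seconds of the reference clock. */
static void rate_filter_update(rate_filter *f, double dt_sec, double dframes)
{
    assert(f && dt_sec > 0 && f->alpha > 0 && f->alpha <= 1);
    double measured = dframes / dt_sec;
    f->rate_hz += f->alpha * (measured - f->rate_hz);
}
```

A small alpha suppresses jitter in individual readings at the cost of reacting slowly to real changes in the clock; whether that trade-off is acceptable is exactly what is being argued about here.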
[11:41:02] <av500> err
[11:41:19] <mru> and how the fuck would you know where you are without the interrupts?
[11:41:19] <KotH> av500: speak your mind, cause i dont dare to read that last utterance
[11:41:30] <ohsix> you read it
[11:41:55] <mru> from what?
[11:42:02] <ohsix> theres no reason for the card to wake anyone up for you to read where it is; theres a big difference between waking up on it and using it as an error input
[11:42:23] <ohsix> the pci config space most likely; i imagine some drivers shadow what they've been told in an interrupt context as well
[11:42:27] <mru> I never said anything about waking up
[11:42:45] <mru> and pci config space is not the answer
[11:42:50] <mru> do you know anything about pci?
[11:42:56] <KotH> er...
[11:43:04] <KotH> pci config space is definitely not an answer to anything
[11:43:06] <ohsix> well not the config space; you know what i mean
[11:43:14] <mru> not really
[11:43:19] <ohsix> somewhere bar+n
[11:43:25] <mru> you're so misinformed it's hard to predict what you might mean
[11:43:40] <mru> who said this was a pci device?
[11:43:51] <ohsix> must reflect poorly on you when i said we were talking about the same thing :]
[11:43:55] <ohsix> nobody
[11:44:03] <mru> and we're in userspace here
[11:44:16] <mru> so don't make that assumption
[11:44:24] <ohsix> it just so happens that any given pci device is probably good enough for discussion; making up contrived devices just to put one over on someone is antithetical
[11:44:31] * KotH thought we are in cyberspace
[11:44:40] <mru> I'm not making anything up
[11:44:47] <mru> just pick your favourite SoC
[11:45:01] <KotH> ac97 or I2S?
[11:45:16] <ohsix> are you asking me to detail how you might get said information from alsa?
[11:46:01] <ohsix> i'm not sure where the disconnect is; i mentioned that it might be in the pci space for the device, or the driver might shadow it from interrupt context, you asked where the information might be
[11:47:17] <ohsix> and overall its quite a departure from me showing you where to get the sample rate and whatnot from alsa
[11:49:17] <mru> which you still haven't done
[11:49:38] <ohsix> ya, gonna do that as soon as i'm done typing
[11:49:51] <mru> as usual you dragged off on a tangent in an attempt to avoid admitting you don't know the answer
[11:50:01] <ohsix> i asked about how you would like to know some info that was marginally useless anyways; and i don't know if alsa reports, and it kind of spun out
[11:51:41] <ohsix> i know where you can get all that information from proc; but that isn't what you're looking for :]
[11:52:20] <mru> that info is nowhere in proc
[11:54:43] <mru> it's obviously available somewhere in the drivers
[11:54:44] <ohsix> alsa is kind of incomprehensible, it would be nice if they had a block diagram and how the configuration files and labels came into play
[11:54:47] <mru> so why can't I query it?
[11:54:59] <ohsix> i duno; i'm looking
[11:55:03] <mru> "alsa is kind of incomprehensible"  <--- *that* is the problem
[11:55:22] <ohsix> well i understand it enough; but people don't put in the effort when they write clients
[11:55:30] <av500> that is the issue
[11:55:37] <av500> if the api allows that, it is bad
[11:55:38] <mru> the effort is too great
[11:55:45] <ohsix> yea i get that
[11:55:53] <ohsix> but what it does is complex; but conceptually simple
[11:56:08] <ohsix> fwiw if any of you have worked with the Maya plugin api; it would be cool if alsa worked like that
[11:56:22] <mru> have you ever looked at the specs for an actual sound card?
[11:56:25] <ohsix> with binding function sets and facets/aspects of the object you're working with
[11:56:27] <mru> they're much simpler than alsa
[11:56:30] <ohsix> sure?
[11:56:59] <ohsix> well if you're tweaking registers sure; but how do you embody those registers for general use by different software and on different cards
[11:57:27] <mru> they all support the same basic settings
[11:57:30] <ohsix> you have abstractions like switches; a filter graph for ac97/hda that can be manipulated, its complex to present a uniform interface
[11:57:50] <ohsix> yea, but one of the big arguments against oss and for alsa back in the day was that people wanted to use their hardware
[11:58:04] <mru> alsa policy seems to be to make everything as complicated as the most complex operation imaginable
[11:58:07] <ohsix> providing a normalized mixer interface on top of oss is tough stuff
[11:58:20] <mru> I'm not talking about mixers
[11:58:38] <ohsix> i know; mixers are just one problem that alsa purports to solve with all the different control objects
[11:58:51] <ohsix> and a situation most users of oss would probably understand ...
[11:59:00] <ohsix> if i'm being patronizing just say so
[11:59:08] <mru> the mixer interface is horrible in both alsa and oss
[11:59:14] <mru> just in different ways
[11:59:36] <ohsix> well what you see in alsa is a best attempt at normalizing some stuff that controls a "ctl" often not even a hardware ctl
[11:59:50] <mru> but forget about the mixer
[11:59:56] <ohsix> but at best you get mute switches and source/sink switches with alsa
[12:00:05] <mru> a normal playback app has no business messing with mixers anyway
[12:00:50] <ohsix> with ac97/hda the pin complexes form a graph instead of a flat interface with knobs; and part of getting it to work in a new driver is connecting new elements up in there, the same attenuator/switch settings are nodes in that graph instead of the flat interface
[12:01:18] <av500> that is not why alsa apps struggle
[12:01:25] <av500> most would not even care
[12:01:44] <av500> i agree there might be complexity
[12:01:46] <ohsix> i'm just saying, it bears out all this complexity in a uniform, if obfuscated interface
[12:02:02] <mru> uniformly obfuscated
[12:02:10] <av500> obfuscatedly uniform
[12:02:31] <ohsix> with a normalized oss interface you'd have to do well to hide all those internal elements and not everyone would agree what the user edifice for that particular device is
[12:02:53] <mru> WE'RE NOT TALKING ABOUT MIXERS
[12:03:23] <mru> now please answer one simple question: how do I query the supported sample rates of a given hardware device?
[12:03:33] <av500> from webm: Whats the status of the 0.9.1 release? The just released ffmpeg-0.6 does not build against libvpx-0.9.0
[12:03:38] <ohsix> i was talking about the complexity, needed; but possibly presented poorly, i can't speak more plainly :[
[12:03:46] <ohsix> mru: still looking
[12:03:56] <av500> 0.6 is already outdated...
[12:04:07] <mru> ohsix: I know straight-forward speech is a challenge for you
[12:04:23] <ohsix> at least i manage without ad hominem :]
[12:04:30] <wbs> av500: isn't it the other way, ffmpeg doesn't build against the outdated 0.9.0 release of libvpx, since they changed stuff after the 0.9.0 release?
[12:04:35] <mru> that was just an observation
[12:04:38] <av500> wbs: no idea
[12:04:47] <mru> wbs: yes
[12:05:02] <ohsix> well i tend to keep observations that demean someone i'm trying to discuss something with to myself
[12:05:05] <av500> but ppl will blame 0.6
[12:05:12] <mru> ohsix: oh really...
[12:07:06] <Honoome> av500: people always blame ffmpeg… that's no news
[12:07:31] <Honoome> plus half the people out there will _never_ blame google: they are The Light… nothing else matters, no?
[12:07:44] <mru> not that simple
[12:07:56] <mru> until the vp8 release, google was Evil(tm)
[12:08:18] <Honoome> mru: no it wasn't, and of course it was FSF single-handedly that convinced Google to release it…
[12:08:37] <mru> eh, what about the streetview steals your soul stuff?
[12:08:45] <mru> and the wifi sniffing
[12:08:47] <av500> well, still might be worth to add that to 0.6 release notes....
[12:08:51] <mru> and the ad tracking
[12:09:03] <mru> and everything else they were lambasted for
[12:09:04] <av500> and keeping a copy of the WHOLE internet?
[12:09:04] <Honoome> mru: I'm being facetious here if you couldn't tell
[12:09:29] <mru> facetious is a good word
[12:09:39] <mru> it has all the vowels exactly once in alphabetical order
[12:09:56] <ohsix> mru: you can't read them directly; but you can exaustively try seeing if something is a valid hw configuration
[12:10:14] <av500> exaustively  is not good
[12:10:20] <av500> it misses an "o"
[12:10:25] <ohsix> kind of like PIXELFORMATDESCRIPTOR and stuff for opengl on windows
[12:10:31] <mru> ohsix: and that is precisely the reason why it sucks
[12:10:37] <ohsix> dunno
[12:11:03] <ohsix> you should just set the pcm you think your app wants; if it doesn't match hw: you cant open the device, but if you open plughw:hw, alsa will resample for you
[12:11:24] <mru> bad, bad, bad
[12:11:34] <mru> I don't want alsa's crappy resampler
[12:11:42] <ohsix> then do it yourself and open hw:
[12:11:47] <ohsix> or don't
[12:11:59] <mru> but how can I know what to resample if I can't query the supported rates?
[12:12:09] <ohsix> even if you knew the sample rate and stuff; you wouldn't know what configuration is valid until you tried to set it
[12:12:24] <ohsix> you try rates you are willing to resample to until hw: opens
[12:12:36] <mru> that's stupid
[12:12:50] <mru> there are thousands of possible rates
[12:13:07] <mru> anything from 8k to ~200k is reasonable to expect
[12:13:17] <ohsix> yea? and you pick one heh
[12:13:32] <mru> why can't I just get a list of ranges the hw supports?
[12:13:41] <ohsix> i'd pick 44.1, 48k, 96k, 192k
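The set-and-check approach ohsix describes amounts to walking a preference list and keeping the first rate the device accepts. The sketch below keeps the hardware behind a callback so it stays self-contained; `pick_first_supported_rate`, `rate_probe_fn` and the fake probe are hypothetical names. With alsa-lib the probe might wrap something like `snd_pcm_hw_params_test_rate()`, but that wiring is an assumption and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Returns non-zero if the device would accept rate_hz. */
typedef int (*rate_probe_fn)(unsigned rate_hz, void *ctx);

/* Return the first candidate the probe accepts, or 0 if none is accepted. */
static unsigned pick_first_supported_rate(const unsigned *candidates, size_t n,
                                          rate_probe_fn probe, void *ctx)
{
    assert(candidates && probe);
    for (size_t i = 0; i < n; i++)
        if (probe(candidates[i], ctx))
            return candidates[i];
    return 0;
}

/* Stand-in for real hardware: pretends the device only runs at 48 kHz
 * or 96 kHz. */
static int fake_probe_48k_96k(unsigned rate_hz, void *ctx)
{
    (void)ctx;
    return rate_hz == 48000 || rate_hz == 96000;
}
```

Called with the list 44100, 48000, 96000, 192000 and the fake probe, this settles on 48000. It also shows mru's objection: every rate that is not in the list is never discovered.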
[12:13:51] <mru> well, those are likely
[12:13:55] <ohsix> indeed
[12:14:02] <mru> but that's not the point
[12:14:06] <mru> it's a stupid interface
[12:14:07] <ohsix> 32khz as well
[12:14:10] <mru> and 24
[12:14:13] <mru> and 11.025
[12:14:16] <mru> and 8
[12:14:30] <mru> and 22.05
[12:14:33] <ohsix> i don't think so; i think it works well, you can't apply all the hw params over all sample rates for all devices
[12:14:33] <mru> etc, etc
[12:14:59] <ohsix> like on some card out there you'll have to lower the bit depth to do 192khz, how do you convey that
[12:15:10] <mru> ok, I'll rephrase: why is alsa hiding information from me?
[12:15:23] <ohsix> its not; you can see it in /proc/asound
[12:15:31] <mru> that's not a proper api
[12:15:44] <ohsix> what you cant see in /proc/asound, or in any library call; is which combination of parameters are valid; not all of them are
[12:16:01] <mru> yet alsa internally knows
[12:16:05] <mru> so why can't I find out?
[12:16:16] <ohsix> it often doesn't know until it tries to apply the configuration
[12:16:28] <ohsix> it knows lists of possible parameters, but not what combination of them is valid
[12:16:39] <mru> so why can't I get those lists?
[12:16:46] <mru> it would reduce the combinations for me to try
[12:17:05] <ohsix> you can, but you will still have to try and configure the pcm to know if the config is valid; and you're at no better a position than when you didn't know any of that information
[12:17:17] <mru> v4l2 can enumerate supported pixel formats and resolutions
[12:17:22] <mru> all with a simple ioctl
[12:17:46] <ohsix> v4l doesn't have to try a disparate set of configuration bits to see if it makes a valid hardware configuration, it can list them
[12:17:49] <mru> if I know the hw doesn't support some particular sample rate _at all_ there's no point trying that one
[12:17:57] <ohsix> right
[12:18:10] <mru> same for the other parameters
[12:18:39] <ohsix> but the point is your software should be able to handle a huge range of sample formats if its going to resample itself; you would pick what you prefer first
[12:19:11] <ohsix> sample formats/rates
[12:19:25] <mru> I prefer to not resample at all
[12:20:01] <ohsix> the order that i'd probably do it was to check the bit depth i want first; then the sampling rate, and i'd only have a few candidates to check; and thats if i wasn't just going to use 44.1khz and 16bit, and if i didn't mind plughw doing the connecting for me
[12:20:14] <ohsix> of course; then you'd just generate at the rate you find acceptable
[12:20:27] <ohsix> or the only rate possible for your app
[12:20:44] <ohsix> its not a significant restriction; and it easily covers invalid configurations cleanly
[12:20:55] <mru> I mind plughw, ok
[12:21:11] <ohsix> ok, people aren't opening hw: though, my grandma isn't
[12:21:27] <mru> your grandma is irrelevant
[12:21:33] <mru> she's not writing alsa apps
[12:21:35] <ohsix> i have one piece of software on my machine that might rightfully open hw: and mind plughw, and that's jack
[12:21:56] <mru> anything that cares about quality should stay well away from plughw
[12:22:07] <ohsix> quality is subjective
[12:22:21] <ohsix> resampling is objectively bad, for sure
[12:22:29] <ohsix> but the common case is there is no resampling
[12:22:40] <ohsix> and if plughw is in play its just shuffling stuff around
[12:22:59] <mru> resampling is always worse than not resampling
[12:23:13] <mru> and plughw resampling is worse than doing it yourself with a proper resampler
[12:23:27] <ohsix> yes, but whether the output is acceptable and fit for purpose is subjective
[12:23:45] <ohsix> i appreciate that you don't want to
[12:23:58] <ohsix> and thats a matter of just not opening plughw:hw, but opening hw:
[12:24:26] <ohsix> everything i use here commonly uses default: and there is no resampling
[12:26:01] <ohsix> the set-and-check api is cumbersome but it papers over well the fact that the expression space for the configuration can generate a _lot_ of invalid configurations
[12:26:17] <mru> I don't want problems papered-over
[12:26:20] <mru> I want them solved
[12:26:26] <ohsix> its not papered over
[12:26:30] <mru> you said so
[12:26:34] <ohsix> you try some parameters, you know if they are valid
[12:26:46] <ohsix> the interface simply makes it less ugly, thats all
[12:27:12] <ohsix> knowing all the knobs in the configuration space will not tell you which are valid until they are set; thats the disconnect that makes it a useful method
[12:27:36] <mru> knowing the valid values of each upfront drastically reduces the search space
[12:27:43] <ohsix> it might not even be known which configurations are valid
[12:27:45] <mru> particularly if I have additional constraints
[12:27:50] <ohsix> sure; but practically speaking, you do
[12:28:11] <ohsix> 24 bit, 16 bit, 32khz, 44.1, 48, 96, 192
[12:28:45] <ohsix> there are cards that'll do fractional sample rates too; they're represented by a range of possible sample rates instead of discrete steps
[12:28:53] <mru> yes
[12:28:57] <mru> so tell me what they are
[12:29:03] <ohsix> i think simply asking for what you want and likely receiving it is good enough
[12:29:14] <mru> I'm telling you it's not
[12:29:30] <mru> what I want depends on what I can have
[12:29:35] <ohsix> not all configurations are valid; and you don't know until they're set
[12:29:48] <mru> do you know why restaurants have a menu?
[12:29:49] <ohsix> you ask for the best that you can get first, why would you ask for any less?
[12:30:04] <mru> it's to save you time asking for dish after dish until you find one they'll cook
[12:30:07] <ohsix> restaurants don't sell a wide range of discrete numbers
[12:30:23] <ohsix> and i haven't been to one that wont cook off the menu
[12:30:27] <ohsix> heh
[12:30:31] <twnqx> Oo
[12:30:34] <twnqx> not off the menu?
[12:30:35] <mru> only with the ingredients they have
[12:30:41] <twnqx> even mcdonalds offers that
[12:31:01] <twnqx> "one hamburger without pickles please"
[12:31:06] <mru> any decent place will customise the food of course
[12:31:10] <ohsix> twnqx: i mean for like; asking for fried eggs at a place that doesn't serve breakfast
[12:31:32] <mru> if they have eggs they might do it
[12:31:38] <mru> if they don't have any eggs, tough
[12:31:43] <ohsix> they will do it, in my experience
[12:31:53] <mru> only if they have eggs
[12:32:17] <ohsix> mru: when i sit down the waitress doesn't exhaustively describe the ingredients they have available, and she doesn't say how the person cooking is willing or able to combine them
[12:32:18] <mru> (most places probably have eggs, but it serves as an example)
[12:32:23] <ohsix> the analogy is beyond strained
[12:32:45] <mru> of course she doesn't list all the ingredients available
[12:32:49] <ohsix> i ask for fried eggs regardless; she says if the cook can do it
[12:32:49] <mru> that's why you're given the menu
[12:33:16] <ohsix> the menu does describe valid combinations of food configuration; yes
[12:33:40] <ohsix> how would you express these valid combinations for hardware if you don't know they're valid until they're applied?
[12:34:15] <ohsix> and i'm not just talking about in alsa, i'm talking about the driver literally not knowing until the device gives them a go/no-go for validity
[12:35:10] <ohsix> perhaps overclocking is an appropriate analogy; you have a search space and you have no way to know if its valid for any configuration until you try it; and test it to see if it is fit for purpose
[12:35:28] <ohsix> but you do know that the default clocks will work
[12:35:47] <ohsix> 44.1 16bit is not unlike a default clock, and a good thing to try
[12:35:55] <mru> I still don't see why it's so damn hard to tell me what the supported sample rates are
[12:36:12] <ohsix> because it is not all the information you need to build a configuration
[12:36:21] <mru> no, but it's helpful
[12:36:22] <ohsix> take 192khz, say alsa told you the device could do it
[12:36:44] <ohsix> it doesn't tell you that its only valid with 16bit samples; but the device also does 24bit samples
[12:36:59] <ohsix> and indeed the driver may not know until it tries to set the configuration
[12:37:19] <mru> suppose instead the device only goes up to 96kHz
[12:37:25] <mru> then I'd know there's no point attempting 192
[12:37:47] <ohsix> sure; but software that needs 192khz is rare anyways
[12:38:01] <mru> ?????
[12:38:07] <ohsix> you might say for any given widget, trying anything other than 44.1 is silly
[12:38:16] <mru> why would I say that?
[12:38:34] <pross-au> thats just crap
[12:38:35] <ohsix> you just might; given you might not try 192khz anyways when the device only does 96khz
[12:38:58] <mru> is it just me, or is ohsix making even less sense than usual?
[12:39:10] <ohsix> but all you can do is ask if your configuration is valid; the driver may not know until it is applied, the driver has no a priori knowledge that can help you improve that process
[12:39:23] <mru> sure it does
[12:39:31] <ohsix> heh
[12:39:31] <pross-au> devices that have the capacity to list their supported sample rates, should be afforded an userland api to access those rates
[12:39:48] <mru> exactly
[12:39:50] <ohsix> pross-au: then it should tell you which configurations are possible, right?
[12:40:20] <ohsix> the problem there is the _driver may not know_, until the parameters are tried
[12:40:21] * mru is reminded of EDID
[12:40:22] <kshishkov> pross-au: there was such thing for OSS
[12:40:28] <pross-au> ohsix: yes
[12:40:43] <pross-au> ohsix: then have a flag to indicate that
[12:40:48] <mru> reducing the number of combinations to try is always a good thing
[12:40:59] <ohsix> so should the driver do the search for valid configurations, at boot time? instead of just trying to set the pcm config to what you're asking for?
[12:41:06] <ohsix> of course it is i'm not discounting that
[12:41:07] <mru> now suppose I want to maximise bitdepth at all cost
[12:41:19] <pross-au> ohsix: of course not
[12:41:23] <ohsix> but you most likely will try one config and it'll work
[12:41:24] <mru> then I'll query the supported bitdepths, choose the highest one, then search for a sample rate that supports it
[12:41:35] <mru> or similarly for any other parameter
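The strategy mru describes here (take the per-parameter lists, pick the highest bit depth, then look for a sample rate that works with it) could look roughly like the C sketch below. Everything in it is hypothetical: `struct pcm_caps`, `try_fn` and the mock device are illustrative stand-ins and not ALSA calls. The mock encodes ohsix's example of a card whose 192 kHz mode is only valid with 16-bit samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pcm_config { int bits; int rate_hz; };

/* Hypothetical capability lists, each sorted from most to least preferred. */
struct pcm_caps {
    const int *bits;  size_t n_bits;
    const int *rates; size_t n_rates;
};

/* Hypothetical "set and check" hook: true if the combination is accepted. */
typedef bool (*try_fn)(const struct pcm_config *cfg);

/* Walk bit depths first, then rates, counting how many probes were needed. */
bool choose_max_depth(const struct pcm_caps *caps, try_fn try_cfg,
                      struct pcm_config *out, int *probes)
{
    assert(caps && try_cfg && out && probes);
    *probes = 0;
    for (size_t b = 0; b < caps->n_bits; b++) {
        for (size_t r = 0; r < caps->n_rates; r++) {
            struct pcm_config cfg = { caps->bits[b], caps->rates[r] };
            (*probes)++;
            if (try_cfg(&cfg)) {
                *out = cfg;
                return true;
            }
        }
    }
    return false;
}

/* Mock device: 16 or 24 bit, up to 192 kHz, but 192 kHz only at 16 bit. */
bool mock_try(const struct pcm_config *cfg)
{
    if (cfg->bits != 16 && cfg->bits != 24)
        return false;
    if (cfg->rate_hz == 192000)
        return cfg->bits == 16;
    return cfg->rate_hz == 96000 || cfg->rate_hz == 48000 ||
           cfg->rate_hz == 44100;
}
```

With the lists `{24, 16}` and `{192000, 96000, 48000, 44100}` the mock settles on 24 bit at 96 kHz after two probes. The per-parameter lists do not say which combinations are valid, which is ohsix's point, but they do bound what is worth probing, which is mru's.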
[12:41:46] <ohsix> mru: right; that's what i'd do, and said as much earlier
[12:41:53] <mru> but that's impossible
[12:42:08] <ohsix> _if_ i had to; practically speaking most of the time i'm gonna want 44.1khz, 16bit
[12:42:17] <mru> since you refuse to tell me the supported rates and bitdepths
[12:42:37] <mru> what you might want most of the time has zero relevance here
[12:42:46] <ohsix> i'd pick 24, then 192, works? no, 96k, works? no, 48k, works? yes but i'd have to resample, 44.1? good! it works
[12:42:59] <ohsix> it just speaks to common usage
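The probing order ohsix spells out (most preferred configuration first, fall back until one is accepted) can be sketched in C as follows. This is only an illustration: `try_config_fn` is a hypothetical stand-in for whatever set-and-check call the audio API offers, and `mock_device_accepts` is a made-up device, not real hardware behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pcm_config { int bits; int rate_hz; };

/* Hypothetical "set and check" hook: true if the device took the config. */
typedef bool (*try_config_fn)(const struct pcm_config *cfg);

/* Try the candidates in order of preference; return the index of the
 * first one the device accepts, or -1 if none of them works. */
int pick_first_valid(const struct pcm_config *prefs, size_t n,
                     try_config_fn try_config)
{
    assert(prefs != NULL && try_config != NULL);
    for (size_t i = 0; i < n; i++)
        if (try_config(&prefs[i]))
            return (int)i;
    return -1;
}

/* Mock device used only for illustration: 16 bit at 44.1 or 48 kHz. */
bool mock_device_accepts(const struct pcm_config *cfg)
{
    return cfg->bits == 16 &&
           (cfg->rate_hz == 44100 || cfg->rate_hz == 48000);
}
```

The cost of this approach is entirely in the length of the preference list: an application that can produce any format, as mru goes on to argue, has no obvious short list to start from.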
[12:43:28] <ohsix> a _user_ that might want to know the search space of a device without exhaustively trying every combination will look at his chip's datasheet, or /proc/asound/*
[12:43:33] <mru> now you missed both 32 and float
[12:43:43] <ohsix> i don't need 32 or float
[12:43:55] <mru> you != everybody
[12:44:07] <ohsix> and people that do need it start with it, so?
[12:44:22] <mru> should my app google for the datasheet and parse it?
[12:44:46] <ohsix> thats why "common use" is important, 44.1khz 16bit is pretty much good to go, first try; huge possible search space but they nailed it on the first try
[12:45:00] <mru> you're impossible
[12:45:05] <ohsix> heh
[12:45:20] <ohsix> its not a bad interface in light of possible bad configurations
[12:45:27] <mru> wtf
[12:45:31] <pross-au> what if 44.1khz 16bit is not supported
[12:45:46] <ohsix> if all drivers could know all possible configurations, then it would be reasonable just to iterate over all possible configurations
[12:45:47] <pross-au> are you saying EVERY userland app has to perform its own detection
[12:45:54] <pross-au> imho, that's the job of the audio subsystem
[12:46:22] <ohsix> yes; if they want to use hw:, they're using the hardware, if they don't want to know or care, they use default, which might resample or do other magical things
[12:46:46] <ohsix> the "audio subsystem" will be fine if you aren't concerned with lording over hw:, it will do what you ask in the manner that it can
[12:46:52] <mru> how can anyone so totally fail to get the point?
[12:47:15] <ohsix> you understand there are invalid configurations yes?
[12:47:25] <mru> of course
[12:47:29] <ohsix> you understand the driver might not know what those are until they are set on the device
[12:47:35] <pross-au> depends. i've seen devices freeze when given invalid configurations
[12:47:45] <mru> but you're refusing to give me so much as a hint on what might be valid
[12:48:33] <ohsix> sure; i'm just saying in light of invalid configurations its kind of meaningless; you know the spectrum of sample rates and bit depths your software might be considered for, and you'd simply try and use them
[12:49:02] <ohsix> most bad software just gives you a place to put in the sample rates and buffer sizes/periods; and they aren't even trying to be clever
[12:49:10] <mru> what if my software works with _any_ configuration?
[12:49:28] <twnqx> then it's well behaving software :P
[12:49:30] <pross-au> thank god they got Berkeley sockets right the first time
[12:49:46] <ohsix> then certainly one configuration would be preferable over all of them? be it 192khz and 24bit; or 44.1 16bit; you'd pick the one that is preferable to the application you are writing
[12:49:47] <mru> so alsa is designed for crap software?
[12:49:59] <mru> why would one be preferred?
[12:50:19] <ohsix> why wouldn't it? if you're going to search the entire config space, you're not looking for a preferred format?
[12:50:20] <pross-au> my sb card prefers 8-bit 8000Hz
[12:50:30] <ohsix> if you're not looking for something then why are you searching
[12:50:30] <mru> for playback of files, sure whatever is in the file is the preferred setting
[12:51:06] <mru> if I can synthesise whatever I want, I'll want to choose something the hw supports
[12:51:18] <mru> probably the best it can support according to some metric
[12:51:33] <ohsix> right; so test the configuration in order of most preferred to least preferred, suitable for your application
[12:51:38] <pross-au> so ffplaying 8khz game samples, you suggest i resample them to 44.1khz, then have the audio driver re-resample back to 8khz.
[12:51:55] <ohsix> as i stated; i'd go with bit depth first, but it can be any criteria
[12:51:56] <mru> how do I test all possible configurations?
[12:52:07] <mru> there are billions
[12:52:16] <ohsix> you don't test all possible configurations, and there aren't billions
[12:52:37] <ohsix> pross-au: driver doesn't resample anything
[12:53:29] <ohsix> if you were playing a file the first format you would try would probably be without any conversion of any sort; and you'd try 8khz first in that case
[12:53:46] <mru> forget that case
[12:53:55] <mru> suppose I can generate _any_ format
[12:54:09] <mru> fine, let's limit it a bit
[12:54:13] <ohsix> you can; but hardware doesn't play any format
[12:54:19] <pross-au> pcskr cool
[12:54:27] <mru> up to 64 channels, up to 32 bits or float, up to 512kHz
[12:54:36] <mru> now where do I start?
[12:54:56] <mru> so how do I find the best format the hardware does support?
[12:55:17] <ohsix> i get your point, but i also know you don't have to ever conduct an exhaustive search of possible configurations, and if you did, by some chance; you'd sooner calculate it by reading the spec sheet rather than exercising it with alsa, something that might not even explore some dimensions of the device
[12:55:26] <merbzt> I have 11111Hz 8bit wav files, how should I do ?
[12:55:56] <mru> so you're saying my app needs to include datasheets for all sound cards?
[12:56:18] <KotH> can someone tell me what all this noise is about? or where the discussion started?
[12:56:19] <kshishkov> mru: precisely! And don't ask a driver for model name too.
[12:56:24] <ohsix> heh if a device has 64 channels alsa is going to split them up in a manner where you'd know there were 64 (maybe not 64 that went together, but 64 something)
[12:56:28] <mru> KotH: someone said alsa sucks
[12:56:36] <kshishkov> someone?
[12:56:37] <KotH> mru: true things are true
[12:57:19] <pross-au> other than linux, who else uses alsa?
[12:57:26] <ohsix> merbzt: try setting it, if its not one of the hw that can do a continuous range you use your resampler to 12khz or something that gives you a nice integer ratio of expansion for a cheap resample
[12:58:12] <merbzt> still 8 bits ?
[12:58:13] <pross-au> quit <N/C>
[12:58:21] <ohsix> ya still 8 bits
[12:58:51] <ohsix> 8 bits isn't all that unusual, if you had a weird sample rate like that you'd probably adjust sample rate first
[12:59:11] <mru> not if the hw supports it
[12:59:44] <ohsix> which would mean the first attempt at setting a device configuration would succeed
[12:59:45] <kshishkov> merbzt: dig out original SoundBlaster and enjoy
[13:00:18] <ohsix> i thought i said as much in the original reply :]
[13:01:24] <ohsix> mru: fwiw i've worked with a lot of interfaces like that; maybe it sucks, maybe it doesn't, but when you have 2 phase initialization theres not much you can do sometimes, but try a configuration and fail, generally you know within a very small window what is acceptable before outright failure is an option
[13:01:58] <merbzt> we want the pony
[13:02:10] <merbzt> not just everything, a pony also
[13:02:38] <ohsix> zomg ponies
[13:03:13] <ohsix> its like on old vga hardware, only some register settings got you a picture; or a monitor that caught fire :]
[13:03:27] <ohsix> vga hardware isn't going to tell you how to light a given monitor on fire
[13:04:14] <ohsix> but if you know the model monitor you can make an educated guess, if that was your goal (i think it was horizontal deflection, later models would just shut off though)
[13:05:15] <ohsix> the problem was the juice for the horizontal deflection would cause a secondary ringing in the flyback; that amounted to a large voltage flux that wasn't clamped in any meaningful way
[13:12:50] <mru> monitors tell you what they support
[13:12:53] <mru> it's called edid
[13:14:17] <ohsix> heh, not when you could set them on fire :] (and you can still send them invalid signals)
[13:19:59] <ohsix> morning BBB
[13:20:06] <BBB> howdy
[13:24:37] <ohsix> mru: the invalid configuration stuff is even more important with chained elements, i know you think that part shouldn't exist; but it constrains valid configurations even further, to a subset of the element and what it is slaved to
[14:39:22] <censor> hi all
[14:39:53] <censor> i just read that 0.6 was released, with RTMP support - does that include RTMP broadcast/publishing, how you need it for Akamai live streaming, and if that's not the case, is it planned to support it?
[14:39:57] <mru> we don't want censorship here
[14:40:40] <censor> me neither ;)
[14:41:00] <censor> look up the egyptian explanation of the name...
[14:41:53] <wbs> censor: yes, it should support broadcast/publishing, both using the lavf-internal rtmp protocol and using librtmp
[14:42:09] <wbs> whether it actually works with all rtmp servers is another question though
[14:42:13] <censor> great news, thanks!
[14:42:27] <censor> well, at least it's worth trying =)
[15:48:23] <BBB> how do I know in what CPU version a particular instruction is available?
[15:54:59] <kshishkov> look at the reference
[15:55:53] <kshishkov> each CPU supports whole instruction sets like SSE, not "Intel Pentium III rev 2, now with movupd"
[15:56:37] <BBB> I know that, but the intel manual doesn't say whether each instruction belongs to "mmx", "sse2", or whatever
[15:56:49] <BBB> and most useful instructions are ssse3, which my cpu doesn't have ;(
[15:57:52] * kshishkov usually can check that in NASM docs
[15:57:59] <kshishkov> appendix B
[15:58:15] <BBB> ok
[16:05:15] <pengvado> intel manual does say, but not in a concise way
[16:06:09] <pengvado> e.g. SSE instructions will say "If CPUID.01H:EDX.SSE[bit 25] = 0" in the possible exceptions table.
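The CPUID bit pengvado quotes is one of several feature flags returned by CPUID leaf 1. A small C sketch of decoding them is shown below. The struct and function names are made up for illustration; the function only decodes register values handed to it and does not execute CPUID itself (on GCC or Clang the values could come from `__get_cpuid` in `<cpuid.h>`, which is left out here to keep the sketch portable).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative container for the SIMD levels discussed in the chat. */
struct x86_simd { bool mmx, sse, sse2, sse3, ssse3, sse41; };

/* Decode the SIMD feature flags from the ECX/EDX values of CPUID leaf 1. */
struct x86_simd decode_cpuid_leaf1(uint32_t ecx, uint32_t edx)
{
    struct x86_simd f;
    f.mmx   = (edx >> 23) & 1;  /* CPUID.01H:EDX.MMX[bit 23]    */
    f.sse   = (edx >> 25) & 1;  /* CPUID.01H:EDX.SSE[bit 25]    */
    f.sse2  = (edx >> 26) & 1;  /* CPUID.01H:EDX.SSE2[bit 26]   */
    f.sse3  = (ecx >> 0)  & 1;  /* CPUID.01H:ECX.SSE3[bit 0]    */
    f.ssse3 = (ecx >> 9)  & 1;  /* CPUID.01H:ECX.SSSE3[bit 9]   */
    f.sse41 = (ecx >> 19) & 1;  /* CPUID.01H:ECX.SSE4.1[bit 19] */
    return f;
}

/* Usage with made-up register values, not read from a real CPU. */
void decode_cpuid_example(void)
{
    struct x86_simd f =
        decode_cpuid_leaf1(1u, (1u << 23) | (1u << 25) | (1u << 26));
    assert(f.sse2 && f.sse3 && !f.ssse3);
}
```

A CPU like the one BBB describes, with SSE3 but no SSSE3, would report ECX bit 0 set and bit 9 clear.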
[16:08:31] <Dark_Shikari> mru: http://games.venturebeat.com/2010/06/17/gaikai-signs-ea-as-digital-distribution-partner/
[16:15:58] <av500> Dark_Shikari: ping
[16:16:03] <BBB> I need a lookup cheatsheet where I type in what I want my favourite instruction to do, and it'll tell me what's the closest instruction that matches that description
[16:16:15] <Dark_Shikari> av500: ping
[16:16:21] <av500> Dark_Shikari: see pm
[16:16:45] <mru> BBB: such a thing exists
[16:16:49] <mru> BBB: it's called brain
[16:16:53] <mru> needs some training though
[16:16:59] <BBB> heh :) yeah thanks
[16:16:59] <av500> mru: where can I download one?
[16:17:46] <Dark_Shikari> av500: yes
[16:29:06] <BBB> dark_shikari: ok, so I'm trying something simple, a horizontal-only 4-tap subpel MC in a 4x4 block (in plain mmx, as a start, does that make sense?)... pmaddwd isn't terribly useful, is it? I'm trying to multiply all four pixels that are used to calculate the destination pixel at once, but pmaddwd gives me the result as two dwords in a mm register, and I can't figure out how to add them.. is it easier to multiply "the whole row at once" for the first tap, then again for the second, and then use paddsb to add it up and write out the whole row at once?
[16:29:32] <Dark_Shikari> you use paddw not paddsb
[16:29:40] <Dark_Shikari> then psraw when you're done
[16:29:41] <Dark_Shikari> then packuswb
[16:29:46] <BBB> don't I have to clip then?
[16:29:50] <Dark_Shikari> saturate
[16:29:52] <Dark_Shikari> that's what packuswb does
[16:30:01] <BBB> oh ok, I thought paddsb would do that
[16:30:09] <BBB> so pmaddwd is useless?
[16:30:27] <Dark_Shikari> er... but that's not how it works
[16:30:28] <Dark_Shikari> it's
[16:30:34] <Dark_Shikari> (A+B+C+D... + round)>>X
[16:30:38] <Dark_Shikari> you can't saturate until after the >> X
[16:30:52] <BBB> oh right
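As a scalar reference for the order of operations Dark_Shikari gives (sum the taps, add the rounding term, shift, and only then saturate), one output pixel could be computed like this in C. It is a sketch for checking SIMD code against, not code from the pastebin; the function name and the tap values used to exercise it are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* One output pixel of a 4-tap horizontal filter:
 * clip((A*src[-1] + B*src[0] + C*src[1] + D*src[2] + round) >> shift). */
uint8_t filter4_pixel(const uint8_t *src, const int16_t taps[4],
                      int round, int shift)
{
    assert(src && taps && shift >= 0 && shift < 16);
    int32_t sum = taps[0] * src[-1] + taps[1] * src[0]
                + taps[2] * src[1]  + taps[3] * src[2] + round;
    if (sum < 0)
        return 0;   /* a negative sum clips to 0 whatever the shift yields */
    sum >>= shift;
    return sum > 255 ? 255 : (uint8_t)sum;   /* saturate after the shift */
}
```

For example, with taps `{-6, 123, 12, -1}`, `round = 64` and `shift = 7`, a flat row of 100s filters to 100, while the row 255, 255, 255, 0 gives 257 before saturation and therefore 255 after it.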
[16:31:16] <Dark_Shikari> pmaddwd is useful here
[16:31:37] <Dark_Shikari> suppose you need to calculate A*src[-1] + B * src[0] + C * src[1] + D * src[2] for each pixel.
[16:32:03] <Dark_Shikari> you create a global constant xmm7 = {A,B,A,B}, signed words
[16:32:12] <Dark_Shikari> you create a global constant xmm6 = {C,D,C,D}, signed words
[16:32:18] <Dark_Shikari> er, I mean, mm7/mm6
[16:32:22] <Dark_Shikari> not xmm since you're doing mmx.
[16:32:28] <Dark_Shikari> movq mm0, [src-1]
[16:32:30] <Dark_Shikari> movq mm1, [src]
[16:32:34] <Dark_Shikari> movq mm2, [src+1]
[16:32:37] <Dark_Shikari> movq mm3, [src+2]
[16:32:41] <Dark_Shikari> punpcklbw mm0, mm1
[16:32:44] <Dark_Shikari> punpcklbw mm2, mm3
[16:32:49] <Dark_Shikari> pmaddwd mm0, mm7
[16:32:54] <Dark_Shikari> pmaddwd mm2, mm6
[16:33:07] <Dark_Shikari> paddw mm0, mm2
[16:33:15] <BBB> wait wait, you shouldn't tell me the solution :-p
[16:33:17] <Dark_Shikari> paddw mm0, ROUND
[16:33:20] <Dark_Shikari> psrlw mm0, SHIFT
[16:33:23] <BBB> that way I'll never get it ;)
[16:33:26] <Dark_Shikari> etc
[16:33:28] <Dark_Shikari> =p
[16:34:51] <BBB> so why would you interleave mm0 and mm1? to make words, shouldn't you interleave with zero before the pmaddwd?
[16:34:59] <BBB> there's probably magic there, but what is your magic? :)
[16:35:05] <Dark_Shikari> Oh, yeah, I screwed up
[16:35:10] <Dark_Shikari> it should be both.
[16:35:33] <Dark_Shikari> 1) interleave with zero, 2) interleave with each other
[16:35:42] <BBB> why interleave with each other?
[16:35:43] <Dark_Shikari> thus it'd be movd instead of movq
[16:35:47] <Dark_Shikari> here's why
[16:35:52] <Dark_Shikari> you could just interleave with zero and use pmullw
[16:35:58] <Dark_Shikari> But that way you need _4_ registers with constants
[16:36:00] <Dark_Shikari> A, B, C, D.
[16:36:13] <Dark_Shikari> If you interleave, you just need {A,B,A,B} and {C,D,C,D}
[16:36:28] <Dark_Shikari> i.e. 2 pixels from each of two sources in each register
[16:36:36] <Dark_Shikari> You will be register-strapped here.
[16:36:54] <Dark_Shikari> You might want to just do the naive way first.
[16:37:09] <BBB> oh, and then pmaddwd autoadds them halfly after mul, paddd adds them for the second half
[16:37:18] <BBB> and I did two pixels in the row at once
[16:37:21] <BBB> I think I get it
[16:37:34] <BBB> that makes sense
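The interleaving trick can be modelled in plain C to see why two constant registers are enough. The sketch below is a model of the data flow only: `pmaddwd_model` mimics what the MMX instruction does on four signed words, and `filter4_two_pixels` is a hypothetical helper name. The bytes are assumed to have been zero-extended to words before the multiply, as corrected in the chat.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MMX pmaddwd: multiply four signed words pairwise, then add
 * each pair of neighbouring products into one signed dword. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = a[0] * b[0] + a[1] * b[1];
    out[1] = a[2] * b[2] + a[3] * b[3];
}

/* Unrounded sums A*src[i-1] + B*src[i] + C*src[i+1] + D*src[i+2]
 * for the two pixels i = 0 and i = 1. */
void filter4_two_pixels(const uint8_t *src, const int16_t taps[4],
                        int32_t out[2])
{
    assert(src && taps && out);
    const int16_t ab[4] = { taps[0], taps[1], taps[0], taps[1] }; /* {A,B,A,B} */
    const int16_t cd[4] = { taps[2], taps[3], taps[2], taps[3] }; /* {C,D,C,D} */
    /* The loads from src-1 and src, interleaved word by word. */
    const int16_t p01[4] = { src[-1], src[0], src[0], src[1] };
    /* The loads from src+1 and src+2, interleaved the same way. */
    const int16_t p23[4] = { src[1], src[2], src[2], src[3] };
    int32_t lo[2], hi[2];
    pmaddwd_model(p01, ab, lo);   /* A*src[i-1] + B*src[i]   */
    pmaddwd_model(p23, cd, hi);   /* C*src[i+1] + D*src[i+2] */
    out[0] = lo[0] + hi[0];       /* the paddd step          */
    out[1] = lo[1] + hi[1];
}
```

Each `pmaddwd` produces half of the tap sum for two pixels, so one dword add finishes both pixels; a `pmullw` version would need four separate constant registers for A, B, C and D.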
[16:37:37] <Dark_Shikari> It's probably easier to start with pmullw.
[16:37:40] <Dark_Shikari> But you can try both.
[16:37:46] <Dark_Shikari> My method will do 2 pixels at once
[16:37:50] <BBB> I'll try, I can always throw it out :)
[16:37:52] <Dark_Shikari> so you'd do 2 pixels, then another 2 pixels
[16:37:58] <BBB> right, and then next row
[16:38:01] <Dark_Shikari> then punpckldq
[16:38:04] <Dark_Shikari> to get 4 pixels
[16:38:06] <Dark_Shikari> then packuswb
[16:38:09] <BBB> should I code loops in asm, or just a macro-loop as you explained yesterday?
[16:38:56] <Dark_Shikari> you should generally not needlessly unroll code _unless_ it lets you save ops
[16:39:06] <Dark_Shikari> for example, you can unroll by a factor of two to get a packuswb in there
[16:39:13] <Dark_Shikari> because packuswb lets you do two things at once
[16:39:14] <Dark_Shikari> e.g.
[16:39:20] <Dark_Shikari> packuswb 0A0B, 0C0D
[16:39:21] <Dark_Shikari> gives you
[16:39:23] <Dark_Shikari> ABCD
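What `packuswb` does to its two operands can be written out as a small C model. This is a sketch of the 64-bit MMX form only, with a made-up function name: four signed words from the destination and four from the source are each saturated to an unsigned byte and packed into eight bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate one signed word to the unsigned byte range 0..255. */
static uint8_t sat_u8(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of MMX packuswb: the destination's four words become the low
 * four bytes of the result, the source's four words the high four. */
void packuswb_model(const int16_t dst_words[4], const int16_t src_words[4],
                    uint8_t out[8])
{
    assert(dst_words && src_words && out);
    for (int i = 0; i < 4; i++) {
        out[i]     = sat_u8(dst_words[i]);
        out[i + 4] = sat_u8(src_words[i]);
    }
}
```

This is why unrolling by two pays off here: a single `packuswb` both clips and packs two registers of word results, so processing only one register's worth would waste half of the instruction.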
[16:40:29] * BBB tries the two-pixel approach
[16:40:39] <BBB> don't worry, I'll ask more silly questions
[16:40:44] <BBB> I'll get this one day
[17:32:34] <av500> hmm, these pesky ffdevs stand in the way of progress always :)
[17:33:00] <elenril> ffprogress!
[17:34:27] <mru> I'm a bit puzzled
[17:34:35] <mru> I want the patch applied, michael says no
[17:34:40] <mru> yet I'm the one blocking something
[17:34:45] <mru> I just don't get it
[17:34:51] <av500> i meant the vp8 thread :)
[17:35:09] <mru> s/michael/google/
[17:39:06] <av500> but then, gg is present in both threads...
[17:44:53] <av500> [amrnb @ 0x8b94160]dtx mode not implemented
[17:47:19] <av500> is amrnb now over? since we have libopencore-amr?
[17:47:48] <kshishkov> on the contrary
[17:47:52] <mru> your troll attempt is far too obvious
[17:48:27] <kshishkov> av500: feel free to yell at superdump to make him implement DTX
[17:48:43] * av500 yells at superdump to make him implement DTX
[17:48:55] <kshishkov> av500: though he's not related to TI it should make you feel better
[17:49:18] <av500> kshishkov: you have very wrong picture of me
[17:49:43] * mru has a big picture
[17:50:11] <kshishkov> av500: should I look in the mirror and compare?
[18:08:06] <BBB> Dark_Shikari: is there a packusdw-like instruction?
[18:08:14] <BBB> or one without clipping is fine also
[18:09:00] <Dark_Shikari> yes.... in sse4.  but why do you need it?
[18:09:22] <Dark_Shikari> packssdw is fine because your 32-bit values will never even be larger than 16-bit
[18:09:56] <BBB> true
[18:10:47] <Dark_Shikari> so in short, here's how I'd do it in mmx
[18:11:04] <Dark_Shikari> 2 sets of pmaddwd stuff -> packssdw -> 4 pixels as words
[18:11:11] <Dark_Shikari> add results -> 4 final pixels as words
[18:11:20] <Dark_Shikari> only add _after_ packing to words because it's one less op (one add instead of two)
[18:11:28] <Dark_Shikari> then, repeat that step twice, so you have two sets of 4 pixels
[18:11:31] <Dark_Shikari> then packuswb
[18:11:32] <Dark_Shikari> then store
[18:11:37] <Dark_Shikari> thus, 4 sets of pmaddwd -> 8 output pixels
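The pipeline outlined above (partial sums with `pmaddwd`, `packssdw` down to words, one add, then round, shift and `packuswb`) can be checked against a scalar reference with a C model like the one below. Several things here are assumptions rather than anything stated in the chat: the function names, the rounding constant 64 with a shift of 7, and above all the use of saturating 16-bit adds (what `paddsw` would do), chosen because a plain wrapping add could overflow for large inputs.

```c
#include <assert.h>
#include <stdint.h>

static int16_t sat_s16(int32_t v)
{
    return v < -32768 ? -32768 : v > 32767 ? 32767 : (int16_t)v;
}

static uint8_t sat_u8(int32_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar reference with a full 32-bit intermediate. */
uint8_t ref_pixel(const uint8_t *src, const int16_t t[4])
{
    int32_t sum = t[0] * src[-1] + t[1] * src[0]
                + t[2] * src[1]  + t[3] * src[2] + 64;
    return sum < 0 ? 0 : sat_u8(sum >> 7);
}

/* Model of the packed pipeline for four output pixels. */
void filter_row4(const uint8_t *src, const int16_t t[4], uint8_t dst[4])
{
    assert(src && t && dst);
    for (int i = 0; i < 4; i++) {
        int32_t ab = t[0] * src[i - 1] + t[1] * src[i];      /* pmaddwd  */
        int32_t cd = t[2] * src[i + 1] + t[3] * src[i + 2];  /* pmaddwd  */
        int16_t ab16 = sat_s16(ab);                          /* packssdw */
        int16_t cd16 = sat_s16(cd);                          /* packssdw */
        int16_t sum = sat_s16((int32_t)ab16 + cd16);         /* paddsw   */
        sum = sat_s16((int32_t)sum + 64);                    /* paddsw   */
        dst[i] = sum < 0 ? 0 : sat_u8(sum >> 7);   /* psraw + packuswb   */
    }
}
```

With taps `{-6, 123, 12, -1}` and the input 0, 255, 255, 0 the exact sum is 34425, which does not fit in a signed word; the saturating model still returns 255, the same as the 32-bit reference, because anything that saturates at 32767 ends up above 255 after the shift anyway.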
[18:17:31] <BBB> 2 sets of pmaddwd only multiply 8 numbers, for a four-tap filter that's 2 pixels, not 4
[18:17:32] <BBB> ?
[18:17:42] <BBB> the rest I understand
[18:17:47] <BBB> I have a function that's almost there
[18:17:54] <BBB> only need to add the filter constants somewhere now
[18:17:57] <Dark_Shikari> "one set of pmaddwd" calculates 2 pixels
[18:17:58] <BBB> and test and see how often it crashes
[18:18:10] <Dark_Shikari> "one set of pmaddwd" is two pmaddwds and associated support code
[18:18:18] <BBB> ok, I got it then
[18:18:24] <BBB> I think I have the code you're thinking of
[18:18:26] <Dark_Shikari> so basically it's a branch out, then branch in
[18:18:29] <BBB> let me test it :)
[18:18:29] <Dark_Shikari> load 8 pixels from each source
[18:18:37] <Dark_Shikari> split into 4 sets of pmaddwd
[18:18:39] <Dark_Shikari> combine it back together.
[18:18:44] <Dark_Shikari> or you can do 4 pixels, then split into 2 sets.
[18:19:00] <Dark_Shikari> I would do the loading 4 pixels to start -- less register pressure
[18:19:03] <Dark_Shikari> pastebin it when you're done
[18:19:33] <BBB> ok
[18:19:39] <BBB> I'm doing 4 pixels right now
[18:19:45] <BBB> it actually fits in 6 registers, I think
[18:19:49] <Dark_Shikari> including constants?
[18:19:53] <Dark_Shikari> you'll need regs for constants
[18:19:54] <BBB> yes
[18:19:56] <Dark_Shikari> sweet.
[18:19:57] <Dark_Shikari> let me see.
[18:19:58] <BBB> only 2
[18:20:00] <BBB> in a bit
[18:20:01] <BBB> still working
[18:20:05] <Dark_Shikari> Yeah, isn't my trick nice to save regs ;)
[18:20:16] <Dark_Shikari> And save two adds, I guess.
[18:20:17] <BBB> I think pmullw would've needed 7 or 8
[18:20:20] <BBB> I didn't try
[18:20:25] <Dark_Shikari> k
[18:20:50] <BBB> punpck* reg, [mem] <- does mem need to be aligned in any way?
[18:20:56] <Dark_Shikari> for mmx: no
[18:21:02] <BBB> ok, good
[18:21:03] <Dark_Shikari> Keep in mind that it's best still to avoid crossing cachelines
[18:21:06] <Dark_Shikari> now, here's a gotcha
[18:21:10] <Dark_Shikari> a horrible, horrible gotcha
[18:21:20] <Dark_Shikari> mmx punpcklXX loads 4 bytes from memory
[18:21:23] <Dark_Shikari> using the 32-bit load unit.
[18:21:25] <Dark_Shikari> This makes sense, right?
[18:21:41] <Dark_Shikari> because it only uses the low half of each unit.
[18:21:44] <Dark_Shikari> of each reg, that is
[18:21:51] <Dark_Shikari> right?
[18:26:42] <BBB> I guess so
[18:26:50] <BBB> I'm only reading 4 bytes in every iteration anyway?
[18:27:04] <BBB> why?
[18:27:30] <Dark_Shikari> because we haven't modified it to use 8 yet
[18:30:27] <Dark_Shikari> anyways, per what I said above
[18:30:29] <Dark_Shikari> here's the gotcha
[18:30:35] <Dark_Shikari> you would think that an sse punpckl would read 64 bits, right?
[18:30:40] <Dark_Shikari> since it reads the lower half of 128-bit.
[18:30:51] <Dark_Shikari> But it doesn't.  It reads the full 128 bits of memory, even if it only uses 64 bits of it.
[18:31:03] <Dark_Shikari> Which means it crashes on unaligned loads.
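Both hazards mentioned here (an SSE memory operand that must be 16-byte aligned, and loads that straddle a cache line) come down to simple address arithmetic, sketched below in C. The helper names are hypothetical, and the 64-byte line size used in the example is an assumption about the CPU, not something stated in the chat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if addr is a multiple of align (align must be a power of two). */
bool is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* True if a load of `size` bytes starting at addr spans two cache lines
 * of `line` bytes each (line must be a power of two). */
bool crosses_line(uintptr_t addr, size_t size, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return size != 0 && (addr & (line - 1)) + size > line;
}
```

So an 8-byte load at offset 60 of a 64-byte line crosses into the next line, while the same load at offset 56 does not; and an address such as 0x1008 fails `is_aligned(addr, 16)`, which is the case where the SSE `punpckl` with a memory operand would fault.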
[18:32:27] <BBB> does punpcklbw mm0, 0 make sense to do a zero-extend byte->word?
[18:32:31] <BBB> yasm doesn't appear to like it
[18:32:39] <Dark_Shikari> of course not
[18:32:44] <Dark_Shikari> you can't have immediates for mmx or xmm
[18:32:47] <Dark_Shikari> 0 must be a register
[18:32:53] <BBB> the docs say I can give it a mm32
[18:32:56] <BBB> isn't that a constant?
[18:33:05] <Dark_Shikari> no, that's the lower half of an mmx register
[18:33:12] <BBB> oh
[18:33:22] <BBB> damnit, so I need 7 registers then, one as my zero :)
[18:34:32] <Dark_Shikari> btw, what are the 4 tapfilter constants?
[18:34:40] <Dark_Shikari> sometimes you can do magic if the numbers are right
[18:36:56] <BBB> they depend on one of the input arguments, similar to chroma mc in h264
[18:37:01] <BBB> mx/my are function args
[18:37:35] <BBB> 6, 123, 12, 1 | 9, 93, 50, 6 or the reverse of either of these two
[18:38:58] <BBB> http://ffmpeg.pastebin.com/ws4QEkYw
[18:39:11] <BBB> I haven't tested it yet, I still have to integrate it into the calling code
[18:39:15] <BBB> but it compiles at least
[18:39:27] <BBB> and the fourtimes_64 thing is a hack because I'm lazy
[18:39:32] <BBB> oh, it's actually 63, typo
[18:39:47] <BBB> hm, no, it's 127, double typo
[18:39:50] <BBB> anyway, needs testing
[18:40:08] <BBB> n/m
[18:42:00] <Dark_Shikari> there should be no tapfilter setup code
[18:42:04] <Dark_Shikari> prepare all the constants globally
[18:42:10] <Dark_Shikari> also you can use "times 4 dw 64"
[18:42:41] <Dark_Shikari> i.e. the punpck/punpck code above nextrow should be gone
[18:42:53] <Dark_Shikari> 31 is BCDE, not EFGH
[18:42:54] <BBB> right, that could be integrated in the table
[18:43:17] <Yuvi> ugh, the filter really does need 17 bit signed intermediates
[18:43:28] <Dark_Shikari> o.0
[18:43:36] <Dark_Shikari> oh fuck
[18:43:38] <Dark_Shikari> what the fuck
[18:43:46] <BBB> why?
[18:43:48] <Dark_Shikari> they're insane
[18:43:53] <Yuvi> (3+77+77+3)*255
[18:43:57] <Yuvi> as max
[18:43:59] <kshishkov> sounds like VP3 :)
[18:44:10] <BBB> but it's unsigned
[18:44:25] <Yuvi> -(16+16)*255
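Yuvi's two bounds can be reproduced with a few lines of C that sum the positive and the negative taps separately and scale by the largest pixel value. The helper names are made up, and the tap order `{3, -16, 77, 77, -16, 3}` is an assumption; only the magnitudes and signs come from the two expressions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extreme values of sum(taps[i] * pixel[i]) for 8-bit pixels: the maximum
 * puts 255 under every positive tap, the minimum under every negative one. */
void filter_sum_range(const int *taps, size_t n, int32_t *lo, int32_t *hi)
{
    assert(taps && lo && hi);
    int32_t pos = 0, neg = 0;
    for (size_t i = 0; i < n; i++) {
        if (taps[i] > 0)
            pos += taps[i];
        else
            neg += taps[i];
    }
    *hi = pos * 255;
    *lo = neg * 255;
}

/* Smallest two's-complement width that can hold every value in [lo, hi]. */
int signed_bits_needed(int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    int bits = 1;
    while (lo < -((int64_t)1 << (bits - 1)) ||
           hi > ((int64_t)1 << (bits - 1)) - 1)
        bits++;
    return bits;
}
```

For these taps the range is -8160 to 40800, and 40800 exceeds the signed 16-bit maximum of 32767, hence the 17 bits.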
[18:45:10] <BBB> hello by the way, have you been idling and looking at this all morning? :-p
[18:45:17] <Yuvi> barely
[18:45:21] <BBB> pheew
[18:45:34] <Yuvi> I just thought the paddd was ugly
[18:45:39] <Yuvi> but it appears to be needed
[18:45:43] <BBB> I agree
[18:45:50] <BBB> but I couldn't think of anything better for now
[18:45:54] <BBB> hey, it's my first asm ;)
[18:46:30] <Dark_Shikari> you should generally group related instructions together
[18:46:31] <Dark_Shikari> e.g.
[18:46:32] <Dark_Shikari> movd/movd
[18:46:34] <Dark_Shikari> punpck/punpck
[18:46:38] <Yuvi> it's not bad ;)
[18:46:40] <Dark_Shikari> it's clearer to read
[18:46:55] <Dark_Shikari> and better on in-order
[18:46:58] <BBB> how do I integrate this in vp8dsp.c?
[18:47:12] <BBB> I'll do that, but would like to test and see if it crashes
[18:47:15] <BBB> or maybe it works
[18:47:17] <Dark_Shikari> you should use pw_64, not fourtimes_64
[18:47:20] <Dark_Shikari> as the name
[18:47:21] <Yuvi> make x86/vp8dsp.c, similar to x86/mlpdsp.c
[18:47:22] <Dark_Shikari> and put it in mm7
[18:47:36] <Dark_Shikari> If you want to save a reg, use my pavgw trick
[18:47:39] <Dark_Shikari> Oh, you made a mistake
[18:47:42] <Yuvi> and make+call ff_vp8dsp_init_mmx
[18:48:01] <Dark_Shikari> you're going to need to add twotimes_64 _before_ the pack
[18:48:04] <Dark_Shikari> with paddd
[18:48:15] <Dark_Shikari> oh wait, no, it doesn't matter does it?
[18:48:26] <Dark_Shikari> because it would get saturated...
[18:48:34] <Dark_Shikari> hmm.  this is a good question
[18:48:42] <Dark_Shikari> Yuvi: do we actually need 17-bit intermediate?
[18:48:50] <Dark_Shikari> the right shift is by 7
[18:48:56] <Dark_Shikari> that means anything larger than 1 << 15 will saturate
[18:49:03] <Dark_Shikari> er, >= 1<<15
[18:49:27] <Yuvi> hm, maybe not
[18:49:32] <Dark_Shikari> BBB: your code doesn't use r4 and r5
[18:49:46] <BBB> it uses r4 in the beginning
[18:49:54] <Dark_Shikari> where
[18:50:05] <BBB> where it uses r2, that should be r4 :-p
[18:50:09] <BBB> r5 is indeed unused
[18:50:14] <Dark_Shikari> isn't r4 and r5 the mv?
[18:50:18] <Dark_Shikari> or are they just the low bits of each mv?
[18:50:35] <BBB> low bits
[18:50:42] <BBB> this is a H-only function
[18:50:46] <BBB> was simplest to start with
[18:50:48] <Dark_Shikari> so, BBB, you have two choices
[18:50:50] <BBB> so my is 0
[18:50:56] <Dark_Shikari> add 2z64 before packssdw
[18:50:59] <Dark_Shikari> *2x64
[18:51:02] <Dark_Shikari> or add 4x64 after
[18:51:05] <Dark_Shikari> you're using "paddd"
[18:51:08] <Dark_Shikari> that's a 32-bit add
[18:51:10] <Dark_Shikari> your source is 16-bit
[18:51:12] <Dark_Shikari> oops ?
[18:51:15] <BBB> oh, right, oops
[18:51:32] <BBB> paddw then?
[18:51:39] <BBB> is probably faster than 2x paddd
[18:51:47] <Dark_Shikari> yes
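
Why a 32-bit add is wrong for packed 16-bit data can be shown with two words packed into one `uint32_t`. This is a scalar sketch with hypothetical helper names, not the MMX code under review: a dword add lets the carry out of the low word leak into the high word, while a per-word add does not.

```c
#include <stdint.h>

/* Treat the value as one 32-bit lane (paddd-style): a carry out of the
 * low 16 bits propagates into the high 16 bits. */
static uint32_t add_as_dword(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Treat the value as two independent 16-bit lanes (paddw-style):
 * each lane wraps around on its own. */
static uint32_t add_as_words(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

With the lanes `{1, -64}` plus `{64, 64}`, the per-word add gives `{65, 0}` while the dword add gives `{66, 0}`: the high lane is off by one because of the leaked carry.
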
[18:52:46] <Yuvi> but you'll have to be very, very careful about order if you sum the taps with 16-bit signed adds
[18:53:00] <Dark_Shikari> well he's using 32-bit now
[18:53:09] <BBB> should ff_vp8_dsp_x86_init() be in libavcodec/vp8dsp.c or should I create another C file in x86/?
[18:53:14] <Dark_Shikari> BBB: also you could use the pavgw trick here to save a register
[18:53:14] <BBB> yeah right now it's 32 bit, should be ok
[18:53:18] <Yuvi> another C in x86
[18:53:31] <BBB> Dark_Shikari: I'll use it, let me first try this code, just to see if it works ;)
[18:53:40] <BBB> I'm still learning ;)
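
The layout Yuvi describes (a generic init that installs the C functions, plus a separate x86 init that overrides them) could look roughly like the sketch below. Every name here (`DemoDSPContext`, `DEMO_CPU_FLAG_MMX`, the `demo_*` functions) is a hypothetical stand-in, not FFmpeg's actual API, and the "mmx" function is an ordinary C placeholder for what would be an external asm symbol.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*demo_mc_func)(uint8_t *dst, const uint8_t *src,
                             ptrdiff_t stride, int h);

/* Hypothetical DSP context: one function pointer per primitive. */
typedef struct DemoDSPContext {
    demo_mc_func put_epel4_h4;
} DemoDSPContext;

#define DEMO_CPU_FLAG_MMX 0x1 /* hypothetical CPU capability bit */

/* Placeholder C version: copies 4 pixels per row instead of filtering. */
static void demo_put_epel4_h4_c(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 4);
}

/* Stand-in for the asm routine; in a real tree this would be declared
 * extern and implemented in the .asm file. */
static void demo_put_epel4_h4_mmx(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 4);
}

/* Would live in its own file under x86/: overrides pointers by CPU flag. */
static void demo_dsp_init_x86(DemoDSPContext *c, int cpu_flags)
{
    if (cpu_flags & DEMO_CPU_FLAG_MMX)
        c->put_epel4_h4 = demo_put_epel4_h4_mmx;
}

/* Generic init: install the C versions, then let the arch code override. */
static void demo_dsp_init(DemoDSPContext *c, int cpu_flags)
{
    c->put_epel4_h4 = demo_put_epel4_h4_c;
    demo_dsp_init_x86(c, cpu_flags);
}
```

Keeping the override in a separate file means the generic code never references asm symbols directly, so a build without the x86 objects still links.
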
[19:09:46] <BBB> Undefined symbols:
[19:09:46] <BBB>   "_ff_put_vp8_epel4_h4_mmx", referenced from:
[19:09:46] <BBB>       _ff_put_vp8_epel4_h4_mmx$non_lazy_ptr in libavcodec.a(vp8dsp-init.o)
[19:09:56] <BBB> probably stupid question or so, but there's really no typo :)
[19:10:21] <Dark_Shikari> check in the .o what it's actually called.
[19:10:29] <kshishkov> is your function declared global?
[19:10:45] <BBB> oh, so smart, it prefixes the ff itself
[19:10:48] <BBB> I had called it ff_ already
[19:10:56] <BBB> so it was _ff_ff_put
[19:11:54] <BBB> whoa it's bitexact
[19:12:03] <Dark_Shikari> WHAT?
[19:12:04] <BBB> there were no additional typos :-p
[19:12:06] <Dark_Shikari> You got it right the first time?
[19:12:07] <Dark_Shikari> awesome.
[19:12:11] <BBB> hehe :)
[19:12:11] <Dark_Shikari> that is the best feeling in the world
[19:12:16] <Dark_Shikari> When you write a big fancy function
[19:12:18] <Dark_Shikari> and it's RIGHT
[19:12:25] <BBB> this one is really quite small :)
[19:12:32] <mru> usually when that happens it's not actually being called
[19:12:36] <Dark_Shikari> mru: true
[19:12:38] <Dark_Shikari> =p
[19:12:41] <Dark_Shikari> Highly true.
[19:12:42] <BBB> how do I test that?
[19:12:45] <mru> so I insert a deliberate error just to be sure
[19:12:46] <BBB> it's very likely
[19:12:51] <Dark_Shikari> BBB: "mov esp, 0"
[19:12:56] <Dark_Shikari> ;)
[19:12:57] <BBB> ok
[19:21:12] <BBB> I guess I need a better test movie
[19:21:21] <BBB> Yuvi: didn't you have a movie that had a lot of subpel mv?
[19:21:43] <Yuvi> http://www.supergenije.com/cruncher/test.webm
[19:22:07] <Dark_Shikari> BBB: or you made a booboo
[19:22:24] <Dark_Shikari> it's rather unlikely that a particular subpel position will never get called.
[19:22:25] <BBB> the function pointer is set
[19:22:41] <Dark_Shikari> unless the clip was literally encoded with no subpel
[19:24:05] <av500> it's vp8, anything is possible...
[19:25:00] <BBB> 4x4 is only used for split subpel, my guess is the movie has a lot of full blocks (setting 16x16 4tap function to NULL crashes instantly), but few split blocks
[19:25:17] <Dark_Shikari> Oh
[19:25:19] <Dark_Shikari> this is 4x4 only
[19:25:22] <BBB> right
[19:25:25] <Dark_Shikari> and you don't do 8x8 by calling 4x4 a lot
[19:25:30] <BBB> right
[19:25:31] <BBB> yet
[19:25:32] <Dark_Shikari> fyi, 16x16 should be done (except in fullpel cases) by calling 8x8
[19:25:38] <Dark_Shikari> for mmx, 8x8 should probably call 4x4
[19:25:39] <Dark_Shikari> for sse, no way
[19:25:44] <Dark_Shikari> since sse can do more pixels at once
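
Building a larger block out of calls to a smaller primitive, as suggested for MMX, can be sketched as follows. The function names are hypothetical, and the 4x4 primitive is a plain copy standing in for the real subpel filter so the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in 4x4 primitive: copies a 4x4 block. A real one would filter. */
static void demo_put4x4(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 4; y++)
        memcpy(dst + y * stride, src + y * stride, 4);
}

/* 8x8 built from four 4x4 calls, one per quadrant. */
static void demo_put8x8_via_4x4(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride)
{
    for (int y = 0; y < 8; y += 4)
        for (int x = 0; x < 8; x += 4)
            demo_put4x4(dst + y * stride + x, src + y * stride + x, stride);
}
```

With MMX a register holds only four 16-bit intermediates, so the 4x4 routine is already the natural unit of work; with SSE a register holds eight, which is why the same decomposition would waste half of each register there.
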
[19:25:48] <BBB> right
[19:25:52] <BBB> I understood that much ;)
[19:26:00] <BBB> I'll probably work on a few mmx functions before I go to sse2
[19:26:02] <Dark_Shikari> Also, one of the later optimizations you'll be doing is minimizing loads
[19:26:09] <Dark_Shikari> Notice right now you reload a lot of pixels multiple times
[19:26:12] <Dark_Shikari> this hurts on pre-nehalem intel
[19:26:12] <BBB> right
[19:26:32] <BBB> I figure you can load 8 pixels at once, then shr(8) them for the next pixel?
[19:26:45] <BBB> and use the lower 4 bytes using punpcklXY
[19:27:06] <Dark_Shikari> pshufb is one way to do a ton of magic
[19:27:10] <Dark_Shikari> remember, you have to interleave them
[19:27:15] <Dark_Shikari> so if you can do an arbitrary byte-shuffle...
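
The "load 8 pixels once, then shift for the next pixel" idea can be modelled with a 64-bit integer. This is a scalar sketch with made-up helper names; it mimics a little-endian 8-byte load followed by a right shift by a multiple of 8 bits, and leaves out the interleaving step that the filter also needs.

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8 pixels into one 64-bit value with pixel 0 in the lowest byte,
 * the same layout an 8-byte load produces on little-endian x86. */
static uint64_t load8(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Four pixels starting n pixels into the row (0 <= n <= 4): shift right
 * by n bytes and keep the low 32 bits. */
static uint32_t window4(uint64_t row, int n)
{
    assert(n >= 0 && n <= 4);
    return (uint32_t)(row >> (8 * n));
}
```

One load then serves all the overlapping 4-pixel windows a horizontal filter needs, instead of one unaligned load per tap.
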
[19:32:16] <BBB> ok now it's used, and of course it's wrong, let me try and figure out how bad it is :-p
[19:33:22] <wbs> Yuvi: https://roundup.ffmpeg.org/issue2013 may need your attention
[19:33:43] <wbs> av500: yes, dtx is unsupported in the internal amrnb decoder, but if you pass -acodec libopencore_amrnb, it should work
[19:36:08] <av500> wbs: yes, I got that
[19:36:27] <av500> I need it on arm, so i'll have to see how to crossbuild it..
[19:36:45] <av500> but that should be fairly easy as android does that too
[19:37:07] <av500> btw, why are patch files on roundup: application/octet-stream
[19:37:32] <av500> shouldn't they be "text"?
[19:38:47] <mru> did someone just use android and easy in the same sentence?
[19:39:10] <Yuvi> wbs: patch should be ok
[19:39:26] <av500> mru: nice, eh? :)
[19:41:22] <Yuvi> wbs: actually, check ogg_packet.e_o_s
[19:44:54] <Yuvi> BBB: Dark_Shikari: okay, I'm satisfied that the filter can be done with only 16-bit saturating adds, like so: http://pastie.org/1009054
[19:45:56] <BBB> Yuvi: I still don't see why, the sum of all filter coeffs is 127, and pixel is 255 max, that's 15 bits
[19:45:59] <BBB> there's no negative coeffs
[19:46:17] <BBB> so I think it can all be done with much less trouble, no?
[19:46:25] <Yuvi> f1 and f4 are negative
[19:46:51] <Yuvi> and the positive coeffs add up to 160 in the worst case (77+77+3+3)
[19:47:12] <BBB> oh shit I didn't see that
[19:47:17] <BBB> that's probably my bug :-p
[19:47:47] <BBB> ok, so 160*255, got it
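
The worst case Yuvi points out is easy to compute. In the sketch below the positive taps (3, 77, 77, 3) are the ones quoted above; the two negative taps are filled in as -16 purely as an assumption for illustration, and they do not affect the positive partial sum.

```c
#include <stdint.h>

/* Largest partial sum reachable if only the positive taps have been
 * accumulated so far and every pixel is at its maximum value. */
static int32_t worst_case_positive(const int *taps, int n, int max_pixel)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        if (taps[i] > 0)
            sum += taps[i] * max_pixel;
    return sum;
}

/* Does the value fit in a signed 16-bit lane without saturating? */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Taps 0, 2, 3 and 5 are from the discussion; taps 1 and 4 are assumed. */
static const int demo_taps[6] = { 3, -16, 77, 77, -16, 3 };
```

`worst_case_positive(demo_taps, 6, 255)` is 160 * 255 = 40800, which exceeds 32767. That is why the order of the 16-bit adds matters: the negative taps have to be mixed in before the positive ones can push the running sum past the signed range, or the adds have to saturate.
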
[20:12:02] <BBB> how do I print the mmx registers?
[20:13:02] <Dark_Shikari> carefully
[20:13:10] <Honoome> lol
[20:13:54] <BBB> all-registers :)
[20:23:58] <Vitor1001> pengvado: Sorry about the stupid question, but what do you mean by alternating between two schedules?
[20:24:12] <Vitor1001> You are wondering why I need to do aligned loads?
[20:29:53] <Dark_Shikari> he means why are you doing one ordering of stuff in one place
[20:29:56] <Dark_Shikari> and another ordering in another place
[20:29:56] <Dark_Shikari> iirc
[20:31:14] <wbs> Yuvi: ok, so like this then? http://albin.abo.fi/~mstorsjo/0001-libvorbis-Only-drop-1-byte-packets-at-end-of-stream.patch
[20:31:54] <Vitor1001> Dark_Shikari: I still don't get what you mean by ordering.
[20:32:22] <Vitor1001> You mean the fact I reverse the vector or the weird 1-byte misalignment?
[20:32:34] <Dark_Shikari> instruction ordering
[20:33:26] <Vitor1001> Oh... Actually no idea. Before I started coding in asm I thought instruction ordering was important. Until I started benchmarking :p
[20:33:51] <Dark_Shikari> There's really no reason to pick a particular order in most cases
[20:33:52] <Dark_Shikari> Just be consistent.
[20:34:06] <Vitor1001> ok, good point
[20:39:52] <wbs> av500: I haven't looked into how to enable all the arm optimizations in opencore though
[20:43:35] <ZeZu> instruction ordering "can" be useful
[20:43:45] <ZeZu> depends on how deep you're getting into optimization
[20:43:54] <ZeZu> and the processor of course
[20:44:08] <mru> if you don't have out of order execution, it's _very_ important
[20:44:15] <ZeZu> for in-order execution processors that can still execute multiple instructions
[20:44:16] <ZeZu> yes
[20:44:31] <ZeZu> it's absolutely required for good opts
[20:44:36] <mru> and even with it, you must make sure to not overload the reorderqueue
[20:45:48] <ZeZu> and even in the case of deep pipelining and full out of order .. it can be usefull in a variety of cases , to eliminate stalls (that shouldn't happen anyways but do ..) and to keep instructions on word boundaries so you can use faster isntructions in other places
[20:46:20] <ZeZu> damn its time for a new keyboard again
[20:52:11] <BBB> yay my function is bitexact
[20:52:26] <BBB> only 5 or 6 mistakes in dq vs wd or stuff like that :)
[20:52:43] <BBB> I suppose I need to now test if it's faster?
[20:55:51] <_av500_> wbs: there might be none for amr
[20:57:56] <Vitor1001> ZeZu, mru, are there any x86 CPUs that don't support out-of-order execution?
[20:58:39] <mru> anything prior to ppro for sure
[20:58:52] <ZeZu> x86 isn't the only arch. I deal with,  but I believe some of the cheap embedded line may not,  amd geode .. but even that prob does
[20:59:09] <mru> atom?
[21:03:07] <BBB> Dark_Shikari: can I use ff_pw_64?
[21:03:16] <BBB> it's defined as xmm_reg_t, I don't know if I can touch that from mmx code
[21:06:54] <ZeZu> atom is out of order
[21:09:31] <ZeZu> VLIW / stream processors are a good example for optimizing instruction ordering, esp. for something like gpu
[21:10:25] <mru> vliw offers total static scheduling
[21:10:36] <mru> which is good for DSPs and such
[21:10:54] <ZeZu> shiny
[21:11:36] <ZeZu> instruction ordering is real fun for dynamic optimization,  if you want to do multi-pass anyhow
[21:17:15] <BBB> Dark_Shikari: and also, other places use [ff_pw_64] directly, they don't actually move it to a register, can I do that too? or bad idea?
[21:17:48] <BBB> oh n/m that, I misread
[21:17:51] <BBB> they do the same as me
[21:22:47] <peloverde> Friendly reminder, the PS patch is looking for review
[21:27:37] <BBB> didn't I review it already?
[21:28:01] <BBB> Dark_Shikari: also, what is the typical performance boost I should see for this function?
[21:28:07] <BBB> (or anyone, for that matter)
[21:28:15] <saintdev> BBB: he posted an updated patch a few days ago
[21:29:18] <BBB> the patch is 144k :-(
[21:29:51] <saintdev> o.O
[21:31:04] <saintdev> BBB: a little later there's a copy that uses tablegen that's only 73K
[21:31:09] <saintdev> :P
[21:31:13] <peloverde> BBB, yes you did, thanks
[21:31:15] <BBB> "only"
[21:31:20] <BBB> I'll look at it again
[21:31:23] <peloverde> The current version is half the size
[21:31:25] <peloverde> due to tablegen
[21:31:43] <saintdev> that's pretty cool :)
[21:32:54] <peloverde> There is a big nasty table used by both the AAC encoder and decoder that I really want to tablegen, but I'm not 100% sure about the best way to do it.
[21:33:48] <Dark_Shikari> BBB: yes you can use pw_64
[21:34:04] <Dark_Shikari> you can use it directly, but don't repeat loads unnecessarily unless you run out of regs
[21:34:09] <Dark_Shikari> it's fine to use it directly if you're only using it once
[21:34:14] <Dark_Shikari> typical performance boosts range a lot.
[21:52:38] <BBB> I'm seeing only a 10% increase
[21:52:55] <BBB> which is rather disappointing... maybe it's because it's only a 4x4 block?
[21:54:25] <Dark_Shikari> In that function?
[21:54:28] <Dark_Shikari> or overall?
[21:54:37] <Dark_Shikari> A normal increase is like 3x, 5x, 10x
[21:55:58] <BBB> that's what I expected also
[21:56:02] <BBB> in this specific function
[21:56:17] <Dark_Shikari> well your cpu does just suck
[21:56:24] <BBB> probably
[21:56:27] <Dark_Shikari> also pastebin the function again
[21:56:58] <BBB> I probably count wrong
[21:57:02] <BBB> let me recount just to be sure
[21:57:05] <BBB> then I'll pastebin it
[21:57:11] <BBB> I think my START_TIMER is placed wrongly
[21:57:21] <Dark_Shikari> You know that start_timer isn't normalized in ffmpeg, right?
[21:57:26] <Dark_Shikari> that is, it doesn't subtract out the cost of an empty timer.
[21:57:31] <Dark_Shikari> You have to do that yourself.
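
Subtracting the empty-timer cost by hand amounts to the arithmetic below. This is a sketch with hypothetical helper names, and the cycle counts in the usage note are invented numbers for illustration, not measurements from this session.

```c
#include <stdint.h>

/* Cycles attributable to the code itself: the measured count minus the
 * cost of an empty start/stop pair, clamped at zero. */
static uint64_t net_cycles(uint64_t measured, uint64_t empty_overhead)
{
    return measured > empty_overhead ? measured - empty_overhead : 0;
}

/* Speedup of the asm version over the C version after normalization.
 * Returns 0.0 if the asm measurement is entirely timer overhead. */
static double speedup(uint64_t c_measured, uint64_t asm_measured,
                      uint64_t empty_overhead)
{
    uint64_t c_net   = net_cycles(c_measured, empty_overhead);
    uint64_t asm_net = net_cycles(asm_measured, empty_overhead);
    return asm_net ? (double)c_net / (double)asm_net : 0.0;
}
```

For example, with a hypothetical 50-cycle empty timer, raw counts of 350 and 170 look like a 2.06x gain, but `speedup(350, 170, 50)` is 300 / 120 = 2.5. The smaller the function, the more the fixed overhead hides the real ratio.
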
[21:58:20] <BBB> that's fine
[21:58:30] <BBB> I did it for all 4x4 mx&1==1 functions
[21:58:33] <BBB> not just those with my==0
[21:58:39] <BBB> now it's about 2.5x faster
[21:58:43] <Dark_Shikari> Yes, that sounds about right.
[21:58:50] <BBB> I'll pastebin, 1 second
[22:00:49] <BBB> http://ffmpeg.pastebin.com/XUBhFPa7
[22:01:27] <BBB> I'm not using your average-function trick yet, have to look at that
[22:01:30] <Dark_Shikari> fix your constant array
[22:01:33] <BBB> ?
[22:01:37] <Dark_Shikari> i.e. to not do the punpcks on init
[22:01:38] <Dark_Shikari> oh
[22:01:39] <Dark_Shikari> wait
[22:01:40] <Dark_Shikari> you did
[22:01:41] <Dark_Shikari> wait what?
[22:01:52] <Dark_Shikari> oh, I see
[22:01:54] <Dark_Shikari> nevermind, I'm blind.
[22:02:22] <Dark_Shikari> reorder the movds and punpck like I said
[22:02:26] <Dark_Shikari> i.e. movd/movd/punpck/punpck
[22:02:28] <BBB> oh right
[22:02:29] <BBB> ok
[22:03:24] <Dark_Shikari> what's with the sub r0, r1?
[22:03:30] <Dark_Shikari> dst and src are guaranteed to have the same stride?
[22:03:35] <BBB> yes
[22:03:41] <Dark_Shikari> Since my isn't used, you only need 5,5
[22:03:45] <BBB> ok
[22:04:10] <BBB> if I don't use mx, can I somehow convince it to not store it in a register?
[22:04:22] <Dark_Shikari> explain?
[22:04:28] <Dark_Shikari> and why do you use 6,6,2?  you don't use any xmm regs
[22:04:41] <BBB> I thought it was for mm%d regs
[22:04:52] <BBB> how many mm regs are there?
[22:04:54] <Dark_Shikari> the third number is for xmm
[22:04:57] <Dark_Shikari> there are 8 mm regs
[22:05:02] <BBB> oh, just right :)
[22:05:05] <Dark_Shikari> you can use those without telling it
[22:05:11] <Dark_Shikari> it should be 5,5 (with no 2)
[22:05:15] <BBB> ok, changed
[22:05:17] <Dark_Shikari> now, what's your issue with mx?
[22:05:37] <BBB> if I write the v4 variant of this
[22:05:48] <Dark_Shikari> which uses my but not mx?
[22:05:54] <BBB> yes
[22:05:56] <Dark_Shikari> Here's what you do
[22:05:58] <Dark_Shikari> 1) 4,4
[22:06:00] <Dark_Shikari> er, i mean
[22:06:02] <Dark_Shikari> 4,5
[22:06:11] <Dark_Shikari> 2) mov r4, r5m
[22:06:15] <Dark_Shikari> :)
[22:06:23] <BBB> r5m = ?
[22:06:28] <Dark_Shikari> memory location of r5 on the stack
[22:06:30] <Dark_Shikari> actually that's suboptimal
[22:06:33] <Dark_Shikari> what you _should_ do is
[22:06:46] <Dark_Shikari> %ifidn r5, r5m
[22:06:53] <Dark_Shikari> %define my r5
[22:06:55] <Dark_Shikari> %else
[22:07:01] <Dark_Shikari> mov r4, r5m
[22:07:04] <Dark_Shikari> %define my r4
[22:07:05] <Dark_Shikari> %endif
[22:07:18] <Dark_Shikari> On x86_64, r5 == r5m and there's no pushing necessary to get it
[22:07:21] <Dark_Shikari> so you don't want to do the redundant move
[22:07:29] <Dark_Shikari> %ifidn == if identical
[22:07:40] <BBB> omg you are crazy... ok :)
[22:08:02] <BBB> what about the rest of the mmx func?
[22:08:28] <Dark_Shikari> btw, that's an example of register munging being necessary to get _absolutely_ optimal code on all arches.
[22:09:06] <Dark_Shikari> By the way, why don't you pass mx + my<<2 or something?
[22:09:13] <Dark_Shikari> I guess that would end up being more ops.
[22:09:14] <Dark_Shikari> meh
[22:09:18] <BBB> yeah
[22:09:19] <Dark_Shikari> the rest of the asm looks good.
[22:09:27] <Dark_Shikari> dec r3 isn't aligned
[22:09:31] <Dark_Shikari> 158 isn't aligned
[22:09:37] <Dark_Shikari> the instructions should be aligned on commas
[22:09:48] <Dark_Shikari> 122 too
[22:09:51] <Dark_Shikari> at least that's how I do it
[22:10:34] <BBB> dec r3 is an oops :)
[22:10:44] <BBB> the rest, if that's how you do it, I'll change it
[22:11:06] <BBB> so then it's e.g. (122) sub r0_,r1?
[22:11:18] <BBB> with a space before and after the comma
[22:11:26] <BBB> or just sub r0,doublespacer1?
[22:11:34] <Dark_Shikari> no
[22:11:34] <BBB> let me check your x264 code
[22:11:37] <Dark_Shikari> sub  r0, r1
[22:11:45] <Dark_Shikari> i.e. the r0 starts in a different place
[22:11:48] <BBB> oh, so you push r0 forward
[22:11:48] <BBB> ok
[22:12:02] <Dark_Shikari> paste it again when you're done
[22:14:14] <BBB> http://ffmpeg.pastebin.com/vguDqCar <- just vp8dsp.asm
[22:14:30] <BBB> also cleaned up the top of the file a bit, and removed the C table
[22:15:34] <Dark_Shikari> btw, look at vp8_filter_block1d_h6_mmx in libvpx/vp8/common/x86/subpixel_mmx.asm
[22:16:08] <BBB> well that's cheating :-p
[22:16:10] <Dark_Shikari> now think to yourself how much fucking better yours is.
[22:16:21] <Dark_Shikari> Except for the shifting bit at the start, but you can copy that.
[22:16:29] <Dark_Shikari> Yours is like half the size.
[22:16:44] <Dark_Shikari> and probably faster.
[22:17:11] <Dark_Shikari> ok, true, theirs does 8 pixels instead of 4.
[22:17:24] <Dark_Shikari> or wait... no it doesn't
[22:17:28] <Dark_Shikari> wait, theirs is totally fucked up
[22:17:28] <Dark_Shikari> WTF
[22:17:32] <Dark_Shikari>         packuswb    mm3,    mm0              ; pack and unpack to saturate
[22:17:33] <Dark_Shikari>         punpcklbw   mm3,    mm0              ;
[22:17:37] <Dark_Shikari> LOL
[22:17:43] <Dark_Shikari> AHAHAHAHAHAHAHAHAHAHHAHAHAHAHAHAHAHAHAHAHA
[22:17:46] <Dark_Shikari> mru: oh god
[22:19:11] <BBB> they calculate a sixtap, even if it's a fourtap
[22:19:15] <BBB> that's why it's so weird
[22:19:16] <Dark_Shikari> not just that
[22:19:19] <Dark_Shikari> no that isn't the only thing
[22:19:25] <Dark_Shikari> They have 4 pixels, in 16-bit
[22:19:27] <Dark_Shikari> they pack to 8-bit
[22:19:29] <Dark_Shikari> and then they unpack again
[22:19:30] <Dark_Shikari> and store
[22:19:41] <BBB> isn't that a little retarded?
[22:19:44] <Dark_Shikari> Yes.
[22:19:46] <BBB> :)
[22:19:49] <Dark_Shikari> Remember when I said "retarded monkeys"?
[22:19:52] <Dark_Shikari> That.
[22:19:58] <Yuvi> would pmullw be faster here than pmaddwd?
[22:20:08] <Dark_Shikari> Yuvi: pmaddwd is the same speed and gives you a free add
[22:20:11] <Dark_Shikari> and saves you two registers
[22:20:19] <Yuvi> but you're adding 0 aren't you?
[22:20:23] <Dark_Shikari> no
[22:20:28] <Dark_Shikari> note the interleaving
[22:20:56] <BBB> Yuvi: I'm doing several pixels at the same time, to take advantage of pmaddwd
[22:21:08] <Yuvi> hm, so punpck -> pmadd -> paddd vs. pmullw -> paddsw ?
[22:21:23] <Dark_Shikari> Yuvi: pmaddwd requires
[22:21:25] <Yuvi> BBB: you're always doing that in simd though
[22:21:42] <Dark_Shikari> 4x punpck, 4x pmadd, 2x padd
[22:21:45] <Dark_Shikari> pmullw requires
[22:21:54] <Dark_Shikari> 4x punpck, 4x pmullw, 4x padd
[22:21:56] <Dark_Shikari> and 2 more registers
[22:22:26] <Dark_Shikari> because you need 4 registers for masks instead of 2
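
The "free add" of pmaddwd comes from the interleaving: each 32-bit lane holds two neighbouring pixels, is multiplied by a pair of taps, and the two products are summed in the same instruction. Below is a scalar sketch of one output pixel of a 4-tap filter done that way. The helper names are hypothetical, and the taps in the usage note are example values assumed for illustration (they sum to 128).

```c
#include <stdint.h>

/* One 32-bit lane of pmaddwd: two signed 16-bit products, summed. */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

static uint8_t demo_clip_u8(int32_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* 4-tap filter over src[-1..2] using two pmaddwd lanes and one add.
 * Assumes >> on a negative int is an arithmetic shift. */
static uint8_t filter4_pmaddwd(const uint8_t *src, const int16_t f[4])
{
    int32_t lo = pmaddwd_lane(src[-1], src[0], f[0], f[1]);
    int32_t hi = pmaddwd_lane(src[1],  src[2], f[2], f[3]);
    return demo_clip_u8((lo + hi + 64) >> 7);
}

/* Plain reference: four multiplies and three adds (the pmullw shape). */
static uint8_t filter4_scalar(const uint8_t *src, const int16_t f[4])
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += src[i - 1] * f[i];
    return demo_clip_u8((sum + 64) >> 7);
}
```

With taps such as `{ -6, 123, 12, -1 }`, both functions return the same pixel; the pmaddwd form simply folds half of the adds into the multiply and keeps the intermediate in 32 bits, so no saturation tricks are needed.
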
[22:22:28] <BBB> hmm... wife is calling
[22:23:07] <BBB> Yuvi: don't worry, I won't commit yet, this is a little useless
[22:23:12] <BBB> it was just the easiest to implement
[22:23:36] <BBB> I'll work on h6, the v4/6 and the h4/6v4/6 variants also
[22:24:12] <Dark_Shikari> this is a very good start
[22:24:15] <Dark_Shikari> top-quality asm function
[22:24:15] <BBB> I can hopefully reuse a little of this code in some others
[22:24:17] <Dark_Shikari> well optimized.
[22:24:31] <Dark_Shikari> you will be able to do h6 without spilling
[22:24:45] <Dark_Shikari> since you can do pmaddwd with memory if needed, and round with memory
[22:24:46] <Dark_Shikari> to save regs
[22:25:06] <BBB> I can simply do the same I do here right?
[22:25:11] <BBB> just repeat it three times instead of two
[22:25:14] <Dark_Shikari> yeah
[22:25:18] <BBB> since I add mm1 to mm0, I simply reuse mm1
[22:25:36] <BBB> I just need one more reg for the 5th/6th filter coeffs
[22:26:03] <BBB> I'll round with memory then
[22:26:05] <BBB> simplest
[22:26:11] <BBB> ok, that's tonight/tomorrow
[22:26:13] <BBB> off for now
[22:26:18] <saintdev> what is spilling? haven't been able to figure that out from context yet.
[22:27:11] <mru> saving regs to stack
[22:28:10] <Dark_Shikari> spilling is what you have to do when you run out of registers
[22:28:13] <Dark_Shikari> BBB: or you use pavgw trick
[22:28:17] <Dark_Shikari> and then you don't need a constant at all
[22:28:25] <Yuvi> Dark_Shikari: http://pastie.org/1009306 <- like that should be the same, no?
[22:28:29] <saintdev> effectively a push/pop with a simd reg?
[22:28:48] <Yuvi> which has the disadvantage of more loads for the filter
[22:29:54] <Yuvi> or am I missing a different reason why pmaddwd is faster here?
[22:30:59] <Dark_Shikari> Yuvi: fewer ops
[22:31:00] <Dark_Shikari> period
[22:31:07] <Dark_Shikari> fewer adds, fewer memory references
[22:31:30] <Yuvi> I'm not seeing the fewer adds
[22:31:50] <Dark_Shikari> oh you're right
[22:31:53] <Dark_Shikari> it's just fewer regs
[22:31:57] <Dark_Shikari> Either way, it's certainly not worse.
[22:32:09] <Dark_Shikari> and when he moves to pmaddubsw, it'll be easier to base that code on the current code
[22:33:37] <Dark_Shikari> same with sse
[22:33:44] <Dark_Shikari> with sse, this trick will let a 4x4 block be done incredibly quickly
[22:33:49] <Dark_Shikari> with just two multiplies
[22:33:57] <Dark_Shikari> per row
[22:34:47] <CIA-98> ffmpeg: cehoyos * r23640 /trunk/libavfilter/vsrc_buffer.c:
[22:34:47] <CIA-98> ffmpeg: Use enum PixelFormat to silence one icc warning:
[22:34:47] <CIA-98> ffmpeg: warning #188: enumerated type mixed with another type
[22:34:47] <CIA-98> ffmpeg:  enum PixelFormat pix_fmts[] = { c->pix_fmt, PIX_FMT_NONE };
[22:34:47] <CIA-98> ffmpeg:  ^
[22:34:57] <Yuvi> true, pmaddubsw will work a lot better
[22:36:00] <Dark_Shikari> pmaddubsw will let us do an 8-pixel row in two multiplies
[22:36:17] <Dark_Shikari> one 16-byte load, two pshufb, two multiplies, one add
[22:36:37] <Dark_Shikari> compared to the current 8xpack 8xmult 4xadd or so
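
pmaddubsw multiplies unsigned bytes by signed bytes and adds adjacent products into a saturated signed 16-bit result, so the pixels never need unpacking to words. A scalar model of one output lane, with a hypothetical helper name, looks like this:

```c
#include <stdint.h>

/* One 16-bit lane of pmaddubsw: unsigned byte times signed byte, two
 * adjacent products added together with signed 16-bit saturation. */
static int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1)
{
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Because the pair sum saturates, which taps get paired matters: two large positive taps on bright pixels (for example 255 * 127 + 255 * 127) would clip at 32767, whereas pairing a negative tap with a positive one keeps the lane in range.
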