[FFmpeg-devel-irc] IRC log for 2010-06-23

irc at mansr.com irc at mansr.com
Thu Jun 24 02:00:12 CEST 2010


[00:00:11] <Honoome> mru: http://paste.pocoo.org/show/228757/ sorted in .bss size, ascending ;)
[00:00:59] <mru> feel free to make them smaller
[00:01:44] <Honoome> I guess the only way would be to write more hardcoded tables, thus have more of them generate their hardcoded tables :P
[00:02:28] <mru> which is usually even worse
[00:02:41] <mru> except on mmu-less systems
[00:02:50] <michaedw> mru: I work for this person: http://us.experteer.com/job_catalog/job/358969
[00:03:13] <mru> ugh, a suit
[00:03:44] * mru works for himself
[00:03:46] <Honoome> mru: well it reduces the per-process resident memory, for systems with <=4GB of RAM and the amount of processes that load ffmpeg nowadays, it's not too bad
[00:03:49] <michaedw> she's a couple notches up from the typical Cisco suit, at least from my interactions with her so far
[00:04:07] <michaedw> ok, you got me there.  I work for her, insofar as I work for anyone.  :-)
[00:04:17] <mru> Honoome: most ffmpeg runs will use only a small number of codecs
[00:06:16] <mru> reminds me, my contract with ARM was signed by the CEO
[00:06:23] <mru> I wasn't expecting that
[00:06:51] <kierank> mechanical pen ;)
[00:07:16] <mru> two documents, not the exact same signature
[00:08:30] * mru deletes some more cruft
[00:09:11] <CIA-99> ffmpeg: mru * r23731 /trunk/ (configure libavcodec/os2thread.c libavcodec/Makefile):
[00:09:11] <CIA-99> ffmpeg: Remove OS/2 threads support
[00:09:11] <CIA-99> ffmpeg: OS/2 SMP support is rare, and a pthreads library exists.
[00:09:11] <CIA-99> ffmpeg: No need to keep this code.
[00:11:43] <Dark_Shikari> michaedw: fyi
[00:12:00] <Dark_Shikari> michael offered to do that stupid baseline feature... what was it
[00:12:01] <Dark_Shikari> FMO
[00:12:03] <Dark_Shikari> for $10k
[00:12:42] <Kovensky> that's how much he wants it done?
[00:13:26] <Dark_Shikari> no, that's how much he wants to do it
[00:13:37] <Dark_Shikari> *how much money that he wants in order to do it
[00:13:42] <Dark_Shikari> someone asked him for FMO
[00:14:03] <Kovensky> the more he asks, the more he doesn't want to do
[00:14:11] <Kovensky> :>
[00:14:39] <spaam> FMO?
[00:14:41] <Kovensky> if he actually wanted to do it he'd have done it already :)
[00:14:47] <mru> flexible macroblock ordering
[00:14:51] <spaam> ok :)
[00:15:14] <michaedw> that would also be interesting
[00:15:18] <Honoome> from the name one could guess why he might not want to do that... the "flexible" sounds like a buzzword
[00:15:26] <Kovensky> and it's a baseline-only feature
[00:15:52] <michaedw> my impression is that there's more value in GDR
[00:15:58] <mru> what's the point of having features baseline-only?
[00:16:00] <michaedw> but I haven't really applied science to it yet :-)
[00:16:18] <Dark_Shikari> it lets you have slices of arbitrary shape
[00:16:19] <Dark_Shikari> (in MBs)
[00:16:30] <Dark_Shikari> useful application (would be more useful if not baseline-only): rectangular slices
[00:16:33] <Dark_Shikari> i.e. 4 squares
[00:16:48] <Dark_Shikari> to maximize (area) / (perimeter)
[00:17:07] <mru> I can see some use for fmo as such
[00:17:10] <michaedw> it's in extended also
[00:17:14] <michaedw> just not in main
[00:17:16] <Dark_Shikari> extended doesn't exist
[00:17:17] <mru> but why baseline only?
[00:17:28] <Dark_Shikari> mru: to make main/high less stupidly complicated
[00:19:23] <michaedw> I'm far from expert in this area; I'm mostly just looking for low-hanging fruit relative to where we are today
[00:19:50] <Dark_Shikari> the h264 decoder has been optimized pretty heavily
[00:20:02] <michaedw> preferably things that can be tried, on an experimental basis, without mucking about in the hardware acceleration we've got
[00:21:34] <Dark_Shikari> You could come try to optimize x264 as well, though we've got that one locked up pretty tight optimization-wise =p
[00:21:42] <michaedw> that's where the data partitioning recoder idea came from; unpeel the bytestream we've got down to the macroblocks, split up according to the DP scheme, apply UEP
[00:23:18] <michaedw> the experiment's been done, with a crude but not wholly irrelevant metric: http://users.elis.ugent.be/~pbertels/zijspoor/phd/2006_a082_Stefaan_Mys.pdf
[00:23:34] <mru> http://article.gmane.org/gmane.comp.video.ffmpeg.devel/104238
[00:24:16] <Dark_Shikari> there's also that one I posted a while back
[00:24:35] <michaedw> that's good guidance, thank you
[00:24:59] <mru> http://thread.gmane.org/gmane.comp.video.ffmpeg.devel/79246
[00:26:40] <michaedw> mm, the golomb simplification is interesting
[00:27:01] <michaedw> especially since in DP all the golomb is in DP A
[00:27:12] <Dark_Shikari> that was already done, it's not as good as I thought
[00:27:15] <Dark_Shikari> because I was thinking encoder-side
[00:27:19] <astrange> if x264 split loop filter strength and actual filtering, port that
[00:27:29] <Dark_Shikari> ffmpeg already does that
[00:27:32] <Dark_Shikari> loop_filter_fast
[00:27:33] <Dark_Shikari> oh
[00:27:40] <astrange> skal's idea
[00:27:43] <Dark_Shikari> oh actually that change is huge, yeah, that should be #1 priority
[00:27:48] <Dark_Shikari> skal's idea is bloody brilliant
[00:28:09] <michaedw> link?
[00:28:42] <Dark_Shikari> it's simple:
[00:28:55] <Dark_Shikari> 1) calculate deblock strength while the nnz, mvs, etc are still in the cache structure
[00:29:09] <Dark_Shikari> (this requires some reloading of the cache if deblock-across-slice-edges is on, and we're on a slice edge)
[00:29:14] <Dark_Shikari> 2) store the results
[00:29:23] <Dark_Shikari> 3) once the row is done, deblock it using the stored strength values
[00:29:32] <Dark_Shikari> currently, we deblock per-row in ffmpeg, but it does the strength calculation per-row too
[00:29:37] <Dark_Shikari> instead of doing it when the values are still in the cache
[00:30:06] <Dark_Shikari> this is a bit tricky with all the stupid rules in h264.
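A minimal C sketch of the two-pass scheme described above; decode_mb, calc_strength, filter_mb and the bS layout are placeholders, not the actual ffmpeg h264 structures:

    #include <stdint.h>

    typedef struct MBRow MBRow;               /* decoder row state (assumed) */
    void decode_mb(MBRow *row, int mb_x);     /* fills the nnz/mv cache      */
    void calc_strength(MBRow *row, int mb_x, uint8_t bS[2][16]);
    void filter_mb(MBRow *row, int mb_x, uint8_t bS[2][16]);

    static void decode_row(MBRow *row, int mb_width, uint8_t (*bS)[2][16])
    {
        for (int x = 0; x < mb_width; x++) {
            decode_mb(row, x);             /* nnz, mvs are hot right now          */
            calc_strength(row, x, bS[x]);  /* pass 1: compute and store strengths */
        }
        for (int x = 0; x < mb_width; x++)
            filter_mb(row, x, bS[x]);      /* pass 2: deblock with stored values  */
    }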
[00:31:20] <michaedw> hmm, that's interesting; even more interesting to me is the instrumentation for measuring the effect
[00:31:32] <Dark_Shikari> rdtsc?
[00:31:37] <Dark_Shikari> instrumentation is easy
[00:31:48] <astrange> grep START_TIMER
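The timing macros being referred to come from FFmpeg's libavutil/timer.h and read the CPU timestamp counter (rdtsc on x86); a minimal sketch of how a block is typically instrumented, with the timed function being only a placeholder:

    #include "libavutil/timer.h"               /* START_TIMER / STOP_TIMER */

    {
        START_TIMER
        some_dsp_function(dst, src, stride);   /* placeholder for the code under test */
        STOP_TIMER("some_dsp_function")        /* logs a running average cycle count  */
    }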
[00:32:05] <michaedw> specifically, the cache behavior
[00:33:12] <Dark_Shikari> well, this saves some cache, but that's not the primary purpose
[00:33:12] <michaedw> we have other crap going on in the system, and it may be worth going to some effort to keep ffmpeg (if we go that route) from being evicted when it's working with that cache structure
[00:33:15] <Dark_Shikari> here "cache" does not mean "CPU cache"
[00:33:19] <michaedw> understood
[00:33:36] <Dark_Shikari> this would be a win on a CPU with 10 gigs of L1 cache
[00:34:04] <mru> can I have one of those?
[00:34:22] <Dark_Shikari> mru: easy.  make a cpu with no cache, attach to 10 gigs of ram
[00:34:26] <Dark_Shikari> the ram is now the first-level cache ;)
[00:34:27] <Honoome> rotfl
[00:34:59] <mru> so I just need to disable the caches on my i7
[00:35:09] <astrange> hm, patch on the gcc list from amd today uses dct_unquantize_h263_inter_c as the testcase
[00:35:10] <Dark_Shikari> and add a few more gigs of ram
[00:35:17] <Dark_Shikari> astrange: link?
[00:35:21] <mru> astrange: link?
[00:35:23] <Dark_Shikari> Oh, THAT function
[00:35:26] <Dark_Shikari> wasn't that the one that broke AVR32?
[00:35:28] <mru> yep
[00:36:42] * Honoome is tempted to implement FIONREAD syscall for SCTP tonight
[00:37:05] <Honoome> too bad that I can't unload sctp module for some reason =_=
[00:37:28] <mru> open socket?
[00:37:31] <michaedw> We've got some folks here who are spinning up on measurement techniques on embedded Linux and need good examples to live through
[00:37:46] <astrange> http://article.gmane.org/gmane.comp.gcc.devel/115166/ can't find the patch thread on gmane yet
[00:37:58] <astrange> and they must have turned tree-vectorize back on?
[00:38:25] <mru> turning it on to work on gcc is excusable
[00:38:25] <Honoome> mru: feng is closed and I don't think anything else uses sctp on my system
[00:38:32] <mru> Honoome: netstat
[00:38:41] <Dark_Shikari> >dct kernel
[00:38:54] <Honoome> I guess I could just set up a slackware vm to work on the kernel ...
[00:38:55] <mru> that's not a dct
[00:39:08] <michaedw> be aware that netstat tells you some process that has the socket open, not necessarily the only one
[00:39:12] <Dark_Shikari> it has DCT in the name!11!
[00:39:21] <Dark_Shikari> Fuck, I'm way too used to using pmaddubsw
[00:39:24] <Dark_Shikari> I start writing an sse2 MC function
[00:39:29] <michaedw> we had fun recently with missing O_CLOEXEC flags
[00:39:29] <Dark_Shikari> and I inadvertently write an ssse3 one instead
[00:39:35] <Dark_Shikari> without even realizing it
[00:39:40] <Honoome> netstat reports no sctp sockets open.. WTH
[00:39:57] <mru> lsmod shows non-zero use count?
[00:40:04] <Honoome> yah
[00:40:16] <mru> no depending modules?
[00:40:36] <mru> hmm... intriguing
[00:40:49] <mru> on avr32 it did move the mult out of the branch
[00:40:58] <mru> that's how it triggered the hw bug
[00:42:02] <astrange> hmm it would be better to put block[i] = level in both arms of the if/else instead of afterwards
[00:42:18] <Dark_Shikari> It might be better to just remove/restore sign
[00:42:22] <astrange> decode_cabac_residual does that and it saves one branch (since they can't both fallthrough to it)
[00:42:31] <Dark_Shikari> pabsw/psignw
[00:43:41] <michaedw> Dark_Shikari: can x264 do constrained intra prediction?
[00:43:59] <Dark_Shikari> yes
[00:44:02] <Dark_Shikari> --constrained-intra
[00:44:03] <Dark_Shikari> read the help
[00:45:15] <michaedw> is that relevant only when doing GDR?
[00:46:03] <Dark_Shikari> ?
[00:46:19] <michaedw> sorry, I guess this is more #ffmpeg material
[00:46:32] * Honoome laughs because GDR in Italian means "RPG" :P
[00:46:34] <Dark_Shikari> I have no idea what GDR is.
[00:46:36] <Dark_Shikari> Feel free to ask here
[00:46:40] <Dark_Shikari> I'm just wondering wtf you're on about.
[00:47:28] <mru> this is a typical case of gcc optimisation fragility
[00:47:40] <mru> many of their optimisations seem to be greedy
[00:48:19] <mru> so a small optimisation can inhibit a much better one just because it happens to be applied first
[00:49:15] <michaedw> "gradual decoding refresh", and actually it's not quite what I meant; I meant to say, constrained intra is only relevant when you have frames with mixed I and P slices, right?
[00:49:24] <saintdev> mru: kind of sounds like what you were saying about MN yesterday :P
[00:50:24] <Dark_Shikari> michaedw: no, frames with mixed I and P blocks
[00:50:25] <Dark_Shikari> i.e. every frame
[00:50:31] <Dark_Shikari> saintdev: hahahahah
[00:50:36] <Dark_Shikari> michael is a greedy optimization ;)
[00:51:49] <michaedw> right, so not relevant for IDRs, which is kind of what I meant in the first place :-P
[00:53:12] <j0sh_> i've only seen gdr in video calls. makes sense that it'd be used there
[00:53:13] * Honoome hates on libvirt, virt-manager, redhat, ...
[00:57:17] <mru> Honoome: maybe you'll like this :-) http://rwmj.wordpress.com/
[00:57:37] <Honoome> yes I know Rich
[00:57:58] <michaedw> j0sh_: low-latency streaming generally
[00:58:08] <michaedw> IDR frames are pretty big lumps
[00:58:15] <j0sh_> yup
[00:58:24] <Dark_Shikari> it's called intra refresh
[00:59:25] <michaedw> Dark_Shikari: sure; I'm just looking for a mode that my hardware codec can do and I can still post-process into a DP stream
[00:59:43] <Dark_Shikari> hardware codec?  why use a hardware codec and then try to postprocess it?
[00:59:48] <Dark_Shikari> why not just use a good encoder to begin with?
[01:00:04] <michaedw> Dark_Shikari: that's the point that I'm trying to prove
[01:00:19] <Dark_Shikari> ?
[01:00:38] <michaedw> first, what's achievable without re-plumbing things so that we can use a current codec
[01:00:52] <michaedw> and second, how much more is achievable without that hobble
[01:00:53] <Dark_Shikari> I don't see how data partitioning comes into this
[01:00:58] <michaedw> FEC
[01:01:11] <Dark_Shikari> The whole point of FEC is so you don't need DP.
[01:01:26] <michaedw> unequal degrees of FEC on different data partitions
[01:01:38] <mru> with dp you can apply more error correction to the important parts
[01:02:01] <michaedw> right -- and pick different overhead/latency trade-offs
[01:02:27] <Dark_Shikari> Yeah, except for that whole "intra prediction means that dct coeffs become the important part"
[01:02:41] <Dark_Shikari> er, latency?  you can't have more latency on some parts of the frame than others
[01:02:44] <Dark_Shikari> that doesn't make sense
[01:02:51] <Dark_Shikari> you can't decode a frame until you have all the parts: coeffs, modes, etc
[01:03:20] <Dark_Shikari> SVC seems like a much better option if you're trying to come up with reasonable FEC strategies for 40% loss rates
[01:03:33] <michaedw> DP A is small enough that we can probably afford to spend bandwidth on FEC over small sets
[01:03:33] <Dark_Shikari> An equally good option would be taking your connection and burning it with napalm
[01:03:53] <michaedw> the problem isn't the overall loss rate, it's the burstiness
[01:04:06] <Honoome> ... oh god I've done the worst thing I could do...
[01:04:27] <Dark_Shikari> if loss is bursty, it can be resolved with interactive encoder control
[01:04:32] <mru> Honoome: that's good, now it can only get better
[01:04:43] <michaedw> partition C is the last to arrive, and the cheapest to drop on the floor
[01:04:49] <Honoome> mru: I decided to watch Apple's WWDC keynote...
[01:05:01] <mru> Honoome: where did you get such a crazy idea?
[01:05:20] <Dark_Shikari> michaedw: why not use SVC if you're trying to do that?
[01:05:20] <Honoome> there's the zynga guy... now I feel a primal instinct to kill all the people who waste their time and money on such a dolt...
[01:05:24] <Dark_Shikari> that's actually built for that
[01:05:41] <mru> Honoome: I don't watch _any_ keynotes
[01:05:44] <Honoome> it's a matter of species improvement... my instinct i mean
[01:05:47] <michaedw> turns out it's not very useful for talking heads, at least from what I've read / been told
[01:06:08] <Honoome> mru: I generally tend to, good thing to know what they try to shove down your throat and _how_ they do it...
[01:06:31] <mru> I ignore apple as much as possible
[01:06:54] <Dark_Shikari> michaedw: what isn't
[01:06:58] <michaedw> SVC
[01:07:03] <Honoome> mru: I would switch over to apple development rather than rails development
[01:07:06] <Dark_Shikari> for talking heads, I'd use interactive encoder control
[01:07:10] <Dark_Shikari> if you have a drop in frame N
[01:07:15] <mru> Honoome: I refuse to do either
[01:07:15] <Dark_Shikari> tell the encoder to invalidate frames N and afterwards
[01:07:18] <Dark_Shikari> and use a reference frame from before N
[01:07:24] <mru> if I can't do it in linux, I don't do it
[01:08:00] <michaedw> apple has been known to raise the bar for fit-and-finish, which is healthy
[01:08:10] <mru> bull
[01:08:22] <Honoome> wish I could do as much :/ not good enough to pretend though
[01:08:22] <Dark_Shikari> >apple
[01:08:24] <Dark_Shikari> >raise the bar
[01:08:27] <Dark_Shikari> *headdesk*
[01:08:44] <mru> who declared copy&paste unnecessary?
[01:08:49] <mru> .. only to backtrack later
[01:08:55] <Dark_Shikari> who declared high quality video encoding unnecessary?
[01:08:56] <mru> who declared multitasking useless?
[01:09:03] <mru> ... only to backtrack
[01:09:11] <mru> who uses 20-year old dev tools?
[01:09:16] <mru> and has yet to update
[01:09:16] <Dark_Shikari> 22!
[01:09:23] <michaedw> and the sheer longevity of their hardware is impressive; the freebie iPod mini I gave my daughter is still going strong
[01:09:28] <astrange> the new dev tools are called llvm-mc
[01:09:37] <Kovensky> <@mru> who declared copy&paste unnecessary? <-- didn't microsoft do it for winmo7 too
[01:09:40] <Dark_Shikari> the ipod mini hasn't been around long enough to have "longevity"
[01:09:41] <mru> who charges developers money to code for their platform?
[01:09:53] <Dark_Shikari> tell me about longevity when it's still around in 2035
[01:10:07] <Dark_Shikari> until then, I'll enjoy the imacs in our computer lab
[01:10:11] <Dark_Shikari> which overheat and crash almost daily
[01:10:18] <Dark_Shikari> you can burn yourself by touching their case
[01:10:26] <michaedw> 6-year-old spinning rust handled daily by a 5-year-old
[01:10:58] <Honoome> mru: technically, Microsoft as well..
[01:11:11] <michaedw> doesn't compare with the RS6000 gear I parted with recently, but this is consumer electronics
[01:11:19] <Dark_Shikari> my TI-92 lasted longer than that.
[01:11:20] <mru> but who was f1rst with the "innovation"?
[01:12:05] <michaedw> they do have some of the stupidest billboards in creation, I grant
[01:12:46] * Dark_Shikari can't keep track of left/right-shifts of entire registers in simd on little-endian
[01:12:53] <Honoome> mru: and you forgot "who declared tethering useless (and then backtrack)"?
[01:13:16] <Dark_Shikari> who declared unlimited data to be great
[01:13:18] <Dark_Shikari> and then backtrack
[01:13:38] <astrange> not apple?
[01:13:42] <mru> google did
[01:13:45] <Honoome> Dark_Shikari: half the mobile telcos on the planet?
[01:13:49] <astrange> there are other countries without at&t
[01:14:03] <mru> android <2.2 doesn't have tethering built-in
[01:14:14] <michaedw> Dark_Shikari: I like my HP-35 :-)
[01:14:14] <mru> it's always been possible with 3rd-party apps
[01:14:48] <michaedw> although I think my brother ran off with it last time he visited
[01:14:49] * mru is currently angry with vodafone for raising data roaming price 5x
[01:15:01] <michaedw> communication is overrated
[01:15:06] <Honoome> mru: android didn't have tethering because iphone didn't in the first place
[01:15:11] <mru> and despite that, they're still the cheapest of the uk operators
[01:15:25] <Honoome> and telcos thought they could fetch more money with hsdpa datacards than phones
[01:15:25] <Dark_Shikari> everyone else raised prices too?
[01:15:38] <Dark_Shikari> also, mru
[01:15:39] <Dark_Shikari> >UK
[01:15:41] <Dark_Shikari> there's your problem
[01:15:49] <mru> all eu is the same
[01:16:28] <Honoome> yeah they increased prices in italy as well
[01:16:30] <Dark_Shikari> I thought UK internet was shit?
[01:16:42] <Dark_Shikari> vs say sweden
[01:16:51] <Dark_Shikari> or for that matter, denmark
[01:16:52] <mru> I'm talking about mobile data
[01:16:57] <verb3k_> vs netherlands
[01:17:02] <Dark_Shikari> or netherlands, yeah
[01:17:11] <michaedw> I expect that android didn't have tethering initially because of the unholy mix of Qualcomm, HTC, T-Mobile, Android, and normal Google engineering staff involved in getting it off the ground
[01:17:14] <Honoome> 3ITA used to have a very good price (€80/mo, 20GB data)
[01:17:14] <mru> specifically when roaming
[01:17:18] * j0sh_ is waiting for wimax in his area...
[01:17:43] <Honoome> now they moved to same price, 2GB data
[01:17:49] <Honoome> or €150 and 20GB
[01:17:57] <mru> hsdpa works great here
[01:18:09] <mru> and I get a good flatrate within the uk
[01:18:28] <mru> as soon as I cross a border they charge £1/MB
[01:18:38] <mru> even in their own networks
[01:19:13] <kierank> yes that is a joke
[01:19:14] <Honoome> wow... at least 3 is still "same price if under another 3 network"
[01:19:17] <michaedw> and because of the odd way that T-Mobile handled the gateway/proxy pool at launch time
[01:19:18] <kierank> when you move from orange uk to orange fr
[01:19:31] <Honoome> but I get an even higher rate if I connect, say, in Switzerland
[01:19:54] <mru> they jacked up the prices now because the eu introduced a cap on call prices
[01:20:06] <mru> so they shifted it to data instead
[01:20:17] * mru curses eu for ruining everything
[01:20:21] <Honoome> on the other hand, I spent a grand total of €50 when I was at FOSDEM... and I used Google Maps extensively
[01:20:30] <Honoome> mru: that's _so_ british of you ;)
[01:20:32] <mru> it was bearable before the cap
[01:20:38] <mru> now I can't afford it
[01:21:04] * Honoome gets slack ... brr
[01:21:14] <kierank> Honoome: the relationship is love-hate
[01:21:20] <Honoome> kierank: with what?
[01:21:22] <kierank> eu
[01:21:27] <mru> why oh why didn't they cap data as well?
[01:21:32] <michaedw> I only half-like my CrackBerry, but around here Verizon has the best coverage (and the only tolerable customer service)
[01:21:50] <Dark_Shikari> that's the difference between verizon and AT&T
[01:21:53] <Dark_Shikari> AT&T is incompetent evil
[01:21:56] <Dark_Shikari> verizon is highly competent evil
[01:21:57] <mru> I trolled the vodafone shop in town the other day
[01:22:01] <Dark_Shikari> both are incredibly evil, but verizon is at least good at it
[01:22:04] <mru> they denied all knowledge of a 5x price hike
[01:22:12] <Dark_Shikari> mru: show them your bill?
[01:22:15] <kierank> Dark_Shikari: were bell labs evil?
[01:22:20] <kierank> in your book
[01:22:24] <michaedw> I've seen more competent evil than Verizon :-)
[01:22:26] <Dark_Shikari> kierank: "bell labs"?
[01:22:29] <mru> I pointed them to their own web page
[01:22:37] <Dark_Shikari> Bell was the company
[01:22:40] <michaedw> but they give value for money, at least in my experience
[01:22:45] <Dark_Shikari> And yes, back when it was "ma bell", they were rather evil
[01:22:59] <Honoome> kierank: I live in IT... EU doesn't look _too_ bad after
[01:22:59] <Dark_Shikari> Verizon is evil because they are even more lock-you-down-and-screw-you than AT&T
[01:23:09] <Dark_Shikari> It's just that they actually have good coverage.
[01:23:12] <michaedw> they're fairly evil on the regulatory front; all the baby bells are
[01:23:17] <Dark_Shikari> Whereas AT&T is similar, but with awful coverage.
[01:23:25] <Dark_Shikari> woohoo, vp8 h4 sse2 written.
[01:23:37] <michaedw> ask anyone who worked for, or supplied gear to, a CLEC
[01:24:19] <michaedw> I seem to recall that Verizon has perfectly good pay-as-you-go options too
[01:24:53] <mru> everything phone and net related seems to be total shit in the US
[01:24:58] <peloverde> I get shitty reception in bars (with beer, not units of signal strength) with verizon
[01:25:22] <peloverde> (or when I had verizon)
[01:25:30] <michaedw> Verizon coverage is good enough that I ditched my land line
[01:25:46] <michaedw> bye-bye, godawful SBC customer service
[01:25:50] <michaedw> (it was a while ago)
[01:26:03] <mru> land line is the only viable option for international calls here
[01:26:13] <mru> but you guys don't know what international means
[01:26:15] <Dark_Shikari> s/here/anywhere
[01:26:16] <michaedw> peloverde: that's probably CDMA vs. GSM
[01:26:40] <michaedw> why would you call internationally through the phone system?
[01:26:41] <Honoome> mru: I have international calls in zone 1 at local rates :P
[01:27:11] <mru> and what's zone 1? italy and sicily?
[01:27:20] <michaedw> we call in-laws in Russia that way sometimes, out of sheer laziness
[01:27:21] <Honoome> mru: europe and north america
[01:27:21] <kierank> O2 telephone is quite good
[01:27:26] <kierank> unlimited calls to europe and us
[01:27:34] <michaedw> but mostly we use Skype
[01:27:41] <mru> kierank: land line?
[01:27:45] <kierank> yes landline
[01:27:56] <michaedw> yes, evil, but functional
[01:27:58] <ohsix> eh, you shouldn't have to do anything like that, your distro already does it for you (good ones, anyways)
[01:28:27] <michaedw> sadly, Skype is still far, far more robust over crappy networks than any alternative I've used
[01:28:43] <michaedw> although I fully expect that's temporary
[01:28:44] <mru> ohsix: uh what?
[01:29:02] <michaedw> and have every intention of contributing to making it so
[01:29:04] <Honoome> call internationally I guess
[01:29:34] <kierank> skype video still sucks
[01:29:38] <michaedw> speaking of which, that DP recoder ...
[01:29:48] <mru> video calls suck, period
[01:29:59] <ohsix> misfire
[01:29:59] <michaedw> mru: why?
[01:30:14] <kierank> blocky
[01:30:17] <mru> I simply see _no_ use for them
[01:30:23] <michaedw> they seem to engage my kids much more fully than voice alone
[01:30:33] <mru> kids these days...
[01:30:37] <Dark_Shikari> what are kids
[01:30:49] <Honoome> ohsix: let me guess.. #pulseaudio?
[01:30:54] <mru> Dark_Shikari: annoying, whiny little bastards
[01:30:57] <michaedw> they enjoy the granddad-cam and the cousin-cam
[01:31:16] <Dark_Shikari> mru: no, a miserable little pile of secrets
[01:31:19] <Dark_Shikari> enough talk, have at you
[01:31:33] <michaedw> Dark_Shikari: future sources of good schadenfreude, when they have kids of their own
[01:31:33] <ohsix> Honoome: ya he's going through some ancient wiki page to mess with stuff when his driver's timing is busted
[01:31:51] <Dark_Shikari> michaedw: haha
[01:32:05] <mru> is there such a thing as an up to date wiki?
[01:32:29] <michaedw> which is why I go to some trouble to hook mine up with their granddad; he earned his schadenfreude
[01:32:41] <Honoome> 200kb/s to fetch slackware... I guess I'll kernel-hack another day
[01:32:51] <ohsix> mru: definitely not; some pages should be nuked from orbit hours after they are written :P
[01:33:15] <mru> s/after/before/
[01:33:21] <kierank> Honoome: use another mirror...
[01:33:32] <ohsix> but when you're talking "linux" "help", with adhoc nonsolutions and bodges in the first place; people gravitate to them like they contain real information
[01:33:35] <kierank> the belgian ones are pretty fast
[01:33:37] <mru> mirror, mirror on the wall...
[01:33:57] <Honoome> kierank: torrent, after trying both italian mirrors, then another one at random, and all three failed, I've decided to give the torrents a try :/
[01:34:00] <mru> ohsix: could be worse, could be webfourms
[01:34:14] <Honoome> mru: oh god nooooo
[01:34:37] <ohsix> yea, heh; i have a personal beef with forums, front line wall of noise defense for just about anything you'll find them attached to
[01:34:50] <ohsix> but Live! with lots of people that pile in with no real intention to help
[01:34:57] <mru> where on page 7, someone discovers that the fix actually causes catastrophic damage in some subtle way
[01:35:07] <ohsix> huhu
[01:35:14] <ohsix> reverse topposting
[01:35:47] <Honoome> rotfl
[01:36:07] <mru> "I had that problem too.  Just disable unrelated $foo and it'll magically work"
[01:36:40] <mru> "I tried that and now $foo doesn't work either plz hlp"
[01:36:56] <Dark_Shikari> "please send me teh codes"
[01:37:08] <michaedw> back to an earlier question: does the fact that intra_gb_ptr is only used in ff_h264_decode_mb_cavlc() mean that ffmpeg's h.264 decoder doesn't handle the combination of CABAC and data partitioning?
[01:37:16] <ohsix> i deleted some stuff in /lib and it worked
[01:37:17] <Dark_Shikari> michaedw: ffmpeg doesn't do data partitioning
[01:37:22] <Dark_Shikari> period
[01:37:31] <Dark_Shikari> ffmpeg doesn't do anything in baseline that isn't in main
[01:37:39] <ohsix> even with something like apt and dpkg-divert in play; people are still butchering stuff
[01:38:07] <michaedw> it handles the DP nal_unit_types
[01:38:39] <Dark_Shikari> BBB: you have vararrays in your dsp code
[01:38:43] <Dark_Shikari> oh, he's not here.
[01:38:45] <mru> nooooooooooooooo
[01:38:57] <mru> I'm going to kill them all and make it an error
[01:39:06] <Dark_Shikari> uint8_t tmp_arr[stride * (height + TAPNUMY - 1)]
[01:39:12] <Dark_Shikari> fortunately that's easy to fix
[01:39:15] <Dark_Shikari> set it to 16
[01:39:19] <mru> set it to max
[01:39:37] <mru> there's no reason to ever allocate an array on stack smaller than max
[01:39:43] <mru> since you have to cope with max anyhow
[01:40:07] <Dark_Shikari> Actually, in this case, max for width4 is 4
[01:40:09] <Dark_Shikari> max for width8 is 16
[01:40:11] <Dark_Shikari> max for width16 is 16
[01:40:51] <mru> 16 is perfectly acceptable to allocate on stack
[01:40:59] <mru> unconditionally
[01:41:21] <Dark_Shikari> w8/w16/w4 are separate functions
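The fix being discussed amounts to replacing the variable-length array with an unconditional worst case; TAPNUMY and the original declaration are from the snippet above, the 16x16 bound is the one mentioned here, and the rest is a hypothetical sketch rather than the eventual patch:

    /* before: a VLA, sized from the runtime stride/height */
    uint8_t tmp_arr[stride * (height + TAPNUMY - 1)];

    /* after: allocate the worst case on the stack unconditionally; for
     * these w4/w8/w16 helpers the temp stride and height are at most 16 */
    uint8_t tmp_arr[16 * (16 + TAPNUMY - 1)];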
[01:47:02] <michaedw> it looks like the code that manipulates intra_gb_ptr has been there a long, long time
[01:48:40] <Honoome> okay I guess I'll leave this be and read something (if anybody is looking for book suggestions, Jim Butcher's Dresden Files are quite good)
[01:49:30] <michaedw> I'm guessing that michael is the only one who has state on that code
[02:00:11] <mru> Dark_Shikari: why don't you fix those vlas?
[02:01:17] <Dark_Shikari> I did
[02:01:28] <Dark_Shikari> Oh wait
[02:01:30] <Dark_Shikari> Shit, you're right.
[02:01:32] <Dark_Shikari> >stride
[02:01:33] <Dark_Shikari> WHAT THE FUCK
[02:01:33] <Dark_Shikari> WHAT
[02:01:36] <Dark_Shikari> THE
[02:01:38] <Dark_Shikari> FUCK
[02:01:41] <Dark_Shikari> WHAAAAAAAAAAAAAT
[02:02:15] <mru> defuck it please
[02:03:19] <Dark_Shikari> I will
[02:03:20] <Dark_Shikari> one moment
[02:03:29] <mru> thanks
[02:04:13] <Dark_Shikari> actually this is a serious problem with the asm
[02:04:17] <Dark_Shikari> it assumes output stride == input stride
[02:04:26] <mru> ungood
[02:04:29] <Dark_Shikari> this will take a significant amount of defucking
[02:04:32] <mru> if it needs temp buffers
[02:04:44] <Dark_Shikari> I will make BBB do this
[02:04:48] <Dark_Shikari> because he fucked it up
[02:04:56] <mru> I'll troll him as soon as I see him
[02:05:44] <CIA-99> ffmpeg: michael * r23732 /trunk/libavformat/asfdec.c:
[02:05:44] <CIA-99> ffmpeg: Continue after guids in asf after which other guids are possible instead of skiping
[02:05:44] <CIA-99> ffmpeg: over the stored size.
[02:05:44] <CIA-99> ffmpeg: Fixes issue2029
[02:15:08] <michaedw> whom would I ask about the hardware accelerator framework?
[02:15:32] <mru> the lhc guys :-)
[02:15:53] <michaedw> mru: <g>
[02:16:06] <mru> they can accelerate a macroblock to 99.9% of lightspeed
[02:16:14] <Dark_Shikari> and then crash it into a reference frame
[02:16:50] <mru> sometimes they dump core
[02:16:52] <mru> for real
[02:16:52] <michaedw> Step 3: ?   Step 4:  PROFIT!!!
[02:17:23] <michaedw> sorry, that's my Silicon Valley showing }:->
[02:17:27] <kierank> they want to find the higgs field but they don't know if it's tff or bff
[02:17:44] <CIA-99> ffmpeg: mru * r23733 /trunk/configure: Enable pthreads automatically unless w32threads is requested
[02:17:46] <Dark_Shikari> mru: I _hate_ x86 simd prior to ssse3.  like, holy shit
[02:17:47] <Dark_Shikari> http://pastebin.org/353192
[02:17:50] <michaedw> Higgs and the LHC guys are definitely BFF
[02:17:54] <mru> field order == spin?
[02:17:57] <Dark_Shikari> those are two loop cores from a 4-tap horizontal MC function
[02:18:01] <Dark_Shikari> THEY DO THE SAME THING
[02:18:07] <Dark_Shikari> one is with sse2, one is with ssse3
[02:18:25] <Dark_Shikari> the mere existence of an arbitrary shuffle + byte multiply eliminates 2/3 of the code
[02:18:40] <mru> nice
[02:18:48] <michaedw> very nice
[02:19:12] <mru> not that I understand a single line of it
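As a rough illustration of the pattern being described, pshufb to gather the pixel pairs and pmaddubsw to multiply-accumulate them, here is a hedged intrinsics sketch of a 4-tap horizontal filter over 8 pixels; the coefficients, names and rounding are placeholders, not the code from the paste:

    #include <stdint.h>
    #include <tmmintrin.h>   /* SSSE3: pshufb (_mm_shuffle_epi8), pmaddubsw (_mm_maddubs_epi16) */

    /* 4-tap horizontal filter over 8 pixels of one row; taps are assumed to
     * sum to 128, so the result is (sum + 64) >> 7, clamped by the pack. */
    static void h4_filter_ssse3(uint8_t *dst, const uint8_t *src, const int8_t c[4])
    {
        const __m128i shuf01 = _mm_setr_epi8(0,1, 1,2, 2,3, 3,4, 4,5, 5,6, 6,7, 7,8);
        const __m128i shuf23 = _mm_setr_epi8(2,3, 3,4, 4,5, 5,6, 6,7, 7,8, 8,9, 9,10);
        const __m128i k01 = _mm_set1_epi16((int16_t)(((uint8_t)c[1] << 8) | (uint8_t)c[0]));
        const __m128i k23 = _mm_set1_epi16((int16_t)(((uint8_t)c[3] << 8) | (uint8_t)c[2]));

        __m128i px = _mm_loadu_si128((const __m128i *)(src - 1));          /* pixels s[-1..14]    */
        __m128i a  = _mm_maddubs_epi16(_mm_shuffle_epi8(px, shuf01), k01); /* c0*p[i-1] + c1*p[i] */
        __m128i b  = _mm_maddubs_epi16(_mm_shuffle_epi8(px, shuf23), k23); /* c2*p[i+1] + c3*p[i+2] */
        __m128i s  = _mm_adds_epi16(_mm_adds_epi16(a, b), _mm_set1_epi16(64));
        s = _mm_srai_epi16(s, 7);                                          /* round and shift     */
        _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(s, s));          /* clamp, store 8 px   */
    }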
[02:19:20] <michaedw> I wonder how much difference there is at the microcode level
[02:19:25] <Dark_Shikari> michaedw: huge
[02:19:28] <Dark_Shikari> most of these ops are one uop
[02:19:35] <Dark_Shikari> most modern x86 simd is not cisc
[02:19:44] <Dark_Shikari> where "cisc" is defined in this case as "multiple internal uops for one instruction"
[02:19:57] <Dark_Shikari> that generally is only done for emulation of harder simd ops on shitty chips, like atom.
[02:19:58] <mru> the proper term for that is "microcoded"
[02:20:08] <Dark_Shikari> true.
[02:20:29] <mru> a risc ISA _could_ be microcoded
[02:20:31] <Dark_Shikari> equally, they're all one-inverse-throughput-per-execution-unit
[02:20:34] <mru> no sane person would do that though
[02:20:43] <michaedw> I don't mean uops so much as microcode size
[02:20:45] <Dark_Shikari> i.e. each execution unit can do one of these ops per cycle
[02:21:40] <Dark_Shikari> whenever someone is designing an simd instruction set, they need to look at stuff like this
[02:21:48] <mru> but single uops aren't necessarily single-cycle
[02:21:52] <Dark_Shikari> and think "how can we not massively cripple people writing complex functions?"
[02:22:00] <Dark_Shikari> er, s/complex/common
[02:22:47] <ohsix> won't need to buy new stuff if they give you all the goods at once
[02:22:48] <michaedw> trace cache footprint
[02:23:26] <mru> Dark_Shikari: that's what they did when they designed the neon instruction set
[02:25:28] <mru> Dark_Shikari: btw, did you see that commit I just did?
[02:25:42] <Dark_Shikari> which
[02:25:47] <mru> auto-pthreads
[02:26:24] <Dark_Shikari> Wait, it's in?
[02:26:32] <mru> scroll up
[02:27:13] <Dark_Shikari> _awesome_
[02:27:17] <Dark_Shikari> wait, we still support w32threads?
[02:27:26] <mru> apparently
[02:27:38] <mru> I met some resistance trying to kill it
[02:28:32] <Dark_Shikari> who the hell cares?
[02:28:57] <mru> ramiro
[02:29:21] <Dark_Shikari> o.0 he cares?
[02:29:31] <mru> I can't imagine anyone else does
[02:31:02] <michaedw> my knowledge in this area is rather stale, but it would be interesting to see those loop internals translated into micro-ops
[02:31:24] <mru> on a modern x86, pretty much as written
[02:32:19] <michaedw> looks like it.  probably similar on a P-M core, too.  Probably only funky on a P4.
[02:32:33] <mru> and atom
[02:35:08] <michaedw> atom has fewer decode units, but still has micro-ops more like P-M/Core than like P4, I think
[02:35:40] <mru> at least they kept the barrel shifter
[02:35:47] * mru glares at cell
[02:38:23] <michaedw> I would have thought the mova's in the SSE2 version were pretty nearly free
[02:38:46] <mru> what does it do?
[02:39:32] <Dark_Shikari> a mov between registers uses an execution unit just like everything else
[02:40:07] <michaedw> won't they get bonded to the subsequent shift instructions?
[02:40:23] <Dark_Shikari> This is x86, not a magical genie in a cpu.
[02:40:35] <Dark_Shikari> x86 doesn't merge movs and instructions yet.
[02:40:36] <michaedw> I thought that was one of the "micro-op fusion" use cases
[02:40:39] <Dark_Shikari> no
[02:40:42] <Dark_Shikari> cmp/jump is
[02:40:45] <mru> Dark_Shikari: am I dreaming or did you say mov between xmm regs was stupidly slow on some cpu?
[02:40:46] <Dark_Shikari> don't know of any others
[02:40:49] <Dark_Shikari> mru: pentium 4
[02:40:55] <Dark_Shikari> 6 cycles for a mov between mmx/xmm registers
[02:41:00] <Dark_Shikari> 8 cycles for a load from L1
[02:41:07] <mru> lol
[02:41:09] <Dark_Shikari> I wish I was making this up.jpg
[02:41:34] <mru> uop fusion is a paper-only feature
[02:42:39] <mru> if you pay attention, you'll notice that the marketing drivel doesn't say how well anything works
[02:42:42] <mru> only that they have it
[02:43:01] <Dark_Shikari> and gcc loves to reorder in ways that make it impossible
[02:43:02] <michaedw> I think you're thinking of macro-op fusion
[02:47:13] <michaedw> I would have thought that micro-op fusion would work the same whether the micro-ops originally came from the same SSE3 instruction or from nearby SSE2 instructions
[02:47:44] <michaedw> at least that's the way that I would have designed it :-)
[02:47:49] <michaedw> for exactly this reason
[02:48:11] <mru> assuming you _could_
[02:48:23] <michaedw> so that the SSE2 version would execute, in practice, as fast on the newer core as the logically equivalent SSE3 version
[02:48:37] <michaedw> register names are just labels in the pipeline anyway
[02:49:23] <michaedw> depends on your instruction scheduling flexibility, of course
[02:49:41] <mru> if it were that simple, why bother adding sse3?
[02:49:43] <Dark_Shikari> ummmmm
[02:49:45] <Dark_Shikari> but it's not logically equivalent
[02:49:46] <Dark_Shikari> ...
[02:49:50] <Dark_Shikari> they're completely different algorithms
[02:49:54] <Dark_Shikari> to solve the same problem
[02:50:05] <Dark_Shikari> it's just that the latter, much simpler one, is made possible by having better instructions available
[02:50:47] <michaedw> sse3 is more of a guide to compiler writers than anything else
[02:50:53] <Dark_Shikari> ssse3 is not sse3
[02:51:14] <mru> will there be an sssse4?
[02:51:40] <Dark_Shikari> no, we're on to sse5 now.
[02:51:44] <Dark_Shikari> and avx
[02:51:47] <michaedw> Dark_Shikari: ah, that's quite right
[02:51:53] <Dark_Shikari> avx == sse 2
[02:51:55] <Dark_Shikari> not sse2, but SSE 2
[02:52:02] <Dark_Shikari> i.e. repeating all the same mistakes of the original SSE
[02:52:25] <mru> avx?
[02:52:25] <michaedw> saturation is helpful
[02:52:36] <Dark_Shikari> mru: 256-bit vectors
[02:53:06] <mru> could occasionally be useful
[02:53:10] <Dark_Shikari> mru: ... float only
[02:53:15] <mru> aaaaiieee
[02:53:18] <Dark_Shikari> Exactly.
[02:53:27] <Dark_Shikari> and it's three-operand.
[02:53:32] <michaedw> that's got to be for FP textures
[02:53:40] <mru> 3-operand is good
[02:53:45] <Dark_Shikari> yup it is
[02:53:50] <Dark_Shikari> they originally announced it as the logical extension of SSE
[02:53:56] <Dark_Shikari> supporting integer and float etc, all the normal sse instructions
[02:53:57] <Dark_Shikari> and then they said
[02:54:03] <Dark_Shikari> "oh, integer is only for the low 128 bits"
[02:54:07] <Dark_Shikari> *headdesk*
[02:54:36] <michaedw> that's not so unreasonable; if you have limited silicon to spend, spend it on something that hasn't already been optimized heavily
[02:54:54] <michaedw> do they have half-float load/store?
[02:55:24] <Dark_Shikari> float "hasn't already been optimized heavily"?
[02:55:37] <Dark_Shikari> "limited silicon" --> how about spend it on things which are far cheaper than float, like integer?
[02:55:46] <mru> with that argument, you should never do anything new
[02:56:02] <michaedw> very low-precision float, for high-dynamic-range texture storage
[02:56:13] <mru> I know what half-float is
[02:56:18] <mru> cortex-a9 supports it
[02:56:31] <michaedw> a la neon fp16
[02:56:37] <michaedw> yep
[02:56:47] <Dark_Shikari> half-float: all the low precision of integers, all the speed of floats
[02:57:20] <michaedw> an elegant solution to certain 3-D rendering problems
[02:57:34] <michaedw> very special-purpose.  happens to be a special purpose that sells chips.
[02:57:36] <mru> an elegant way to shoot yourself in the foot
[02:57:51] <Dark_Shikari> "sells chips"
[02:58:10] <Dark_Shikari> intel loves releasing totally useless instructions that have zero practical application
[02:58:14] <Dark_Shikari> *cough* mpsdabw
[02:58:14] <Dark_Shikari> *mpsadbw
[02:58:20] <michaedw> when memory bandwidth is the only constrained resource, slimming memory bandwidth helps
[02:59:48] <michaedw> makes it possible to keep all your textures in main memory and load/store them with cache-bypassing instructions
[03:00:09] <michaedw> without eating your entire mobile DDR bandwidth
[03:00:23] <mru> so let me get this straight...
[03:00:46] <mru> in the near future, we're supposed to do graphics on the cpu and everything else on the gpu?
[03:01:05] <Dark_Shikari> lol
[03:01:23] <michaedw> it's largely irrelevant if you have separate texture memory
[03:01:40] <michaedw> aimed at mobile, mostly
[03:01:43] <saintdev> mru: lmao
[03:01:45] <michaedw> AIUI
[03:02:02] <michaedw> mru: <g>
[03:02:56] * mru waits for the CPCPU
[03:03:22] <michaedw> thanks for conversation and help.  I'll go through roundup and annotate bug reports that appear to boil down to missing/non-functioning h264 parser (like I hit)
[03:03:34] <mru> I doubt there are many
[03:06:00] <saintdev> mru: CP = ?
[03:06:21] <mru> no fucking clue
[03:08:38] <Dark_Shikari> ........... fuck the vp8 interpolation filter coefficients
[03:08:40] <Dark_Shikari> fuck them
[03:08:41] <Dark_Shikari> long and hard
[03:08:55] <Dark_Shikari> I think it's impossible to do the 6-tap filter completely with pmaddubsw
[03:08:56] <mru> what's the problem?
[03:09:11] <mru> what are the coeffs?
[03:09:24] <Dark_Shikari> 2,-11,108,36,-8,1
[03:09:31] <Dark_Shikari> 3,-16,77,77,-16,3
[03:09:37] <michaedw> I suspect this, for instance, is related: http://lists.mplayerhq.hu/pipermail/ffmpeg-user/2010-April/024956.html
[03:09:38] <Dark_Shikari> 1,-8,36,108,-11,2
[03:10:03] <Dark_Shikari> the problem is 108 * X + 36 * Y can saturate a 16-bit signed word
[03:10:21] <Dark_Shikari> But the -11 * Z + -8 * W could reduce it below saturation again.
[03:10:40] <mru> unless those pixels are zero
[03:10:50] <Dark_Shikari> well we're just talking worst-case here.
[03:10:55] <mru> hmm, does vp8 use full or reduced yuv range?
[03:11:02] <Dark_Shikari> no signalling, it supports both
[03:11:10] <Dark_Shikari> you could have X and Y == 255
[03:11:15] <Dark_Shikari> for example
[03:11:22] <Dark_Shikari> or whatever values are necessary to break things
[03:11:54] <mru> why so huge coeffs?
[03:12:07] <Dark_Shikari> No idea.
[03:12:08] <Dark_Shikari> Stupidity
[03:12:12] <Dark_Shikari> I just figured out how to fix this though
[03:12:20] <mru> this for qpel?
[03:12:25] <Dark_Shikari> yes
[03:12:28] <Dark_Shikari> pmaddubsw does two coeffs at a time
[03:12:29] <Dark_Shikari> so if I do
[03:12:43] <michaedw> subtract the middle row from the other two, add them back together after collapsing
[03:12:57] <Dark_Shikari> (-8 * X1 + 36 * X2) + (108 * X3 -11 * X4)
[03:13:02] <Dark_Shikari> That avoids the saturation possibility
[03:13:08] <Dark_Shikari> i.e. different parenthesis grouping
[03:13:50] <mru> what if X1 and X4 are zero?
[03:13:58] <mru> and X2 and X3 huge
[03:14:27] <Dark_Shikari> no problem
[03:14:35] <Dark_Shikari> the point is we don't want it to saturate in one direction
[03:14:38] <Dark_Shikari> and then need to be _dragged down again_
[03:14:40] <Dark_Shikari> then it breaks
[03:14:45] <Dark_Shikari> so as long as saturation is permanent, we're good
[03:14:52] <mru> ah right
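Working the worst case through with the first set of coefficients above (pixels in [0,255], taps summing to 128, so the final step is (+64)>>7 followed by a clamp to [0,255]) -- a hand check of the regrouping, not decoder code:

    pairing 108 with 36:  (108 + 36) * 255 = 36720 > 32767, so that pmaddubsw clamps
                          before the -11 and -8 taps get a chance to pull the sum back down
    regrouped pairs:      -8*a + 36*b   stays within [-2040,  9180]  (never clamps)
                          108*c - 11*d  stays within [-2805, 27540]  (never clamps)

    The saturating add of the two regrouped pairs can still clamp at 32767, but only when the
    true sum already exceeds 32767, i.e. comes out above 255 after the (+64)>>7 -- and that
    case is clamped to 255 by the final pack anyway, so the saturation never has to be undone.
    That is the "saturation is permanent" condition.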
[03:15:08] <Dark_Shikari> ugh, now I have to reorder all my coeffs again
[03:15:21] <mru> you're saturating immediately
[03:15:30] <mru> that's odd
[03:15:31] <Dark_Shikari> pmaddubsw saturates in its output
[03:15:39] <Dark_Shikari> I can use it to do two out of six coeffs
[03:15:41] <Dark_Shikari> and saturate the output of that
[03:15:46] <michaedw> you can use the 128-bit-wide variant to do the whole row
[03:15:58] <Dark_Shikari> Huh?
[03:16:02] <Dark_Shikari> pmaddubsw is the 128-bit-wide variant.
[03:16:12] <mru> where do you add the residual?
[03:16:17] <Dark_Shikari> this is just MC
[03:16:19] <michaedw> if you subtract the middle row from the other two
[03:16:42] <michaedw> and add the result back
[03:16:52] <mru> and I thought h264 qpel was hairy...
[03:17:27] <michaedw> then none of the intermediate results saturates
[03:17:31] <Dark_Shikari> h264 qpel is hairier.
[03:17:41] <mru> two-stage
[03:18:08] <mru> this is more like the h264 chroma mc
[03:18:31] <Dark_Shikari> yes, it's one-stage
[03:18:35] <Dark_Shikari> And i got it.
[03:18:37] <Dark_Shikari> "results identical"
[03:18:38] <Dark_Shikari> ssse3 done.
[03:18:59] <Dark_Shikari> er, h-filter done.
[03:19:04] <Dark_Shikari> v-filter will be... its own problem.
[03:19:16] <Dark_Shikari> oh, and we'll have to bug BBB to fix his stride (lol)
[03:20:02] <drv> nobody gonna break my stride
[03:23:23] <Dark_Shikari> afk.  mru, hit BBB if he comes on.
[03:26:16] <michaedw> you could also do four adjacent pixels in 9 ops, accumulate using paddw
[03:26:42] <michaedw> 9 pmaddubsw ops, that is
[03:31:53] <michaedw> you'd need to rebase the coefficients to prevent overflows during paddw
[03:32:41] <michaedw> no, you wouldn't; the totals fit in unsigned words
[03:37:27] <Dark_Shikari> um, 9?
[03:37:29] <Dark_Shikari> for 4 pixels??
[03:37:56] <Dark_Shikari> I'm using 3 for 8 pixels
[03:38:03] <michaedw> actually, I think what you want is to load 8 pixels and use this operand 9 times to calculate 3 filter results
[03:38:27] <michaedw> shifting the coeffs in between
[03:39:23] <michaedw> and for that you need coeffs that won't saturate in pairs
[03:39:46] <michaedw> because the pairing is different after shifting the coeffs left one byte
[03:40:41] <michaedw> shifting coeffs instead of pixels, you don't stall
[03:42:36] <michaedw> and you can run several batches of pixels through in parallel, shifted by 3 bytes at a time
[03:43:31] <Dark_Shikari> that's way too complicated
[03:43:49] <Dark_Shikari> I can do a 6-tap filter of an 8x8 block in 18 multiplies
[03:43:53] <Dark_Shikari> er, 24 multiplies.
[03:44:23] <Dark_Shikari> one multiply per 2.7 output pixels
[03:44:25] <michaedw> with what edge conditions?
[03:44:38] <Dark_Shikari> none.  emulated_edge_mc handles those.
[03:45:17] <michaedw> at what cost?
[03:45:23] <Dark_Shikari> effectively zero
[03:45:30] <Dark_Shikari> it costs nothing except for blocks with edge conditions
[03:45:41] <Dark_Shikari> which are O(sqrt(N)) of the total mbs
[03:46:15] <michaedw> right, but what edge conditions on the 8x8 block itself?
[03:46:23] <Dark_Shikari> huh?
[03:46:33] <Dark_Shikari> if motion compensation references pixels outside of the frame
[03:46:40] <Dark_Shikari> those pixels shall be equal to the closest pixel inside the frame.
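That rule is plain edge replication; a minimal standalone sketch of the concept, not ffmpeg's emulated_edge_mc itself:

    #include <stdint.h>

    static inline int clampi(int v, int lo, int hi)
    {
        return v < lo ? lo : v > hi ? hi : v;
    }

    /* fetch a reference pixel, replicating the nearest in-frame pixel
     * when (x, y) falls outside the frame */
    static inline uint8_t ref_pixel(const uint8_t *frame, int stride,
                                    int width, int height, int x, int y)
    {
        return frame[clampi(y, 0, height - 1) * stride + clampi(x, 0, width - 1)];
    }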
[03:47:54] <michaedw> right, so you need to pad outside the block with edge pixels to compute the filter, right?
[03:48:03] <Dark_Shikari> yes
[03:49:39] <michaedw> I would think it would be cheapest to do that when loading the first and last batches of pixels in the row
[03:50:07] <Dark_Shikari> um... it's only done in like 1% of blocks
[03:50:11] <Dark_Shikari> it's not worth optimizing
[03:51:14] <michaedw> hmm, we're talking past one another.  I'm not talking about edge blocks.
[03:52:11] <michaedw> I'm talking about the contribution of pixels past the edge of the 8x8 block to the filter outputs near the edge of the block.
[03:52:44] <Dark_Shikari> why does that matter?
[03:52:48] <Dark_Shikari> movdqu, [src-2]
[03:52:54] <Dark_Shikari> bam, now you have all the pixels you need to do your filter.
[03:53:06] <Dark_Shikari> there is no "boundary" you have to obey
[03:53:08] <Dark_Shikari> just load past it
[03:54:08] <michaedw> great.  now you run your three sets of filter coeffs over it, shift them, run them over it, shift them, run them over it.  right?
[03:54:43] <Dark_Shikari> shift, what?
[03:55:00] <Dark_Shikari> http://pastebin.com/ZWd1pzRK
[03:55:03] <Dark_Shikari> there's the code
[03:55:07] <Dark_Shikari> "shift, what shift?"
[03:56:37] <michaedw> the pshufb's seem expensive
[03:56:50] <Dark_Shikari> 1/0.5 (latency / inverse throughput)
[03:56:53] <Dark_Shikari> i.e. two per cycle, 1 latency to finish
[03:59:36] <michaedw> seems like your multipliers are going to spend a lot of time stalled
[04:00:06] <Dark_Shikari> why?
[04:00:13] <Dark_Shikari> they take 3 cycles and you can issue one per cycle
[04:00:23] <Dark_Shikari> as each pshufb finishes, the multiply begins
[04:00:30] <Dark_Shikari> as each multiply finishes, an add begins
[04:00:42] <michaedw> the loop is bottlenecked on m0
[04:01:02] <Dark_Shikari> Not quite.
[04:01:14] <Dark_Shikari> The CPU can start executing the top of the loop before the bottom is finished.
[04:01:18] <Dark_Shikari> Isn't x86 great?
[04:02:01] <michaedw> core2 pipeline is 14 stages deep
[04:02:13] <Dark_Shikari> the reorder buffer is something like 50+ instructions
[04:02:45] <michaedw> which would help, if m0 didn't get reused for every batch of pixels
[04:02:51] <michaedw> with an unaligned load
[04:03:35] <michaedw> when you could calculate 3 pixels per load by shifting the filters
[04:03:38] <Dark_Shikari> doesn't matter
[04:03:43] <Dark_Shikari> the cpu can track dependencies.
[04:03:58] <Dark_Shikari> it knows the m0 at the top of iteration N+1 is unrelated to the m0 at the bottom from iteration N
[04:09:39] <michaedw> Dark_Shikari: you're probably right there; thinko from RISC habits
[04:10:33] <michaedw> how expensive is the unaligned load?
[04:13:17] <michaedw> and now that I look at it, I see that the shuffle indices result in register starvation
[04:14:06] <michaedw> probably doesn't matter, the loads in the loop can be scheduled early enough
[04:14:42] <michaedw> yeah, my RISC instincts don't serve me well here at all
[04:21:01] <michaedw> oh, yeah, there is a reason to do this in parallel
[04:21:24] <Dark_Shikari> the unaligned load costs two aligned loads on any intel chip
[04:21:29] <Dark_Shikari> except when it crosses a cacheline
[04:21:34] <Dark_Shikari> in which case it costs one L1 cache miss.
[04:21:43] <Dark_Shikari> on an AMD chip, it costs two aligned loads, except on phenom, where it costs one.
[04:22:17] <michaedw> you want to do a second pass interpolating along the other axis
[04:22:37] <Dark_Shikari> that's the V filter.
[04:22:39] <Dark_Shikari> This is the H filter.
[04:22:56] <michaedw> so you want to be accumulating pixels into a rotated matrix
[04:23:04] <Dark_Shikari> Only in HV mode.
[04:23:10] <Dark_Shikari> H mode == just H interpolation
[04:23:14] <Dark_Shikari> V mode == just V interpolation
[04:23:16] <Dark_Shikari> HV == both
[04:23:16] <michaedw> and writing them word-wise
[04:23:19] <michaedw> right
[04:23:27] <Dark_Shikari> in HV mode, there's just a temp array that we write to
[04:23:31] <Dark_Shikari> the core function isn't aware of it
[04:23:32] <Dark_Shikari> a wrapper handles it
[04:23:50] <Dark_Shikari> writing as words might help but is unnecessary, we have to do intermediate clamping in betwe--- wait a minute.
[04:23:55] <Dark_Shikari> WHAT?
[04:23:57] <Dark_Shikari> WHAAAAAAAAAAAT?
[04:24:06] <Dark_Shikari> There's intermediate clamping in the MC?
[04:24:11] <Dark_Shikari> What the fuck are they on
[04:28:06] <michaedw> probably because they implemented it in assembly first, then settled for whatever saturation behavior it gave them
[04:29:06] <michaedw> the spec also shows interpolation in 1/8-pixel intervals
[04:29:55] <michaedw> which you could do with 4 sets of filter coeffs and a byte-reversed copy of the pixels
[04:30:09] <Dark_Shikari> no, that isn't what I mean
[04:30:16] <Dark_Shikari> they explicitly round off after the first pass of the interpolation
[04:30:24] <Dark_Shikari> and then unpack _again_ to get the second pass
[04:30:31] <Dark_Shikari> i.e. they intentionally drop all internal precision between passes
[04:30:31] <michaedw> yep
[04:31:07] <michaedw> probably so they could byte-pack an intermediate array
[04:31:15] <michaedw> rotated :-)
[04:35:08] <michaedw> if it were me, I'd probably do something like that; work in 4-row stripes, write them rotated
[04:35:22] <Dark_Shikari> um, why would you rotate?
[04:35:25] <Dark_Shikari> there's no reason to rotate anything
[04:35:38] <michaedw> minimize store bandwidth
[04:35:55] <michaedw> on most architectures, unaligned stores are read-modify-write
[04:36:24] <Dark_Shikari> but the stores aren't unaligned
[04:36:59] <michaedw> depends how you interleave the 8 interpolated columns that you get from the H pass
[04:37:42] <Dark_Shikari> you just write a V filter and an H filter
[04:37:44] <Dark_Shikari> it's not that hard
[04:37:45] <michaedw> I'd want to store rotated, ready to be loaded and V filtered
[04:37:58] <Dark_Shikari> that's stupid
[04:38:08] <michaedw> because when you do the V filter pass, you're cold-cache
[04:38:09] <Dark_Shikari> H and V are both fast operations
[04:38:28] <michaedw> way cheaper to do the rotation while you've got the data in cache
[04:38:47] <Dark_Shikari> um....
[04:38:51] <Dark_Shikari> it will all be in L1 cache.
[04:38:57] <Dark_Shikari> unless you're on a CPU with 320 bytes of cache
[04:38:58] <Dark_Shikari> or something
[04:39:02] <Dark_Shikari> even then I think you can still fit it all in cache
[04:40:32] <michaedw> the block explodes 8-fold when you do all the interpolations
[04:41:04] <Dark_Shikari> um, no it doesn't
[04:41:13] <Dark_Shikari> you have no idea how motion compensation works
[04:41:51] <michaedw> I have a patent in this area
[04:42:08] <michaedw> an old patent, but still
[04:43:01] <michaedw> I could well be confused about this specific codec, but not because I "have no idea"
[04:47:37] <michaedw> ah, I see where our perspectives differ.  I am refactoring to not stride from row to row.
[04:49:47] <michaedw> I'm thinking in terms of implementing this during the initial streaming pass over the incoming pixels.
[04:50:20] <Dark_Shikari> "initial streaming pass"?
[04:50:28] <Dark_Shikari> We're pointing a motion vector to a reference frame.
[04:50:32] <Dark_Shikari> there is no "initial streaming pass"
[04:51:05] <michaedw> If you already have the frame in memory, and you can afford the hit to your fetch address prediction, sure, stride from row to row
[04:51:57] <Dark_Shikari> in memory?
[04:51:58] <Dark_Shikari> as opposed to what
[04:52:00] <Dark_Shikari> in thin air?
[04:52:11] <Dark_Shikari> "fetch address prediction"?
[04:52:23] <michaedw> and I am of course thinking of encoding, not decoding, so I'm completely on crack
[04:52:47] <Dark_Shikari> on encoding you'd do the same thing
[04:52:59] <Dark_Shikari> Since there's no magic shortcut you can take for an unstaged filter.
[04:53:04] <Dark_Shikari> like you can for say h264.
[04:53:57] <michaedw> I'm thinking motion estimation, and producing subsampled frames for comparison to the reference frame
[04:54:36] <michaedw> for which you really do need the whole range of subpixel shifts
[04:54:59] <Dark_Shikari> no you don't
[04:55:04] <Dark_Shikari> a diamond search is usually enough
[04:55:18] <Dark_Shikari> in qpel, that would end up searching a small (albeit significant) fraction of the positions
[04:55:33] <Dark_Shikari> of course in h264 you just pre-interpolate the hard 6-tap and do linear on the fly.
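A minimal sketch of that kind of diamond refinement, assuming some cost callback such as SAD at a candidate MV; the names and callback are hypothetical, not x264's search code:

    /* move to the best of the four diamond neighbours until the centre wins */
    typedef int (*mv_cost_fn)(void *ctx, int mx, int my);

    static void diamond_search(void *ctx, mv_cost_fn cost, int *mx, int *my)
    {
        static const int dx[4] = { 0, 0, -1, 1 };
        static const int dy[4] = { -1, 1, 0, 0 };
        int best = cost(ctx, *mx, *my);

        for (;;) {
            int bi = -1;
            for (int i = 0; i < 4; i++) {
                int c = cost(ctx, *mx + dx[i], *my + dy[i]);
                if (c < best) { best = c; bi = i; }
            }
            if (bi < 0)
                break;             /* centre is the local minimum */
            *mx += dx[bi];
            *my += dy[bi];
        }
    }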
[04:56:47] <michaedw> depends what your scene looks like; sometimes the local minimum is not the best estimate
[04:57:55] <michaedw> visually, after quantization, if not arithmetically
[04:58:14] <michaedw> hard edges can throw off individual estimates
[04:58:27] <Dark_Shikari> that's what psy metrics are for
[04:58:47] <michaedw> the overall shape of the well is often a better predictor
[04:59:06] <Dark_Shikari> it doesn't matter what the shape of the well is
[04:59:09] <Dark_Shikari> it matters what the result looks like
[05:00:02] <michaedw> right.  and that depends on whether the detail you lose in quantization is the interesting detail or not.
[05:00:51] <Dark_Shikari> Which you can measure.
[05:01:16] <michaedw> the local minimum of the mean-square delta in luma relative to the reference frame is not always the best predictor.
[05:01:29] <Dark_Shikari> Whoever said to use mean-square?
[05:01:41] <michaedw> your choice of statistic
[05:01:43] <Dark_Shikari> SSD is a horrible motion search metric
[05:01:44] <Dark_Shikari> nobody uses it
[05:01:52] <Dark_Shikari> SAD is by far the most common.
[05:01:55] <Dark_Shikari> SATD is a lot better.
[05:02:02] <Dark_Shikari> SADCTD is marginally better but not worth it.
[05:02:04] <Dark_Shikari> RD is better.
[05:02:08] <Dark_Shikari> RD with a psy metric is even better.
[05:03:13] <Dark_Shikari> (by "motion search" I  mean subpel search here.  SAD is sufficient for fullpel in most cases.)
[05:09:02] <michaedw> at the time that I looked closely at the problem, I found that estimating vertical and horizontal separately and then doing a local search using something close to SAD gave the best bang for the buck; but then we couldn't afford a Hadamard in real-time back then
[05:10:16] <Dark_Shikari> heh, must have been a long time ago, or on very limited hardware =p
[05:10:23] <michaedw> (what we used in the field was closer to a sum of clamped absolute differences, with the clamping done with some crude light level dependence)
[05:10:48] <michaedw> designed in 1990
[05:11:13] <Dark_Shikari> 1990?  did mpeg-1 even exist then?
[05:11:15] <Dark_Shikari> or was this h261?
[05:11:18] <michaedw> pity the guy who owned the company was clueless, he could have had a stake in the MPEG pool
[05:11:32] <michaedw> industrial application
[05:12:00] <michaedw> highway survey camera, if you can believe it
[05:12:30] * Dark_Shikari has heard all kinds of great hardware encoder stories
[05:12:35] <Dark_Shikari> like Harmonic's system
[05:12:35] <michaedw> prototyped on the first IndigoVideo board ever to leave the SGI premises
[05:12:37] <Dark_Shikari> they had an MPEG-2 encoder
[05:12:42] <Dark_Shikari> when H.264 came out, they were going to bootstrap it to h264
[05:12:44] <Dark_Shikari> it was DSP-based
[05:12:53] <Dark_Shikari> the guy who ran the project decided they didn't need deblocking
[05:12:57] <Dark_Shikari> because it was "only there to fix mistakes you made"
[05:13:05] <Dark_Shikari> "and we don't make mistakes"
[05:13:09] <michaedw> and the second, and the third, and the fourth
[05:13:21] <michaedw> *serious* infant mortality :-)
[05:13:25] <Dark_Shikari> by the time they finished, they realized that they didn't even have enough power to do both intra and inter analysis per block
[05:13:35] <Dark_Shikari> they ended up pitching out the entire MB core and buying another.
[05:13:41] <michaedw> yum
[05:14:42] <michaedw> anyway, I may be clueless about VP8, but not because I never thought about motion compensation :-)
[05:14:55] <Dark_Shikari> I don't think much of anyone has a clue about vp8
[05:15:11] <Dark_Shikari> even the original devs must be high
[05:15:29] <Dark_Shikari> that's the only way they could have come up with some of this shit
[05:15:43] <michaedw> it looks to me like it's designed for feasible encoding on foreseeable mobile processors
[05:15:58] <michaedw> multiple ARM cores with NEON SIMD, for instance
[05:16:36] <Dark_Shikari> _encoding_?  you crazy?
[05:16:40] <Dark_Shikari> their current encoder is slow as crap :/
[05:16:50] <michaedw> lots of raw ops, but inadequate memory bandwidth and no cache to speak of
[05:16:51] <Dark_Shikari> and none of it screams "fast encoding" to me
[05:17:03] <Dark_Shikari> no cache?  they have just as much L1 cache as a modern x86
[05:17:16] <michaedw> it's the small L2 that hurts
[05:17:19] <Dark_Shikari> No, L2 is useless
[05:17:27] <Dark_Shikari> L2 is only necessary to catch what falls out of L1
[05:17:55] <Dark_Shikari> ok, obviously not useless, but it's not going to be the killer
[05:18:01] <Dark_Shikari> and VP8 did nothing to save L2
[05:18:19] <Dark_Shikari> at least not as far as I can see
[05:18:36] <michaedw> L2 is what cuts your memory bandwidth down to size, and allows read-modify-write to suck less
[05:18:49] <Dark_Shikari> but nobody does read-modify-write.
[05:18:59] <Dark_Shikari> unaligned stores simply don't exist in video encoders
[05:19:07] <Dark_Shikari> unaligned _loads_ are a huge cost
[05:19:08] <michaedw> what else is a RAM access smaller than a cache line, if not a read-modify-write?
[05:19:35] <Dark_Shikari> I'm pretty sure the CPU doesn't have to update RAM until the cacheline is evicted.
[05:19:41] <michaedw> *exactly*
[05:21:37] <michaedw> cache-bypassing loads, combined with stores to "freshly allocated" memory (so the CPU is told not to bother loading the pre-existing contents of the cache line)
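    (For reference, the x86 flavour of "stores that skip reading the old cache-line contents" looks roughly like the sketch below, using the SSE2 streaming-store intrinsic. The function name and the aligned-destination assumption are mine, and the ARM variant being discussed here would use different instructions.)

        #include <emmintrin.h>  /* SSE2 */
        #include <stdint.h>

        /* Copy one row of pixels with non-temporal (streaming) stores.
         * The destination lines go out through write-combining buffers, so the
         * CPU neither reads the old contents of those cache lines first nor
         * displaces anything useful from the cache.
         * Assumes dst is 16-byte aligned and width is a multiple of 16. */
        static void copy_row_streaming(uint8_t *dst, const uint8_t *src, int width)
        {
            for (int x = 0; x < width; x += 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)(src + x));
                _mm_stream_si128((__m128i *)(dst + x), v);
            }
            _mm_sfence();  /* make the streaming stores globally visible */
        }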
[05:21:55] <Dark_Shikari> cache-bypassing loads?  that's retarded
[05:22:05] <Dark_Shikari> then you have 300 cycles of latency you have to somehow hide
[05:22:27] <michaedw> yep
[05:22:35] <michaedw> *long* pipelines
[05:22:39] <Dark_Shikari> This is stupid.
[05:23:03] <Dark_Shikari> The best way to deal with a small L1 is to keep your working set small.
[05:23:08] <michaedw> 32-byte-wide cache-bypassing loads
[05:23:09] <Dark_Shikari> Not to use tons of "cache bypassing loads"
[05:23:28] <Dark_Shikari> I give up, I'll wait till morning for mru to come back and beat sense into you
[05:23:35] <Dark_Shikari> since he knows more about ARM than every single person in this channel combined
[05:24:03] <michaedw> I'm sure he does know way more than I do
[05:24:09] <michaedw> about ARM, among other things
[05:25:53] <michaedw> but check out VLDM some time
[05:26:40] <Dark_Shikari> what's special about VLDM?
[05:26:42] <Dark_Shikari> it's a nice instruction
[05:27:43] <michaedw> especially when you use it to fetch a whole, aligned cache line
[05:29:22] <michaedw> direct access to L2 cache, bypassing L1
[05:30:12] <michaedw> (you force a preload into the L2 cache in advance, using LDR)
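    (A rough sketch of the PLD + VLDM pattern being described, as GCC inline asm. The register choice, the 64-byte prefetch distance and the idea of pairing the two this way are assumptions for illustration, not ffmpeg code; in real code the loaded registers would feed the following NEON arithmetic.)

        #include <stdint.h>

        /* Hint the next cache line of the source into the cache, then pull an
         * aligned 32-byte chunk straight into NEON registers with VLDM. */
        static inline void neon_load_32(const uint8_t *src)
        {
        #if defined(__ARM_NEON__)
            __asm__ volatile (
                "pld  [%0, #64]          \n\t"  /* prefetch one line ahead */
                "vldm %0, {d0-d3}        \n\t"  /* 32 bytes into d0..d3    */
                :
                : "r"(src)
                : "d0", "d1", "d2", "d3", "memory");
        #else
            (void)src;
        #endif
        }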
[05:30:56] <astrange> where's a set of vp8 files?
[05:31:06] <michaedw> in a way, it's back to the bad old days of explicit prefetch
[05:31:40] <michaedw> but it makes up for the neon<->arm latency
[05:31:43] <Dark_Shikari> astrange: http://code.google.com/p/webm/downloads/detail?name=vp8-test-vectors-r1.zip&can=2&q=
[05:34:18] * Dark_Shikari wonders where BBB is
[05:35:19] <michaedw> analogous tricks worked back in the StrongARM days; we got about 20fps quarter-VGA MPEG-1 decode on a 200-MHz SA-1100 (8KB data cache)
[05:37:16] <michaedw> I expect to see VP8 doing 720p30 encode on one core of a dual ARMv7, leaving enough memory bandwidth for the other to do UI and network and such
[05:37:24] <Dark_Shikari> hah.
[05:37:36] <Dark_Shikari> You haven't actually tried this have you
[05:37:46] <Dark_Shikari> I just had to implement 1024x768p30 video playback on the iPad.
[05:37:50] <Dark_Shikari> It cannot decode MPEG-1 in realtime.
[05:38:05] <Dark_Shikari> VP8, by comparison, is marginally more complex than H.264.
[05:38:19] <Dark_Shikari> the iPad has a 1ghz armv7.
[05:38:22] <Dark_Shikari> With neon.
[05:38:42] <ohsix> and unicorns
[05:38:44] <Dark_Shikari> I ended up hand-optimizing the FLV decoder (it's simpler than MPEG-1) for up to 30-40% performance gain.
[05:39:00] <Dark_Shikari> It was still too slow to consistently reach 30fps in the hardest scenes.
[05:39:20] <Dark_Shikari> Trying to play 720p (higher res) VP8 (easily 2-3x harder than mpeg-1) on such a thing would be laughable
[05:39:30] <Dark_Shikari> with lavc, it can't play h264 at 15fps.....
[05:39:31] <Dark_Shikari> with deblocking off
[05:39:32] <Dark_Shikari> subpel off
[05:39:34] <Dark_Shikari> cabac off
[05:39:36] <Dark_Shikari> bframes off
[05:39:38] <Dark_Shikari> weighted pred off
[05:45:40] <michaedw> I would expect to see the cabac equivalent on the other cpu (the one doing the network layer anyway).  there are no B frames in VP8.  subpel will probably be scaled down on mobile encoders to 1/4 pixel units, maybe even 1/2 pixel.
[05:46:07] <Dark_Shikari> it already is 1/4 pixel units.
[05:46:13] <Dark_Shikari> there's no hpel option in the spec.
[05:46:26] <Dark_Shikari> b-frames decrease complexity when arithmetic coding is enabled.
[05:46:43] <michaedw> for decoding, yes
[05:46:57] <peloverde> DS, but can't the beagle do 720p h.264 in software?
[05:47:00] <Dark_Shikari> no way
[05:47:02] <Dark_Shikari> no way in hell
[05:47:07] <Dark_Shikari> Unless you're using the C64x+ DSP.
[05:47:15] <Dark_Shikari> beagle is 40% slower than an ipad
[05:47:24] <Dark_Shikari> remember the videowall?  that was mpeg-2 at 960x540 or whatnot.
[05:48:11] <superdump> i thought the videowall at linuxtag this year was running the native resolution of the array of monitors
[05:48:20] <superdump> for bbb
[05:48:22] <michaedw> I am not talking about currently shipping chips.  I am particularly thinking of this: http://www.qualcomm.com/news/releases/2010/06/01/qualcomm-ships-first-dual-cpu-snapdragon-chipset
[05:49:02] <Dark_Shikari> superdump: yes
[05:49:04] <Dark_Shikari> which was that iirc
[05:49:10] <Dark_Shikari> remember the total was 2700xsomething
[05:49:15] <Dark_Shikari> that means each one was width ~900something
[05:49:31] <superdump> oh, of course
[05:49:32] <michaedw> the hardware 1080p is good, but power-intensive
[05:50:11] <superdump> a friend did a brief test of power usage when decoding 1080p h.264 on his macbook pro the other day
[05:50:19] <superdump> when using the cpu it was using about 10W
[05:50:38] <superdump> when using the gpu (9400M, so maybe it was a macbook, not a pro) it was using about 6W
[05:50:47] <Dark_Shikari> s/gpu/asic
[05:51:13] <michaedw> http://www.engadget.com/2010/06/21/toshibas-ac100-8-hour-smartbook-runs-android-2-1-on-a-1ghz-tegr/
[05:52:24] <michaedw> that's unlikely to do 720p30 encode, though
[05:52:31] <Dark_Shikari> or decode
[05:54:12] <michaedw> decode using the hardware unit, I don't see why not
[05:54:18] <Dark_Shikari> well of course
[05:54:20] <michaedw> though again that probably eats power
[05:54:23] <Dark_Shikari> but the "hardware unit" is very specialized
[05:54:47] <michaedw> parts of it are.  the rest is just DSP.
[05:55:21] <Dark_Shikari> as I said, talk to mru
[05:55:28] <Dark_Shikari> something something OMAP4 something
[05:55:37] <Dark_Shikari> they rely extremely heavily on asynchronous functional units connected by  sram
[05:55:44] <Dark_Shikari> e.g. "h264 motion compensation" and "h264 idct"
[05:55:47] <Dark_Shikari> and "h264 cabac"
[05:55:50] <Dark_Shikari> with 2000-page APIs
[05:55:58] <michaedw> dedicated cabac, I expect; may or may not be flexible enough to do VP8's entropy encoding
[05:56:31] <Dark_Shikari> Having glanced at the APIs...... no.
[05:56:32] <michaedw> if it's anything like the Qualcomm equivalent, yes, it relies on some tightly-coupled memory
[05:56:56] <michaedw> and cache-bypassing access to it :-)
[05:57:03] <Dark_Shikari> it's "flexible" because they hardcode 5 different codecs into the silicon
[05:57:15] <Dark_Shikari> not because the silicon is flexible enough to do 5 different codecs
[05:58:05] <michaedw> yes, but that's not what the next generation is going to look like
[05:58:12] <Dark_Shikari> OMAP4 is the next generation.
[05:58:17] <michaedw> for TI
[05:59:16] <michaedw> the only OMAP generation I know well is the fixed-point DSP version, OMAP5912 and the like; oldish now
[06:02:58] <michaedw> what's shipping today looks more like this: http://www.radvision.com/Corporate/PressCenter/2009/25march2009_hd_engine.htm
[06:03:34] <michaedw> in terms of video coding on TI chips, that is
[06:03:46] <Dark_Shikari> also, I know that mediatek chipsets work the same way.
[06:03:49] <Dark_Shikari> i.e. tons of hardcoded shit.
[06:04:14] <michaedw> sigma designs went pretty far down that road, too
[06:04:30] <michaedw> but I think it's a dead end
[06:04:43] <Dark_Shikari> I think it's fine, because a new video format doesn't come out every 5 days
[06:04:54] <michaedw> I certainly wouldn't design a product today around a hard-function video pipeline
[06:05:07] <michaedw> any more than I would around a hard rendering pipeline
[06:05:13] <michaedw> compare OpenGL ES 1.1 and 2.0
[06:06:07] <michaedw> you know any hard encoders that do a good job of GDR?
[06:06:27] <Dark_Shikari> "hard encoders"?
[06:06:31] <Dark_Shikari> as opposed to easy encoders?
[06:06:51] <michaedw> true hardware h.264 encoding silicon
[06:07:05] <michaedw> not DSPs or FPGAs
[06:08:18] <Dark_Shikari> I don't know of any true hardware encoding silicon.
[06:08:24] <Dark_Shikari> I know it exists, but I don't know of anything in particular.
[06:08:37] <Dark_Shikari> I know of one "hardware" solution that does GDR, but it's a DSP and it's godawful buggy shit
[06:08:56] <Dark_Shikari> and call it intra refresh.
[06:09:02] <michaedw> yeah, I can't find anything I'd take a second look at, either
[06:09:31] <Dark_Shikari> I love it when companies release "hardware" solutions that can't even do their claims
[06:09:34] <Dark_Shikari> the one I mentioned, the DSP
[06:09:36] <j0sh_> michaedw: ittiam h264 uses gdr, but i'm 90% sure its dsp based
[06:09:44] <Dark_Shikari> did 720p... with no partitions, no ratecontrol, no subpel, no nothing
[06:09:49] <Dark_Shikari> the instant you added anything it went slow as crap
[06:09:54] <michaedw> I'd rather let DSPs do what they're good at (DCTs), hardware do what it's good at (bitstreams), and software do what it's good at (field updates)
[06:10:02] <Dark_Shikari> you don't need DSPs for DCTs
[06:10:09] <Dark_Shikari> DCTs are fast.
[06:10:54] <michaedw> but they also involve moving a lot of data in and out, and I'd rather not waste my CPU's memory bandwidth on that
[06:11:17] <michaedw> I don't really want to see the pixels until after they've been DCTed and quantized and packed
[06:11:31] <Dark_Shikari> "packing" is bitstream
[06:11:34] <Dark_Shikari> quantization is fast
[06:11:44] <Dark_Shikari> 8x8 h264 dct on i7: 51 cycles
[06:11:56] <michaedw> packed into residual blocks, pre-cabac
[06:12:02] <Dark_Shikari> 8x8 h264 quant on i7: 22 cycles
[06:12:18] <michaedw> sure, it's fast, once you've got it in L1
[06:12:27] <Dark_Shikari> 8x8 h264 zigzag: 23 cycles
[06:12:30] <michaedw> I have better uses for L1
[06:12:32] <Dark_Shikari> It's always in L1
[06:12:34] <Dark_Shikari> in an encoder, at least
[06:12:46] <Dark_Shikari> And if you think that 256 bytes of L1 is your big worry...
[06:12:49] <michaedw> yes, and it can be in the other core's L1, thank you :-)
[06:12:58] <Dark_Shikari> oh no, 256 bytes of L1 spent
[06:13:00] <Dark_Shikari> on something important
[06:13:01] <Dark_Shikari> whatever will I do!
[06:13:17] <michaedw> it's not the footprint, it's the bandwidth
[06:13:23] <Dark_Shikari> L1 bandwidth is practically unlimited.
[06:13:33] <michaedw> bandwidth to main memory is not
[06:13:40] <Dark_Shikari> But this never reaches main memory.
[06:13:49] <Dark_Shikari> Ever.
[06:13:56] <Dark_Shikari> Except maybe during a context switch.
[06:13:56] <michaedw> pixels gotta get in there somehow
[06:14:00] <Dark_Shikari> Pixels != dct
[06:14:41] <Dark_Shikari> Let's just say that the benchmarks have consistently shown that x264 is not bottlenecked by main memory bandwidth.
[06:14:51] <Dark_Shikari> To the point where people have added faster DDR and measured zero performance change.
[06:14:53] <michaedw> on an i7, sure
[06:15:07] <Dark_Shikari> i7s don't have a lot of memory bandwidth.
[06:15:15] <michaedw> I want all that done with negligible impact on my poor mobile DDR
[06:15:27] <Dark_Shikari> They have quite a bit less per compute power than many smaller cpus.
[06:16:10] <michaedw> on the fly, as the pixels arrive, with enough tightly coupled memory to hold one and a half macroblock heights' worth of rows
[06:16:49] <michaedw> the big problem with dsp encoders isn't the dsp, it's the ddr dedicated to it
[06:16:55] <michaedw> cost, power, footprint
[06:18:34] <pengvado> one and a half macroblock rows plus the areas of all the reference frames they use?
[06:18:52] <michaedw> no B frames = only one reference frame
[06:19:02] <Dark_Shikari> what?
[06:19:08] <Dark_Shikari> there is no equivalence there
[06:19:13] <Dark_Shikari> kumquat = harley-davidson
[06:20:11] <michaedw> let me state that the other way around; if you're doing B frames, you're coding relative to an interpolation between previous and next I/P frames
[06:20:46] <michaedw> even if you could afford the latency, the buffering kills
[06:20:53] <Dark_Shikari> ...?
[06:22:22] <michaedw> say you have two B frames between two non-B frames
[06:23:19] <michaedw> you've got to wait until you have both reference frames, then calculate the predicted (interpolated) frame, then calculate residuals, right?
[06:23:50] <michaedw> I am going on MPEG2-era knowledge, but surely if it didn't work that way in MPEG4, they wouldn't call them B frames
[06:24:05] <michaedw> (I don't know, we don't use them)
[06:24:35] <Dark_Shikari> "wait until...." to do what?
[06:25:01] <michaedw> to be able to calculate residuals for the B frames
[06:25:17] <Dark_Shikari> and this is bad because...
[06:25:59] <michaedw> so that's two frames' worth of pixels buffered, in addition to the reference frames
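    (Rough numbers, mine rather than from the discussion: a 1280x720 4:2:0 8-bit frame is 1280*720*1.5 ≈ 1.4 MB, so holding two not-yet-coded B frames plus the future reference they predict from adds roughly 4 MB of buffering on top of what a no-B-frame encoder needs -- negligible on a desktop, noticeable on a memory-constrained SoC.)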
[06:26:11] <Dark_Shikari> and this is bad because...
[06:26:13] <michaedw> not free on an embedded system
[06:26:34] <Dark_Shikari> If we got rid of everything that wasn't free, a lot of things would be pretty shitty.
[06:27:54] <michaedw> I'm not so much arguing for the quality per MB -- VP8 is presumably not for Blu-Ray -- but trying to understand the design choices
[06:28:06] <Dark_Shikari> B-frames weren't included because they're patented.
[06:28:08] <Dark_Shikari> End-of.
[06:28:14] <Dark_Shikari> VP8 uses the altref, which is just as costly as B-frames in terms of memory.
[06:28:27] <Dark_Shikari> And more costly in terms of CPU, because it's coding an entire frame that is never displayed.
[06:28:44] <Dark_Shikari> VP8 uses the golden frame, which is yet _another_ extra frame to store, increasing memory usage.
[06:30:51] <michaedw> but that I can afford in the CPU that sees the entropy-decoded bitstream
[06:31:30] <michaedw> and I can tell the thing that's tightly coupled to the display to prefetch the relevant reference block
[06:32:15] <Tjoppen> so just don't use B-frames? afaik you don't gain much by using B-frames if you can't use the next P-frame as reference. might be useful to reduce idct inaccuracies though
[06:33:56] <michaedw> I'm not saying the whole system has no memory.  just that the thing that has most of the memory bolted to it doesn't want to pollute its cache with raw pixels, reference frames included.
[06:34:52] <michaedw> it's perfectly happy to proxy DMA, with enough prefetch lead time to fit the accesses in among the main CPU's actual cache-miss cycles.
[06:36:18] <michaedw> VP8 looks streaming-optimized to me, with videoconferencing and transcoding in mind
[06:36:34] <Dark_Shikari> no, it looks on2-optimized
[06:36:39] <Dark_Shikari> keep in mind that on2 are not very smart
[06:36:51] <Dark_Shikari> there are many things that are in the spec simply because nobody thought to suggest otherwise
[06:37:04] <Dark_Shikari> there are many other things that are just outright bugs and nobody even noticed
[06:37:11] <Dark_Shikari> there are other things they blatantly copied off h264
[06:37:20] <Dark_Shikari> and everything else they copied off their previous codecs
[06:37:25] <Dark_Shikari> there's practically not an ounce of originality in it
[06:37:33] <Dark_Shikari> "golden frames" are there because vp7 had golden frames
[06:37:33] <michaedw> what's so great about originality?
[06:37:37] <Dark_Shikari> golden frames were in vp7 because vp6 had them
[06:37:39] <Dark_Shikari> they were in vp6 because vp5 had them
[06:37:43] <Dark_Shikari> they were in vp5 because vp4 had them
[06:37:56] <Dark_Shikari> originality is good because it lets you be better
[06:38:00] <Dark_Shikari> you can't beat everyone else by copying them
[06:38:05] <Dark_Shikari> you can only be just as good, at best.
[06:38:14] <Dark_Shikari> To win, you must do something they didn't do.
[06:38:30] <peloverde> "On2 also admitted that it's had trouble hiring and retaining skilled, qualified employees" http://news.cnet.com/8301-1023_3-10410341-93.html
[06:38:35] <Dark_Shikari> lol
[06:39:06] <Dark_Shikari> That's how x264 won.  We didn't merely do what others did, we did stuff they didn't do.
[06:39:09] <michaedw> Google does a *lot* of video transcoding.  They can probably figure out how to specify a streaming-friendly format that's easy to transcode to, and feasible to encode in real-time on mobile hardware.
[06:39:18] <Dark_Shikari> Google didn't do anything
[06:39:19] <Dark_Shikari> they bought on2
[06:39:52] <michaedw> I suppose it's possible that on2 refined their codec in the void, with no target application in mind
[06:40:02] <Dark_Shikari> they refined it with their own target applications in mind
[06:40:13] <Dark_Shikari> In theory.
[06:40:19] <Dark_Shikari> It's a company.  Companies do what companies do.
[06:40:28] <michaedw> "getting bought by Google" is a nice target application
[06:40:28] <Dark_Shikari> they're inefficient and slow.
[06:40:38] <Dark_Shikari> Indeed, for those who still had stock options
[06:40:42] <transport> didn't google also pay x264 devs to code their backends?
[06:40:44] <michaedw> high inertia, that's for sure
[06:40:53] <michaedw> I certainly would, if I were Google
[06:41:09] <michaedw> and couldn't figure out how to hire them without spoiling their effectiveness
[06:41:15] <Dark_Shikari> transport: no
[06:41:29] <Dark_Shikari> michaedw: I'm not sure they were ever effective
[06:41:40] <michaedw> I'm referring to the x264 devs :-)
[06:41:42] <Dark_Shikari> transport: google has gone out of their way to be as detached from the open source community as possible
[06:41:53] <Dark_Shikari> they refuse, by policy, to admit they use ffmpeg.
[06:41:59] <michaedw> Dark_Shikari: not really.  I spent a couple of years there.
[06:42:00] <Dark_Shikari> they refuse, by policy, to contribute patches.
[06:42:09] <Dark_Shikari> they refuse, by policy, to give us sample videos, even if they're in public domain.
[06:42:34] <Dark_Shikari> michaedw: did you know Pascal Massimino?
[06:42:46] <michaedw> there are parts of "the open source community" with which they engage fairly intensively
[06:42:58] <michaedw> kernel, Ubuntu
[06:43:03] <Dark_Shikari> Certainly not us.  Despite the fact that they have tens of thousands of computers running our software.
[06:43:41] <michaedw> never met Pascal
[06:43:44] <peloverde> WebM had dozens of zero day partners but they didn't presubmit their FFmpeg patches
[06:43:51] <michaedw> didn't work in anything remotely video-related
[06:43:53] <Dark_Shikari> ah
[06:43:57] <Dark_Shikari> pascal was an old xvid dev
[06:43:59] <Dark_Shikari> google hired him, he still works there
[06:44:03] <Dark_Shikari> he wrote their original h264 encoder, himself
[06:44:16] <michaedw> open source types tend to go dark on joining google
[06:44:29] <michaedw> sometimes they come out into the light again after awhile, sometimes not
[06:44:29] <Dark_Shikari> Everyone does.
[06:44:38] <Dark_Shikari> that's why I refuse to work for them
[06:44:42] <Dark_Shikari> they tried to hire me
[06:44:46] <Dark_Shikari> No fucking way.
[06:44:48] <michaedw> there's an awfully big playground inside
[06:44:58] <Dark_Shikari> Yes, a big playground with metal bars on the windows.
[06:45:02] <astrange> you're permitted to work on open source in google
[06:45:14] <astrange> but many people seem to not bother doing it in their 20% time
[06:45:16] <Dark_Shikari> astrange: They said I wasn't allowed to own my open source contributions.
[06:45:18] <michaedw> lots of cool stuff to work on, thousands of people they can talk about it with in a completely unrestricted manner
[06:45:21] <Dark_Shikari> I said they could go fuck themselves.
[06:45:39] <michaedw> and money, of course
[06:45:43] <Dark_Shikari> Not in those exact words obviously.
[06:46:13] <michaedw> you have to be pretty organized to make 20% time work for you that way
[06:46:34] <Dark_Shikari> I prefer to work at a company where I have my 100% time.
[06:46:45] <michaedw> organized is perhaps not the first adjective I think of when I think of people who are involved in, but not central to, open source projects
[06:47:05] <michaedw> some people at Google do
[06:47:17] <michaedw> akpm, for instance
[06:47:37] <michaedw> but that's almost more of a sponsorship than traditional employment
[06:47:52] * Dark_Shikari is somewhere in between.
[06:48:34] <michaedw> I don't think "metal bars" is fair.  They just expect value for money, and a fairly high degree of commitment to team goals.
[06:49:03] <Dark_Shikari> "committment to team goals" seems to be codephrase for giving up the community
[06:49:10] <Dark_Shikari> I have never seen a single person absorbed by google who stayed open
[06:49:17] <michaedw> it wasn't the place for me -- more to the point, very much not the right role for me -- but I don't have anything nasty to say about them
[06:49:19] <Dark_Shikari> they go dark and disappear
[06:49:38] <Dark_Shikari> I mean, I don't hate google.  I just don't subscribe to the "omg they're the best place to work evar"
[06:50:30] <michaedw> I liked working there.  Interesting people to rub up against.  I'd have liked it more if those interesting people weren't quite so absorbed in their own projects.
[06:50:40] <KotH> ohayou gozaimasu! [good morning]
[06:50:47] <Dark_Shikari> I guess I've been there... twice now
[06:50:54] <Dark_Shikari> food was decent.  I liked the free Naked Juice things.
[06:51:00] <Dark_Shikari> but facebook had those too, and facebook's food was better.
[06:51:15] <Dark_Shikari> google was a bit larger, too spread out.
[06:51:29] <michaedw> I liked the Facebook culture, too -- but there's less there there, as far as I can see
[06:51:41] <michaedw> Google is struggling with the next e-folding
[06:52:09] <michaedw> I kept telling people there, don't think you're so much better than M$ until you've hit their scale
[06:52:26] <michaedw> I'm old enough to remember when M$ were the good guys
[06:52:42] <Dark_Shikari> no, they never were
[06:52:54] <michaedw> the first Unix you could run on hardware that you could buy retail
[06:53:03] <Dark_Shikari> they were the "good guys" to developers.  that's how they won.
[06:53:26] <Dark_Shikari> Well, and the whole cheating the hell out of ibm.
[06:53:30] <michaedw> and a CP/M clone that mostly worked and had documentation included
[06:53:42] <michaedw> ibm got their slice of the pie
[06:54:12] <transport> even back in the day when bill gates wrote the amiga BASIC they were not good guys LOL
[06:54:16] <michaedw> I was an Apple fanboy back then, but I could give M$ their due
[06:55:18] <michaedw> and COM was bloody brilliant
[06:56:21] <michaedw> anyway, Larry and Sergey (and Eric) are very different people from Bill and Steve
[06:56:31] <michaedw> and choose to run their company very differently
[06:56:48] <michaedw> but they still haven't proven they can scale to 50K people and remain human
[06:58:05] <michaedw> I think Google has the same problem contributing to ffmpeg that most big companies do.  not so much their own IP, but their agreements with other companies not to do things that would compromise their partners' IP
[06:58:34] <michaedw> patent pools and cross-licensing agreements and all that crud
[06:59:33] <peloverde> They don't seem to have that problem with their own homegrown opensource projects
[07:00:27] <peloverde> They also managed to sign on an army of vp8 partners but didn't submit their (messy) libvpx patches against FFmpeg until after the public announcement
[07:00:49] <michaedw> they like to make a big splash
[07:01:04] <Dark_Shikari> not really.  google are the masters of failing to make a big splash
[07:01:10] <peloverde> they could have done both
[07:01:10] <Dark_Shikari> They announce new products all the time and fail to get any traction
[07:01:17] <Dark_Shikari> they're like the opposite of apple
[07:01:25] <Dark_Shikari> google announces tons of cool stuff that nobody ever uses
[07:01:28] <peloverde> enough of us were under NDA anyway
[07:01:28] <michaedw> who wants traction? they want drumbeat
[07:01:30] <Dark_Shikari> apple announces a few shitty things that everyone uses
[07:01:30] <transport> ohh Toshiba are bringing out a new netbook, dubbed the Toshiba AC100.a dual-core ARM Cortex-A9 at 1GHz http://topnews.co.uk/27150-toshiba-unveils-android-based-netbook-and-dual-screen-smartbook
[07:02:13] <peloverde> The patches could have been prereviewed by select individuals
[07:02:23] <michaedw> transport: beat you by a little over an hour :-)
[07:02:30] <peloverde> They managed to partner with Sorenson and Adobe and still "make a splash"
[07:02:42] <Dark_Shikari> peloverde: ironically, they distributed the NDA'd stuff in violation of the gpl, I'm pretty sure
[07:02:46] <Dark_Shikari> my distribution didn't have ffmpeg source
[07:02:48] <Dark_Shikari> as far as I saw
[07:02:55] <michaedw> did you ask for it?
[07:03:09] <Dark_Shikari> No, but there wasn't an offer.
[07:03:12] <peloverde> their initial public release wasn't GPL compatible
[07:03:13] <michaedw> lame
[07:03:24] <Dark_Shikari> what peloverde said as well
[07:03:27] <Dark_Shikari> they didn't plan anything out properly
[07:03:33] <michaedw> "GPL compatible" is an interesting concept
[07:03:42] <peloverde> they distributed a nonfree FFmpeg in chrome for a few weeks
[07:04:11] <transport> ohh but did you also see http://www.youtube.com/watch?v=H4Xr9ZSnXxQ the Toshiba bloke quotes a price of 40,000 to 50,000 Yen which in real money is ?
[07:04:22] <Dark_Shikari> 90 yen -> dollar
[07:04:41] <michaedw> last I checked, Android's compiler didn't do anything exciting for the A9
[07:04:43] <Dark_Shikari> FUCK THIS FUCKING put_vp8_epel8_v6_ssse3 I CANT FIGURE OUT WHERE THE BUG IS AGHGHHHHHHHHHHHHHHH
[07:04:48] <Dark_Shikari> I've spent 3 hours staring at this holy shit
[07:05:06] <michaedw> planning is not Google's strong point
[07:05:30] <astrange> android's compiler is gcc and i don't think they bothered pulling any of google's (very good) gcc engineers to do optimization for it
[07:05:38] <astrange> so you're left with what CS does
[07:06:22] <michaedw> it's not a matter of not bothering; there's just way too much value in what they're already doing to waste them on something as small-time as Android
[07:06:42] <Dark_Shikari> http://pastebin.org/353799 someone give this a glance and figure out where I made my retarded error
[07:06:49] <Dark_Shikari> I get the feeling it's something blatantly obvious that I overlooked
[07:06:53] <michaedw> symptom?
[07:07:10] <Dark_Shikari> all md5s are wrong.
[07:07:12] <Dark_Shikari> output is incorrect.
[07:07:23] <Dark_Shikari> haven't gotten around to printfing.  I should do that.
[07:07:38] <astrange> they mostly seem to be working on compiling google itself, which i guess is more important to them
[07:08:19] <michaedw> a lot of them are working on tuning for new platform variants
[07:09:06] <michaedw> new hardware, new kernel features
[07:09:11] <michaedw> containerization and all that
[07:10:10] <michaedw> that moves the marbles around enough to force retuning of the index serving stack
[07:10:24] <michaedw> there's no secret there
[07:11:06] <michaedw> and there's Google Go, of course
[07:11:54] <astrange> that's just the plan9 people's thing
[07:12:07] <astrange> i'm amazed that they could spend that long at google and apparently do nothing except go
[07:12:31] <michaedw> http://xkcd.com/303/
[07:12:33] <astrange> and ian taylor managed to write gccgo and gold and several other largeish gcc things in the same time frame
[07:14:56] <michaedw> r's public Google CV lists a pretty big Sawzall paper
[07:15:18] <astrange> oh, he did write sawzall
[07:15:29] <astrange> i was thinking of someone else who hadn't apparently produced anything...
[07:15:41] <astrange> kernighan
[07:15:44] <michaedw> that happens to some people at Google too
[07:16:10] <astrange> or dennis ritchie. i forget which one is at google now
[07:16:24] <michaedw> not kernighan, afaik
[07:16:57] <michaedw> Ken Thompson is co-Go
[07:22:59] <michaedw> Dark_Shikari: why the mova m3, m4?
[07:24:25] <Dark_Shikari> to prepare m3 for the next iteration
[07:26:22] <Dark_Shikari> in each iteration, m0 == row -2
[07:26:24] <Dark_Shikari> m1 == row -1
[07:26:26] <Dark_Shikari> m2 == row 0
[07:26:28] <Dark_Shikari> m3 == row 1
[07:26:30] <Dark_Shikari> m4 == row 2
[07:26:32] <Dark_Shikari> m5 == row 3
[07:26:35] <Dark_Shikari> for the 6-tap filter
[07:27:12] <michaedw> I don't quite understand the register allocation; why name the pmaddubsw targets m6, m1, m3?
[07:28:06] <Dark_Shikari> m6 is the accumulator
[07:28:10] <Dark_Shikari> m1 and m3 are to avoid moving registers
[07:28:16] <Dark_Shikari> All register choices are to keep the above mapping.
[07:28:24] <Dark_Shikari> and make sure that at the end of the loop, the registers are ready for the next iteration.
[07:28:28] <Dark_Shikari> *and to make sure
[07:28:34] <Dark_Shikari> i.e. m0 is now m1
[07:28:36] <Dark_Shikari> m1 is now m2
[07:28:37] <Dark_Shikari> m2 is now m3
[07:28:39] <Dark_Shikari> m3 is now m4
[07:28:41] <Dark_Shikari> m4 is now m5
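    (The same sliding-window pattern in plain C, for anyone not reading the asm: after every output row the six row values shift up by one, and dropping any one of those copies -- the missing mova spotted just below -- corrupts every row after the first. The tap values here are placeholders, not VP8's actual per-position coefficients.)

        #include <stdint.h>

        /* One column of a 6-tap vertical filter, structured like the asm loop:
         * r0..r5 hold rows -2..+3 and are rotated after each output row.
         * src must point at the top output row, with two valid rows above it
         * and three valid rows below the last output row. */
        static void vfilter6_column(uint8_t *dst, int dst_stride,
                                    const uint8_t *src, int src_stride, int height)
        {
            static const int taps[6] = { 2, -11, 108, 36, -8, 1 };  /* placeholder */
            int r0 = src[-2 * src_stride], r1 = src[-1 * src_stride];
            int r2 = src[ 0 * src_stride], r3 = src[ 1 * src_stride];
            int r4 = src[ 2 * src_stride], r5;

            for (int y = 0; y < height; y++) {
                r5 = src[(y + 3) * src_stride];
                int sum = taps[0] * r0 + taps[1] * r1 + taps[2] * r2 +
                          taps[3] * r3 + taps[4] * r4 + taps[5] * r5;
                sum = (sum + 64) >> 7;                       /* round, scale */
                dst[y * dst_stride] = sum < 0 ? 0 : sum > 255 ? 255 : sum;

                /* rotate the window: row N's "r1" is row N+1's "r0", etc. */
                r0 = r1; r1 = r2; r2 = r3; r3 = r4; r4 = r5;
            }
        }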
[07:28:57] <michaedw> I think there's a mova m1, m2 missing
[07:29:22] <Dark_Shikari> fuck.  you're right.
[07:29:23] <michaedw> maybe after line 31
[07:29:44] <Dark_Shikari> no, but we already lost m2
[07:30:16] <Dark_Shikari> hmm, what's the most elegant way to do this.
[07:30:58] <michaedw> not quite enough registers to work with, I fear
[07:31:16] <Dark_Shikari> we can kick out m7 and use a memory argument
[07:31:30] <michaedw> oh yeah, that's the other reason I would have rotated the intermediate result :-)
[07:31:31] <Dark_Shikari> I've already kicked out 3 regs for the filter coeffs
[07:32:39] * Dark_Shikari tries
[07:32:41] <michaedw> 6 operands pack into 6 bytes of a register better than into 6 registers, when you only have 8 to work with
[07:32:49] <Dark_Shikari> I wonder if it would be better to unroll it by a factor of 2
[07:32:50] <Dark_Shikari> to avoid the moves
[07:32:55] <Dark_Shikari> then you wouldn't have to shift between iterations
[07:32:59] <Dark_Shikari> just swap back and forth between iterations
[07:33:06] <Dark_Shikari> 'results identical' woot
[07:33:14] <Dark_Shikari> thanks for spotting that.
[07:33:19] <michaedw> no worries
[07:33:38] <Dark_Shikari> now to mix in the hv versions
[07:38:11] <michaedw> by the way, what I meant earlier by fetch prediction is the hardware prefetching based on automatically detected "access streams" that was characteristic of the P4
[07:38:32] <michaedw> decent description of it, and how to code to use it well, in http://www.siam.org/proceedings/alenex/2007/alx07_09pans.pdf
[07:40:11] <Dark_Shikari> well, one all-day MC marathon done
[07:40:16] <Dark_Shikari> my second ever.
[07:41:07] <michaedw> supplemented in the Core architecture by an IP-based prefetcher
[07:43:21] <michaedw> so while every access on row stride is going to cost you a full cache line hit to main memory, at least it'll be prefetched after the first few loop iterations
[07:45:40] <KotH> michaedw: is that paper worth to be read?
[07:48:10] <michaedw> KotH: Kevin Dick is considered pretty good, for an undergrad at the time, and any Amazon-Google collaboration is interesting to me
[07:49:00] <michaedw> use of Numerical Recipes in C as a reference implementation is pretty suspect, though
[07:49:16] <michaedw> the Fortran original was adequate, NRC is awful by any standard
[07:51:47] <michaedw> this paper basically illustrates the difference between cache-efficient coding (which is quite difficult) and prefetcher-friendly coding (which is easier, and has fewer tunable parameters, which are easier to tune empirically without knowledge of cache sizes)
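    (A tiny example of what "prefetcher-friendly" means when walking a plane with a large constant stride: the access pattern is regular enough for the hardware stream prefetcher to lock onto after a few iterations, and the explicit __builtin_prefetch hint -- a real GCC builtin, with a prefetch distance that is just a guess to be tuned -- is the portable fallback.)

        #include <stdint.h>

        /* Sum one column of a large image plane, touching one cache line per row. */
        static uint64_t sum_column(const uint8_t *plane, int stride, int height)
        {
            uint64_t sum = 0;
            for (int y = 0; y < height; y++) {
                /* hint: read-only, low temporal locality, 4 rows ahead */
                __builtin_prefetch(plane + (y + 4) * stride, 0, 0);
                sum += plane[y * stride];
            }
            return sum;
        }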
[07:52:08] <transport> as assembly coders, do you think there's any validity to this guy's claims about glibc's lack of performance and the memory routine optimisations that need doing? http://www.freevec.org/content/commentsconclusions
[07:52:39] <michaedw> certainly true on modern ARM
[07:53:17] <michaedw> but that's a surface design flaw masking a more fundamental design flaw
[07:54:47] <astrange> i haven't seen a lot of str* hotspots in profiles
[07:54:50] <michaedw> you almost never want any C library's implementations of string functions
[07:55:06] <astrange> not that i profile string programs a lot, but such things can be avoided in hot areas in algorithmic ways instead
[07:55:11] <michaedw> in hot places
[07:56:08] <michaedw> either you need real strings, with Unicode and all that jazz, or you need byte arrays
[07:56:26] <Dark_Shikari> transport: glibc is a disaster
[07:56:53] <michaedw> it does a good job on some important things
[07:57:11] <Dark_Shikari> there are two things at fault
[07:57:13] <Dark_Shikari> "uldrich"
[07:57:13] <Dark_Shikari> and
[07:57:16] <Dark_Shikari> "drepper"
[07:57:17] <michaedw> things that are tightly coupled to the kernel
[07:57:32] <michaedw> sure, but he's not the only contributor
[07:57:41] <michaedw> Ingo puts good stuff in
[07:58:04] <kshishkov> ever heard of the first law in organic chemistry?
[07:58:18] <thresh> never speak about organic chemistry?
[07:58:18] <michaedw> brown + any color = brown?
[07:58:19] <Dark_Shikari> I'm pretty sure memcpy is still faster on mac than on glibc
[07:58:21] <Dark_Shikari> which is embarrassing
[07:58:53] <michaedw> it's got to handle all the unaligned cases
[07:59:14] <Dark_Shikari> so does mac's
[07:59:15] <astrange> os x spends a lot of time on memcpy
[07:59:21] <kshishkov> michaedw: almost, "if you mix 10 kilos of jam with 1 kilo of shit you'll get 11 kilos of shit"
[07:59:23] <astrange> er, os x engineers
[07:59:29] <Dark_Shikari> os x's uses palignr to handle the unaligned cases
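    (A simplified sketch of the structure under discussion: align the destination, then move 16 bytes per iteration with unaligned loads and aligned stores. The palignr approach mentioned above keeps both sides aligned and shifts the data in registers instead, with one code path per (src - dst) & 15 value; this version and its name are illustrative only, not any libc's actual code.)

        #include <emmintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        static void *memcpy_sse2(void *dst, const void *src, size_t n)
        {
            uint8_t       *d = dst;
            const uint8_t *s = src;

            while (((uintptr_t)d & 15) && n) {    /* head: align the destination */
                *d++ = *s++;
                n--;
            }
            for (; n >= 16; n -= 16, d += 16, s += 16)
                _mm_store_si128((__m128i *)d,
                                _mm_loadu_si128((const __m128i *)s));
            while (n--)                           /* tail */
                *d++ = *s++;
            return dst;
        }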
[07:59:54] <transport> isn't that because mac used altivec optimisations whereas linux PPC doesn't use any? is that going to be the same for linux arm/NEON too, i wonder...
[08:00:05] <Dark_Shikari> glibc on anything not x86 usually sucks
[08:00:13] <michaedw> transport: true as of the last time I looked
[08:00:24] <michaedw> may change due to ChromeOS
[08:00:31] <astrange> memcpy and some other things like spinlocks are stored in the kernel specialized for each cpu/other stuff and mapped in at runtime
[08:00:38] <kshishkov> Dark_Shikari: that's why they try to replace it with something else at least on ARMs
[08:00:38] <astrange> i'm not actually sure why this is necessary
[08:01:00] <astrange> but i think it means they can unmap and then remap in a different spinlock implementation if all the secondary threads die, etc
[08:01:32] <michaedw> I think it's so the linker can specialize the call sites at load time
[08:02:11] <michaedw> ARM thread-local storage accesses work that way, IIRC
[08:03:13] <michaedw> or are you thinking of the stuff that's in whatever acronym replaced VDSO?
[08:03:40] <astrange> i was still talking about the OS X kernel
[08:03:51] <astrange> this one is called commpage
[08:05:06] <michaedw> oh; linux/glibc do something rather like that for certain instructions that require CPU-level access privileges but not privileged memory access
[08:06:35] <michaedw> they trap into the kernel because they're illegal instructions in user mode; the kernel peeks at the IP that trapped, sees that it's in the special page, and returns with the relevant privileges enabled
[08:06:52] <michaedw> trusting that the special page will shut them off again before returning to normal code
[08:07:13] <astrange> moving the ffplay timestamp reordering code to cmdutils.c is making me feel bad for some reason
[08:07:34] <michaedw> sorry, that's the mechanism that replaced VDSO; no relation to the link-time call site specialization
[08:08:01] <astrange> call site specialization can be done by defining memcpy to be a function pointer in the header
[08:08:18] <astrange> would need special knowledge in gcc though
[08:11:16] <michaedw> in userland it's usually done a la libm
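    (The userland pattern being contrasted with the kernel's call-site patching, sketched in C: detect CPU features once at init and route everything through a function pointer afterwards. The names have_neon, memcpy_neon and fast_memcpy are invented for the example.)

        #include <stddef.h>
        #include <string.h>

        /* Illustrative feature check; a real one would read HWCAP on Linux/ARM
         * or use cpuid on x86. */
        static int have_neon(void) { return 0; }

        static void *memcpy_neon(void *dst, const void *src, size_t n)
        {
            /* a NEON body would go here; fall back to libc in this sketch */
            return memcpy(dst, src, n);
        }

        /* Resolved once, called through a pointer ever after -- the userland
         * counterpart of the kernel patching its call sites at boot. */
        static void *(*fast_memcpy)(void *, const void *, size_t);

        static void init_dispatch(void)
        {
            fast_memcpy = have_neon() ? memcpy_neon : memcpy;
        }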
[08:15:10] <michaedw> I seem to be unable to shake the name of the special ELF section with the cpu-specific routines out of the cobwebs in my head, even with Google's help
[08:16:58] <astrange> don't see it in objdump -h
[08:17:14] <michaedw> may be ARM-specific
[08:23:34] <michaedw> no, I'm thinking of x86
[08:23:56] <michaedw> I remember seeing multi-byte NOPs
[08:24:18] <michaedw> being used to pad out the replacement code to the same length as the code it replaced
[08:25:56] <michaedw> alternative something
[08:26:37] <michaedw> Bingo: arch/x86/include/asm/alternative.h
[08:28:09] <michaedw> astrange: .altinstructions and .altinstr_replacement
[08:34:42] <michaedw> Dark_Shikari: decent explanation of use of LDR as an L2 prefetch instruction on Neon here: http://forums.arm.com/lofiversion/index.php?t12665.html
[08:38:42] <transport> <michaedw> may change due to ChromeOS -- ohh, so you think they may actually write SIMD optimisations for core libs to use NEON, and provide up to the 25% increase in application performance the freevec guy claims for altivec? but will any such ARM ChromeOS patches be backported upstream so everyone can benefit, not least the ffmpeg C code calling these faster simd routines generally...
[08:40:45] <michaedw> transport: all I can say at the moment is that it would certainly be feasible to apply a certain amount of knowledge about NEON internals to the problem of speeding up memcpy on specific ARMv7 implementations
[08:41:16] <astrange> faster memcpy can be added directly to ffmpeg if it helps benchmarks
[08:41:24] <astrange> see fastmemcpy in mplayer
[08:41:40] <astrange> i complained the last time someone suggested porting that, but only to point out that we needed a benchmark for it
[08:41:53] <michaedw> and that Google and/or its hardware partners may have some economic interest in finding a way to do this without compromising relevant IP
[08:42:02] <astrange> (fastmemcpybench in mplayer is broken or at least not necessarily accurate, it very heavily favors whatever memcpy uses the most nontemporal instructions)
[08:42:36] <astrange> of course pointer swapping is preferred wherever we memcpy a lot
[08:42:49] <astrange> the fastest blitter is the one that doesn't exist
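    (What "the blitter that doesn't exist" looks like in code: keep two buffers and exchange the pointers instead of copying the pixels. The struct below is a made-up stand-in; ffmpeg's FFSWAP macro does the same pointer exchange.)

        #include <stdint.h>

        typedef struct Picture {
            uint8_t *data[3];
            int      linesize[3];
        } Picture;

        /* Instead of memcpy()ing the new frame over the old one, exchange the
         * buffer pointers: O(1) instead of O(width*height), zero memory traffic. */
        static void swap_pictures(Picture *a, Picture *b)
        {
            Picture tmp = *a;
            *a = *b;
            *b = tmp;
        }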
[08:43:05] <michaedw> the effort is probably better spent on something analogous to the kernel slub allocator
[08:43:37] <michaedw> for things like NAL units
[08:43:40] <roxfan> can you just copy apple's memcpy or it's protected somehow?
[08:44:22] <michaedw> or on using something more rope-like
[08:45:00] <michaedw> I like Vstr myself
[08:49:57] <roxfan> http://img256.imageshack.us/img256/2824/memcpyv7.png <- apple's armv7 memmove
[08:51:14] <kshishkov> looks more or less reasonable for memmove but not for memcpy
[08:51:27] <roxfan> memcpy is just a thunk to memmove
[08:51:37] <astrange> memmove and memcpy are the same function
[08:51:44] <kshishkov> ewww
[08:51:56] <astrange> one of memset(0 and bzero points to the other, i can't remember which
[08:52:23] <astrange> you can copy whatever you see at http://fxr.watson.org/ or opensource.apple.com
[08:52:27] * kshishkov remembers that wonderful change from android libc
[08:52:45] <astrange> memset(x,0 of course
[08:57:32] * _troll_ reporting for duty
[08:57:39] <KotH> kshishkov: the memset one?
[08:59:45] <roxfan> hm, bionic's memcpy has some neon
[09:00:57] <astrange> but just write your own instead of copying, it's more educational
[09:02:03] * elenril is surprised there's no flames about vp8 decoder
[09:02:18] <kshishkov> KotH: but of course
[09:02:51] <KotH> astrange: it's only educational, if someone else reviews it and tells you how to improve it :)
[09:02:51] <_troll_> elenril: wait for it...
[09:02:59] <kshishkov> elenril: why should they appear?
[09:03:04] <_troll_> and you listen
[09:03:20] <kshishkov> KotH: doing it may be an education too
[09:03:52] <elenril> kshishkov: because it's ffmpeg?
[09:04:00] <elenril> there are always flames
[09:04:26] <KotH> elenril: no flame today, flame tomorrow. there is always a flame tomorrow
[09:04:39] <kshishkov> elenril: we don't flame about having native decoders even for most crappy formats. Even for Jar-Jar Video
[09:04:55] <KotH> lol
[09:05:10] <Tjoppen> :)
[09:05:40] <transport> Description for http://freevec.org/function/memmove and http://freevec.org/function/memset with graphs
[09:07:01] <kshishkov> what purpose do they serve except cache size detection?
[09:13:20] <transport> the purpose seems to be to show that if you write or re-use his SIMD optimisations you generally get a large boost in throughput; it's working code, so worth a look perhaps? you don't lose anything by bringing these techniques to the table
[09:13:25] <michaedw> how interesting do you consider out-of-order data arrival, over (say) RTP?
[09:13:31] <_av500_> astrange: android gcc is not CS, it is stock gcc with google patches
[09:14:21] <michaedw> google patches and google-cherry-picked-from-gcc-mailing-list patches
[09:14:22] <astrange> CS contributes back to upstream ARM code more than google, i mean
[09:14:31] <michaedw> the latter being mostly of CS origin
[09:15:02] <michaedw> and who writes the checks to CS?
[09:15:42] <michaedw> ARM and ARM licensees, mostly, where the ARM port is concerned
[09:15:52] <michaedw> including Android hardware partners
[09:15:59] <wbs> michaedw: what about rtp and out of order arrival?
[09:16:48] <michaedw> tends to result in an advantage for vstr and similar libraries
[09:17:25] <michaedw> the jitter buffer is just a vstr
[09:18:28] <wbs> uhmm, if you say so
[09:22:04] <_troll_> I wouldn't take his word for it
[09:22:13] <_troll_> whatever vstr means in this context
[09:22:42] <michaedw> first Google hit: http://www.and.org/vstr/
[09:23:22] <_troll_> oh that
[09:23:40] <_troll_> did anyone tell you you're not supposed to do heavy string processing in C?
[09:24:26] <michaedw> the string processing per se I could take or leave, the rope-like semantics without C++ may be appropriate for this job -- or may not
[09:24:51] <_troll_> rope is great... if you wish to hang yourself
[09:25:26] <michaedw> nah, I use a python for that; does the squeezing for you
[09:27:13] <_troll_> that I'll agree with
[09:32:46] <michaedw> does the rtp client support interleaved packetization mode?
[09:33:10] <wbs> michaedw: that's up to each depacketizer, but none of them supports it at the moment, afaik
[09:37:55] <michaedw> ah, right, it's Android's OpenCore that supports it
[09:39:17] <_av500_> oc ftw!
[09:39:32] <michaedw> the Apache-licensed portion
[09:41:21] <markuman> isn't this patch relevant? http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2009-October/076749.html
[09:43:43] <michaedw> markuman: I don't think so
[09:55:57] <markuman> michaedw: but without this patch, i get this error message with the last frame when using -vcodec libx264
[09:56:30] <michaedw> oh, sorry, thought you meant relevant to my current hobbyhorse :-)
[09:57:08] <markuman> michaedw: no, general in ffmpeg
[10:04:31] <markuman> i wonder because this patch is one year old and ffmpeg has this error to this day
[10:06:22] <wbs> markuman: well, Michael posted some follow-up questions to that patch there, that didn't seem to get responded to, so feel free to pick up the patch and continue the review of it
[10:27:54] <markuman> wbs: That's too deep for me :) i just know that the patch is working for me
[10:32:48] * KotH feeds the _troll_ with some kebab
[10:33:21] <_av500_> shish?
[10:34:29] <KotH> nope, adana
[10:35:44] <_av500_> yum
[10:58:33] <DonDiego> do we have any vp8 samples?
[10:58:41] <DonDiego> there are none in the samples collection
[10:58:44] <Tjoppen> ooh, native vp8 finally commited
[10:59:02] <DonDiego> how does it compare speed-wise?
[10:59:48] <Tjoppen> judging from the ml it's already quite a bit faster than the reference implementation
[10:59:52] <thresh> i was about to ask that very same question
[11:00:38] <Tjoppen> but apparently not up to spec, since the spec fails to specify how to handle some of the stuff in the test vectors
[11:00:54] <mru> it is up to spec
[11:01:01] <mru> libvpx is buggy :-)
[11:01:13] <Tjoppen> hehe
[11:01:57] <mru> seriously
[11:02:03] <mru> libvpx does things not in the spec
[11:02:03] <kshishkov> DonDiego: feel free to put something like http://lachy.id.au/lib/media/elephantsdream/Elephants_Dream-720p-Stereo.webm
[11:02:05] <_av500_> i thought libvpx is the spec?
[11:02:28] <merbzt> no, just the ref code
[11:02:37] <DonDiego> kshishkov: how large is that sample?
[11:02:47] <mru> _av500_: JM is not the h264 spec
[11:02:57] <kshishkov> DonDiego: 148M IIRC
[11:03:14] <_av500_> mru: JM is not on2 :)
[11:03:40] <_av500_> merbzt: I know what it is supposed to be :)
[11:11:39] <janneg> DonDiego: your youtube-dl -f 45 $URL
[11:12:11] <kshishkov> and maybe some test vectors from webm wherever they are
[11:12:30] <janneg> libvpx with asm is here twice as fast as the native decoder
[11:14:07] <kshishkov> we haven't got asm for native decoder yet, have we?
[11:14:26] <_av500_> wasn't BBB writing some?
[11:14:28] <janneg> 5.3s vs. 11.1s on http://www.youtube.com/watch?v=uDJXzm4R-U8
[11:14:37] <janneg> kshishkov: nothing committed yet
[11:14:57] <janneg> BBB and Dark_Shikari are busy writing some
[11:15:17] <_av500_> bbbusy
[11:15:31] <kshishkov> I know, so have you tried it with hand-applied asm patches?
[11:15:40] <janneg> no
[11:16:24] <kshishkov> so it's a rather useless benchmark
[11:19:07] <_av500_> kshishkov: we could spin it :)
[11:19:16] <_av500_> native is only 3dB less fast than libvpx
[11:23:27] <andoma> wbs: ?
[11:23:36] <wbs> andoma: pong?
[11:23:53] <andoma> wbs: r23706 makes avio.h no longer freestanding
[11:24:27] <andoma> AVClass is not defined
[11:24:32] <wbs> ah, crap
[11:24:37] <janneg> kshishkov: Dark_Shikari's latest patch doesn't compile
[11:24:47] <andoma> i guess you either need to include it, or add 'struct' to it
[11:24:57] <andoma> to the line, .. i guess the latter is preferred
[11:25:07] <andoma> include log.h, that is
[11:25:29] <mru> just add the required #include
[11:25:44] <andoma> wbs: you fix?
[11:25:49] <wbs> yeah, will do
[11:25:52] <andoma> sweet
[11:27:06] <wbs> peloverde: libavcodec/aacps.h fails make checkheaders btw
[11:27:21] <CIA-99> ffmpeg: mstorsjo * r23734 /trunk/libavformat/avio.h: Add all required includes to avio.h
[11:27:22] <CIA-99> ffmpeg: mstorsjo * r23735 /trunk/libavformat/avio.c: Reindent
[11:27:37] <wbs> andoma: there you go
[11:29:44] <andoma> thanks
[11:34:32] <janneg> kshishkov: down to 8.9s with jason's latest patch
[11:38:46] <_av500_> apply it once more and you should be there
[11:45:05] <CIA-99> ffmpeg: diego * r23736 /trunk/libavcodec/aacps.h: Add required #includes to pass 'make checkheaders'.
[11:51:44] <janneg> _av500_: and if I apply it 5 times I'll have the decoded frames before I started to decode?
[11:51:59] <mru> no
[11:52:01] <_av500_> check your hdd, they are already there :)
[11:52:03] <mru> it's multiplicative, not additive
[11:52:26] <mru> otherwise, if you applied it enough times, you'd have the decoded frames before you even started patching
[11:52:44] <mru> that's basically how a time machine is built
[11:52:59] <_av500_> with vp8 asm patches?
[11:53:03] <janneg> just get it from the future
[11:53:20] <mru> _av500_: no, other patches
[11:53:22] <mru> additive patches
[11:53:36] <mru> unfortunately they haven't been discovered yet
[11:53:46] <mru> the guys at lhc are hoping to find them
[11:54:17] <_av500_> uhm, this irc channel already seems to loop back in time
[11:58:02] <mru> hmm, two mpeg4thread failures after the auto-pthreads change
[12:02:37] <mru> and sparc/openbsd
[12:02:46] <mru> smells like uninitialised data somewhere
[12:03:55] <KotH> the smell could also be sopie farting in her sleep
[12:09:25] <KotH> mru: that depends on the definition of definitely
[12:34:49] <KotH> .o0(stupid people are stupid)
[13:01:14] <KotH> mru: you have some contacts at arm, dont you?
[13:01:29] <mru> I know a few people, why?
[13:01:43] <KotH> mru: could you propose to the guys there to standardize usb device interface?
[13:02:12] <mru> no
[13:02:16] <mru> not their problem
[13:02:25] <KotH> mru: every arm vendor has its own usb device system and all of them have blatant bugs
[13:02:31] <CIA-99> ffmpeg: rbultje * r23737 /trunk/Changelog: Add missing changelog entry for VP8 decoder.
[13:02:31] <CIA-99> ffmpeg: rbultje * r23738 /trunk/libavcodec/vp8data.h: Fix a typo, spotted by Diego.
[13:02:38] <mru> complain to the chip vendors
[13:02:46] <KotH> lol
[13:02:48] <KotH> i did with atmel
[13:04:25] <av500> KotH: thats not arms fault
[13:05:00] <KotH> after half a year of discussing with them an easily reproducible race condition in their hw<->sw interface, i got them far enough to tell me that their silicon requires you to write to the registers in a specific way and that the documentation is "right" (although if you do it as written in the documentation you'll end up with a horrible race condition that is so easy to trigger that you can be sure your customer will trip over it)
[13:05:48] <KotH> av500: it might not be their fault, but having one sane interface would also simplify programming for different chips
[13:05:59] <KotH> av500: and they did it for the interrupt system too
[13:06:06] <av500> that is different
[13:06:47] <iive> KotH: hum... i've heard of this firm. I should check if we use something from them.
[13:07:39] <KotH> iive: it's one of the biggest uC manufacturers, especially in the lower power range
[13:08:27] <mru> most chips implement ehci actually
[13:08:32] <KotH> iive: oh.. and if you are using their example code anywhere... drop and rewrite it... it's not worth the effort to fix it
[13:08:36] <mru> but they always need some lower level stuff
[13:08:40] <mru> power management etc
[13:08:47] <mru> you'll never get that standardised
[13:08:56] <KotH> mru: ehci is the host interface. i'm talking about the device interface
[13:09:35] <mru> well, there about half the chips use the buggy mentor usb block
[13:10:09] <av500> KotH: irq system is much closer to the actual cpu core than usb block
[13:10:26] <KotH> mru: power management standardization is quite easy with usb
[13:10:36] <mru> no
[13:10:39] <KotH> mru: because the complete behavior is defined by usb already :)
[13:10:41] <mru> link power perhaps
[13:10:59] <av500> KotH: also, looking at e.g. TI, it is not the actual HW blocks that make problems, it is the interconnects that they always mess up
[13:11:22] <av500> same for other SOC vendors
[13:11:26] <mru> the usb spec doesn't say anything about cpu<->controller interface
[13:11:29] <mru> as you've noticed
[13:11:43] <KotH> av500: hmm.. the msp430s we use here work like a charm
[13:11:56] <KotH> av500: haven't hit any silicon bugs so far
[13:12:04] <av500> msp430 is to the OMAP3 what ffmpeg is to the boston strangler
[13:12:09] <mru> KotH: you're not using ffmpeg enough then
[13:12:39] <KotH> mru: i doubt that ffmpeg runs on a msp430 ;)
[13:12:52] <mru> see?
[13:13:08] <av500> KotH: we threw out msp430 for atmega in our designs :)
[13:13:44] <KotH> av500: why?
[13:14:07] <av500> i guess it was 0.2c cheaper
[13:26:41] <lu_zero> uhm
[13:35:56] <mru> BBB: ping
[13:36:17] <BBB> mru: pong
[13:36:24] <BBB> if it's about dest/srcstride, I'll fix that
[13:36:32] <mru> ok
[13:36:55] <BBB> just one of those things I didn't get to and then I forgot
[13:37:09] <mru> you should never have written such code in the first place
[13:37:43] <BBB> you forget that I've never written a video decoder before
[13:37:52] <BBB> be patient while I learn ;-)
[13:38:01] <mru> VLAs are dangerous everywhere
[13:38:05] <mru> not just in video decoders
[13:38:20] <BBB> no I know, but it's easy to forget it here or there
[13:38:36] <BBB> it's like a variable declaration halfway through a block; sometimes it's useful to just do one for debugging and then you forget about it
[13:38:36] <mru> you should never, ever write one in the first place
[13:38:46] <BBB> ok, ok
[13:38:55] <mru> I'm going to make it an error in ffmpeg
[13:38:56] <BBB> I'll remove it, really, before I apply any mmx/thing patch
[13:39:01] <BBB> sure
[13:39:19] <BBB> (gcc has an option for that?)
[13:39:23] <mru> yes
[13:39:27] <BBB> use it
[13:39:38] <mru> I need to eradicate the existing ones first
[13:39:40] <BBB> that way I have no excuses left :)
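    (A sketch of the kind of change being asked for here: replace a stack VLA with a buffer bounded by a compile-time maximum and reject oversized inputs, then build with gcc's -Werror=vla so the old form can't creep back in. The names and sizes below are illustrative, not the actual decoder code.)

        #include <stdint.h>

        #define MAX_BLOCK_SIZE 64   /* illustrative compile-time bound */

        /* Before: int16_t tmp[width * height];  -- a VLA, size unknown at
         * compile time, a stack overflow waiting for a hostile bitstream.
         * After: a fixed upper bound checked up front. */
        static int process_block(const uint8_t *src, int width, int height)
        {
            int16_t tmp[MAX_BLOCK_SIZE * MAX_BLOCK_SIZE];

            if (width > MAX_BLOCK_SIZE || height > MAX_BLOCK_SIZE)
                return -1;                 /* reject instead of overflowing */
            for (int i = 0; i < width * height; i++)
                tmp[i] = src[i];
            return tmp[0];                 /* dummy use */
        }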
[13:47:23] <lu_zero> yawn
[13:47:50] <Honoome> lu_zero: if you're bored I have a few things you could do
[13:48:13] <lu_zero> right now I'm collecting my strength to debug something _quite_ strange in ffmpeg
[13:48:28] <Honoome> too bad
[13:48:42] <lu_zero> the question is how multiple outputs in ffmpeg got broken?
[13:49:08] <lu_zero> Honoome: you found some new bugs?
[13:49:56] <Honoome> lu_zero: daily.. but which project are you interested in?
[13:59:08] <mru> what is the purpose of ff_lpc_compute_autocorr?
[13:59:15] <mru> it appears unused
[14:00:40] <jai> the plain c version is still useful
[14:00:53] <jai> if --disable-asm is used that is
[14:01:20] * mru needs better grepping skills
[14:01:34] <mru> what uses it?
[14:04:16] <jai> mru: flac, alac encoders, right?
[14:04:28] <jai> lavc/lpc.c
[14:04:34] <mru> I know where it is
[14:04:34] <jai> am i missing something?
[14:04:52] <mru> just not in the mood for tracing calls backwards
[14:05:17] <Honoome> mru: cscope helps
[14:05:19] <jai> mru: the lpc coeff calculation code
[14:05:32] <mru> jai: duh
[14:05:45] <mru> Honoome: not when the calls are done via function pointers
[14:06:14] <jai> so flacenc, alacenc and ra14.4k enc
[14:06:25] <Honoome> mru: good point
[14:19:57] <lu_zero> mru: Looks like we should add the chained generation to regtest
[14:20:23] <mru> what?
[14:25:50] <lu_zero> ffmpeg -i foo -s size out.size.blah -s size2 out.size2.blah is broken
[14:27:29] * lu_zero writes a stupid script and lets bisect do the rest...
[14:35:12] * lu_zero screams about ffmpeg being hardly bisectable thanks to libswscale...
[14:42:01] <mru> Vitor1001: ping
[14:42:23] <Vitor1001> mru: pong
[14:42:27] <KotH> lu_zero: in irc, nobody hears you scream
[14:42:28] <KotH> ;)
[14:42:47] <mru> Vitor1001: lp_order arg of ff_acelp_lp_decode()
[14:42:55] <mru> is always 10 as used now
[14:42:56] <Honoome> KotH: that's why I usually phone him to scream at him :P
[14:43:20] <lu_zero> mru: how's your situation? Still having a impending deadline?
[14:43:22] <mru> Vitor1001: is there an upper limit for what this will ever be?
[14:43:39] <Vitor1001> mru: it's unused ATM
[14:43:58] <Vitor1001> I think you can safely assume 16 as max.
[14:44:02] <mru> it's callled from g729dec.c
[14:44:11] <mru> with a value of 10
[14:44:28] <lu_zero> Honoome: right now I envy you and mru manycores
[14:44:40] <mru> sipr16k has a float version of the same function with max 16
[14:44:50] <Honoome> lu_zero: and I'm in need of a manicure...
[14:45:13] <Vitor1001> mru: g729dec.c is incomplete and never compiled
[14:45:37] <mru> lu_zero: I have only 12 core2-class cores
[14:45:48] <Honoome> damn you beat me
[14:45:54] <Honoome> I knew I should have updated to Istanbuls
[14:46:22] <mru> only 8 of them currently powered up
[14:46:46] <mru> Vitor1001: #define MAX_LP_HALF_ORDER 8
[14:46:54] <lu_zero> Istanbuls?
[14:47:02] <mru> lu_zero: turkish cpu
[14:47:09] <lu_zero> tasty!
[14:47:13] <mru> you can grill kebab on them
[14:47:39] <Honoome> lu_zero: Barcelona → quad-core Opteron; Istanbul → six-core Opteron
[14:47:44] <Vitor1001> mru: fine for me
[14:47:57] <mru> Vitor1001: what is fine?  that define exists
[14:48:09] <mru> is half_order half of order?
[14:48:38] <Vitor1001> mru: yes.
[14:48:53] <mru> so making max order twice that number is safe?
[14:48:57] <mru> and sensible
[14:50:22] <lu_zero> Honoome: is there a laptop with it?
[14:50:45] * lu_zero is wondering how much money to waste
[14:52:24] <Honoome> lu_zero: don't think so, but you can get the same dell as I got if you don't care about the touchpad
[14:53:08] <Vitor1001> mru: Yes. The "half_order" in the name is not just some random approximation ;)
[14:53:30] <mru> I wasn't sure there wasn't some +1 or something involved
[14:53:39] <mru> or the half referring to something other than the size
[14:53:57] <lu_zero> Honoome: how much did you pay?
[14:53:59] <Vitor1001> mru: I understand, it happens more often than not...
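
Back to the MAX_LP_HALF_ORDER exchange above: the practical consequence of bounding the order at 2 * MAX_LP_HALF_ORDER is that caller-sized stack arrays can become fixed-size ones. A sketch under that assumption (not the actual lavc code):

    #define MAX_LP_HALF_ORDER 8
    #define MAX_LP_ORDER      (2 * MAX_LP_HALF_ORDER)

    void lp_decode_sketch(const float *lsf, float *lp, int lp_order)
    {
        float tmp[MAX_LP_ORDER];          /* instead of float tmp[lp_order] */
        for (int i = 0; i < lp_order; i++)
            tmp[i] = lsf[i];
        for (int i = 0; i < lp_order; i++)
            lp[i] = tmp[i];               /* placeholder for the real maths */
    }
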
[14:54:06] <Honoome> lu_zero: 1980 :P
[14:54:46] * Honoome is just setting a new IP(v6) up on his dns rather than using /etc/hosts... crazy?
[14:56:43] <lu_zero> 2284 for a similar apple...
[14:56:57] * lu_zero isn't that convinced about the dell monitor
[14:57:06] <Honoome> the dell monitor is _gorgeous_
[14:57:27] <lu_zero> really?
[14:57:29] <Honoome> better than the Macs.. and it's not just my opinion but my typographer's (or however that translates) as well
[14:57:38] <lu_zero> uhmm
[14:58:26] * lu_zero has 10 Dells to be configured for esof and well... they are gouging my eyes out already...
[14:58:41] <Honoome> laptops or standalone monitors?
[14:58:48] <lu_zero> laptops
[14:59:11] <Honoome> *shrug* it's your own fault for having cancelled my trip to geneve :P
[14:59:39] <lu_zero> Honoome: you are more than welcome, the room has 4 beds...
[15:00:24] * mru likes the sony laptop screen
[15:01:04] <lu_zero> I thought the 6+3h would be a bit deadly ^^;
[15:01:44] <Honoome> 6+3? o_O
[15:01:59] <Honoome> what, do you really think I would have come to Turin first, anyway? :P
[15:02:13] <lu_zero> 6 Venezia-Torino + 3 Torino-Geneve
[15:02:19] <lu_zero> Honoome: Puria wants you
[15:04:05] <Honoome> lu_zero: I'd have come to Turin at another time.. that was my idea already ^^ -- Venice-Geneve wouldn't be too bad itself, as I wouldn't have to switch trains at least
[15:05:54] <kshishkov> mru: too bad those screens are handicapped by being attached to a sony laptop
[15:06:12] <mru> kshishkov: what do you have against sony?
[15:06:25] <av500> sony makes good headphones
[15:07:24] <kshishkov> mru: nothing except their proprietary lock-in behaviour and overpricing their products
[15:07:43] <mru> the price is annoying, sure
[15:07:53] <av500> mru: what did u pay?
[15:07:55] <mru> but I see no lockin
[15:07:57] <av500> and what config?
[15:08:12] <kshishkov> mru: ask Benjamin about ATRAC, for example
[15:08:26] <av500> who cares about atrac
[15:08:56] <mru> av500: i5 M540, 8GB, 500GB, 1080p, £lots
[15:09:12] <kshishkov> av500: minidisc users
[15:09:17] <mru> the laptop plays other formats than atrac
[15:09:20] <lu_zero> how much in euro?
[15:09:26] <mru> €lots
[15:09:30] <av500> mru: no SSD?
[15:09:30] <lu_zero> ...
[15:09:39] <mru> ssd isn't worth the overprice
[15:10:04] <kshishkov> av500: for 500GB SSD it would be EUR insanelylots
[15:10:21] <benoit-> lu_zero: €(lots*1.21)
[15:10:27] <lu_zero> thehe
[15:10:54] * lu_zero wants a faster laptop AND a unified git
[15:11:05] <mru> I can give you the git
[15:11:15] <lu_zero> would be great
[15:11:23] <lu_zero> so I could leave it bisecting unattended
[15:11:24] <kshishkov> lu_zero: faster than my Gdium or faster than what you have now?
[15:11:42] <lu_zero> kshishkov: faster as in "find the bug NOW"
[15:12:13] <lu_zero> faster as in "solve it" could be too much though
[15:12:18] <av500> Total Price: £1,799.00 inc. VAT
[15:18:27] <mru> damn this code is ugly
[15:18:30] <mru> shorten.c
[15:20:11] <mru> at least it's good to see our standards have improved
[15:22:11] <kshishkov> look into something older
[15:23:31] <av500> mru: your config is 2089.00€ here
[15:24:14] <Honoome> lu_zero: feel like adding sctp support to lsof or netstat or iproute2 while your laptop looks for the bug? :D
[15:24:34] <lu_zero> uhmm
[15:24:42] <lu_zero> mans where is the unified git?
[15:24:50] <mru> I don't have one
[15:24:55] <lu_zero> iproute2 doesn't support sctp?
[15:25:03] <Honoome> lu_zero: ss doesn't
[15:25:06] <mru> but I can make one
[15:25:12] <mru> when we decide to finally switch
[15:25:29] <lu_zero> after this I think I have a good case for that -_-
[15:25:31] <mru> doing it properly is too much work to do before that
[15:26:06] <Honoome> what are we waiting for? :)
[15:26:14] <mru> godot
[15:26:25] <Honoome> that's whom we're waiting for
[15:26:28] <lu_zero> mru: he's already arrived
[15:36:23] <lu_zero> ...
[15:36:26] <lu_zero> there
[15:37:12] <lu_zero> http://ffmpeg.pastebin.com/6LCXCwsW
[15:37:25] <mru> hehe
[15:38:11] <lu_zero> avfilter ....
[15:38:45] * Honoome mutters a few bad words... why on earth did he decide to go with slackware for kernel hacking, AT ALL?!
[15:39:05] <Honoome> [answer: because it looked faster than installing Gentoo; it was the wrong answer though]
[15:39:50] <lu_zero> ...
[15:39:58] <lu_zero> we should have a ready-to-go stage4
[15:40:06] <lu_zero> _really_ ready
[15:40:08] <mru> a base gentoo install is quite fast
[15:41:23] <Honoome> mru: in kvm?
[15:41:43] <mru> I don't use kvm
[15:41:45] <Honoome> well certainly if I count download time, plus setup time, plus finding out they don't really make much sense time...
[15:41:46] <mru> I have enough real machines
[15:42:00] <Honoome> mru: I'm short of convenience hardware to hack the kernel on
[15:42:23] <lu_zero> mru: do you have time to help me put together a test to trigger the problem?
[15:42:24] <mru> beagles are great for kernel hacking
[15:43:04] * lu_zero has the script but is a bit lost on the Makefile
[15:43:38] * mru should tidy up the regtest part of the makefile some
[15:47:19] <lu_zero> basically I need to run ffmpeg 3 times and compare the outputs
[15:47:51] <mru> what are you testing?
[15:54:14] <lu_zero> http://ffmpeg.pastebin.com/8G1i6EU3
[15:56:56] <mru> why so complicated?
[15:57:24] <mru> run the commands separately and generate checksums etc
[15:57:38] <mru> then compare those against the files generated by the combined command
[15:57:53] <lu_zero> uhm
[15:58:11] <mru> I mean create the checksums outside the test script
[15:58:12] <mru> manually
[15:58:33] <lu_zero> ok
[15:58:36] <lu_zero>  that part is simple
[16:11:30] <xxthink> Are there some tools to output the pts of a specific mp4 file?
[16:19:24] <j0sh_> wbs: i figured it out *just before* i got your email
[16:19:28] <j0sh_> d'ohhh
[16:19:38] <j0sh_> sleep does help :)
[16:19:40] <wbs> :-)
[16:19:45] <wbs> yes, it usually does
[16:20:14] * lu_zero needs some...
[16:20:18] <Honoome> not with idiotic ruby packages, not today as well =_=
[16:20:25] <wbs> after fixing the nitpicks I mailed about, and that little issue, I think it should be quite ok, but I guess I'll read it through once more in more detail then
[16:20:42] <wbs> and wait for comments from lu_zero and BBB if they want to have a say on it
[16:20:45] <BBB> I'll review the next iteration also
[16:20:47] <j0sh_> alrighty
[16:20:50] <BBB> was watching the US game
[16:20:54] <BBB> that was fun :)
[16:21:07] <j0sh_> yeah i think i forgot to format-patch before i sent that round of patches out
[16:21:25] <lu_zero> hopefully I'll be able to review after some rest...
[16:22:07] <av500> BBB: shouldn't KotH be negotiating with them :)
[16:22:18] <BBB> ?
[16:22:34] <av500> turkish telecom...
[16:22:42] <BBB> why him? :-p
[16:23:33] <j0sh_> is there a way to review the history of a particular file?
[16:23:39] <av500> svn log
[16:23:40] <av500> svn blame
[16:24:03] <j0sh_> or in git? i know about blame, but it only gives me the most recent change
[16:24:04] <av500> svn diff -c <changeset>
[16:24:13] <Honoome> j0sh_: git log :P
[16:24:17] <Honoome> git log $filename
[16:24:24] <j0sh_> ok
[16:24:32] <lu_zero> j0sh_: gitk helps
[16:31:54] <j0sh_> lu_zero: wow. gitk is pretty cool
[16:44:09] <lu_zero> =)
[17:07:28] <sjhor_> Could anyone tell me the purpose of the two left_mb_xy values in lavc's h264 decoder?
[17:14:48] <sjhor_> Actually never mind I see what's going on
[17:24:27] <wbs> j0sh_: you could have a look at gitg, too, I prefer its graphical output to gitk
[17:33:59] <j0sh_> wbs: cool, will check that out. digging through the ffmpeg commit history is fun, i could do this all day :)
[17:34:54] <j0sh_> found the commits that added in mpeg4 and aac support also, so the (c) will be fixed in the next round of patches
[17:42:17] <BBB> Dark_Shikari: so what are the issues preventing a commit?
[17:42:36] <BBB> apart from the huge VLA mru complained about, will fix that now
[17:46:31] <Dark_Shikari> BBB: did you get my patch?
[17:46:42] <BBB> I integrated half of yours (mmx changes)
[17:46:47] <Dark_Shikari> You will have to modify all of the asm functions to take two strides
[17:46:50] <BBB> I didn't integrate the ssse3 stuff because I can't test it
[17:46:56] <Dark_Shikari> Um, just locally commit it
[17:47:01] <Dark_Shikari> you don't have to be able to test the ssse3
[17:47:10] <Dark_Shikari> I tested it for you
[17:47:15] <BBB> haha :) ok
[17:47:19] <Dark_Shikari> and if you want I can give you ssh access
[17:47:22] <Dark_Shikari> to test things
[17:47:28] <Dark_Shikari> now, so first of all, the VLA
[17:47:29] <BBB> nah, I'll keep begging for a better cpu
[17:47:42] <BBB> yeah, I'll fix the vla, mru already bugged me
[17:47:43] <Dark_Shikari> You'll have to modify all the asm functions (albeit trivially, and in the same way) to fix this
[17:47:46] <Dark_Shikari> do you see why?
[17:47:48] <BBB> if you have a patch, go commit it
[17:48:02] <BBB> yeah, because they're gonna get two different strides
[17:48:10] <BBB> so you need to remove the sub r0, r1
[17:48:15] <BBB> and use two adds instead of one
[17:48:22] <janneg> Dark_Shikari: yasm complained on x86_64
[17:48:24] <BBB> add r1, src_stride; add r0, dest_stride
[17:48:25] <BBB> or so
[17:48:30] <BBB> I forgot which one is r0
[17:48:41] <BBB> and you need to change the number of registers used in each function from 5 to 6
[17:48:51] <BBB> I think that's all
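
In C terms, the change being described amounts to the MC function pointers growing a second stride argument, so the asm can no longer fold the two with a single "sub r0, r1". The prototypes below are simplified and hypothetical, not the actual lavc declarations:

    #include <stdint.h>

    /* before: one stride shared by dst and src */
    typedef void (*vp8_mc_func_one_stride)(uint8_t *dst, uint8_t *src,
                                           int stride, int h, int mx, int my);

    /* after: separate strides, advanced with two adds per row */
    typedef void (*vp8_mc_func_two_strides)(uint8_t *dst, int dst_stride,
                                            uint8_t *src, int src_stride,
                                            int h, int mx, int my);
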
[17:48:54] <Dark_Shikari> janneg: yes, PIC is broken
[17:48:55] <BBB> is there more I'd need to change?
[17:48:55] <janneg> Dark_Shikari: all the 'FIXME prevent this on X86_64'
[17:49:06] <Dark_Shikari> janneg: that's not the problem
[17:49:08] <Dark_Shikari> the problem is that I broke PIC
[17:49:28] <Dark_Shikari> oh, BBB, the other thing I noted
[17:49:31] <Dark_Shikari> the 8x8 functions are never used
[17:49:35] <Dark_Shikari> if splitmv is on, it does all 4x4 MC
[17:49:36] <janneg> Dark_Shikari: libavcodec/x86/vp8dsp.asm:537: error: invalid size for operand 1
[17:49:37] <Dark_Shikari> which is going to suck
[17:49:55] <BBB> 8x8 should be used for chroma of non-splitmv
[17:50:00] <BBB> if it's not used, there's a bug somewhere
[17:50:07] <Dark_Shikari> BBB: ok, true
[17:50:09] <Dark_Shikari> but still
[17:50:14] <Dark_Shikari> fix that because it's going to be so much faster
[17:50:26] <BBB> ?
[17:50:31] <BBB> what should I fix?
[17:50:39] <Dark_Shikari> the fact that splitmv == 16 MC calls?
[17:50:42] <Dark_Shikari> instead of 1 MC call per partition?
[17:50:49] <BBB> ooh, I see what you mean
[17:50:52] <BBB> yeah ok, will do
[17:51:14] <BBB> that's not trivial, that might take me a day or two
[17:51:18] <Dark_Shikari> >dc_add/mmx is +/- 90 cycles faster
[17:51:22] <Dark_Shikari> You mean 9
[17:51:27] <Dark_Shikari> start_timer is measured in dezicycles.
[17:51:30] <BBB> yes
[17:51:41] <BBB> dezi is german?
[17:51:47] <BBB> brits say deci
[17:51:56] <Dark_Shikari> 1/10th
[17:52:01] <Dark_Shikari> no idea
[17:52:02] <janneg> yes, one tenth
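
For context, the macros being quoted are libavutil's START_TIMER/STOP_TIMER, whose printed figure is in tenths of a CPU cycle ("dezicycles"), hence 90 ≈ 9 cycles. A typical sketched use, assuming the macros as found in libavutil/timer.h and a hypothetical function under test:

    #include "libavutil/timer.h"

    extern void some_dsp_function(void);   /* hypothetical function under test */

    void bench(void)
    {
        START_TIMER
        some_dsp_function();
        STOP_TIMER("some_dsp_function")
        /* the count STOP_TIMER prints is in tenths of a cycle, so a
         * difference of 90 corresponds to roughly 9 real cycles */
    }
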
[17:52:04] <Dark_Shikari> mru: awesome VLA killing
[17:55:35] <BBB> Dark_Shikari: but that second does not prevent me from committing this patch
[17:55:46] <BBB> it just prevents it from having full effect
[17:55:47] <Dark_Shikari> no it doesn't
[17:55:53] <Dark_Shikari> I just wanted to mention it.
[17:55:56] <BBB> ok
[17:55:58] <Dark_Shikari> so basically
[17:56:07] <Dark_Shikari> a) commit my patch locally
[17:56:11] <Dark_Shikari> b) fix vlas in all functions
[17:56:17] <BBB> I don't have your patch :-p
[17:56:22] <Dark_Shikari> Um, I emailed it out...
[17:56:24] <Dark_Shikari> ....
[17:56:29] <BBB> which addy?
[17:56:35] <Dark_Shikari> ffmpeg-devel?!?!?!
[17:56:47] <BBB> which thread?
[17:56:52] <Dark_Shikari> VP8 MMX optimizations?
[17:56:53] <Dark_Shikari> durrhrhrhrh?
[17:57:00] <Dark_Shikari> did you catch the stupid bug today?
[17:57:25] <BBB> I probably didn't read every email or so
[17:57:38] <Dark_Shikari> Um, how about... the most recent one?
[17:57:41] <Dark_Shikari> -.-
[17:58:01] <Dark_Shikari> ok, so you caught the stupid bug today
[17:58:07] <BBB> probably
[17:58:13] <BBB> let me get an axe
[17:58:14] <Dark_Shikari> Things you need to fix:
[17:58:27] <Dark_Shikari> 1) Vararray and all of the functions that take only one stride
[17:58:33] <Dark_Shikari> this can mostly be done by global search/replace etc
[17:58:49] <Dark_Shikari> Make sure to disable SSE2 functions to test the MMX ones and so forth
[17:58:49] <Dark_Shikari> Fix the SSSE3 ones even if you can't test them; I'll check them for you when you're ready.
[17:59:39] <Dark_Shikari> 2) mov r4, r5m apparently broke something on x86_64 according to janneg.  I have no idea what kind of crack he's on, but a simple way to handle that is to get rid of it.  To do this, simply eliminate that and use r5 instead (5,5 instead of 4,5 in the cglobal).
[17:59:56] <Dark_Shikari> This is, again, a very simple single change applied to all functions.
[17:59:58] <BBB> yeah, that's what the FIXME was for
[18:00:06] <Dark_Shikari> Now, once you do these
[18:00:11] <Dark_Shikari> give me the patch, and I'll fix the following:
[18:00:18] <Dark_Shikari> 3) PIC is broken entirely
[18:00:28] <Dark_Shikari> btw, your code was _also_ broken with PIC previously
[18:00:42] <Dark_Shikari> because you assumed the x264 and ffmpeg versions of the x264asm headers matched
[18:00:45] <Dark_Shikari> They didn't.
[18:00:45] <Dark_Shikari> I fixed that.
[18:00:52] <Dark_Shikari> (Not your fault, there's no way you could have known)
[18:01:09] <janneg> Dark_Shikari: yasm 1.0.1.2326
[18:01:11] <BBB> I was about to say, you never told me about pic except that x264inc.asm did something for me related to that
[18:01:55] <Dark_Shikari> BBB: here's the rules about PIC with the latest x264asm
[18:02:05] <Dark_Shikari> 1) PIC is only supported on x86_64.
[18:02:29] <Dark_Shikari> 2) PIC is supported using "wrt rip".  That is, constants are referenced using an offset from the instruction pointer.
[18:02:42] <Dark_Shikari> This is done automatically by yasm (x264asm just turns it on).
[18:02:43] <Dark_Shikari> HOWEVER
[18:03:01] <Dark_Shikari> you cannot do [globalconstant + r4*8 + r2 + 15 wrt rip]
[18:03:03] <Dark_Shikari> too complicated!
[18:03:28] <Dark_Shikari> thus, in PIC, you would have to do something like this:
[18:03:37] <Dark_Shikari> lea r11, [globalconstant]
[18:03:45] <Dark_Shikari> add r11, r2
[18:03:51] <Dark_Shikari> load from [r11 + r4*8 + 15]
[18:04:04] <Dark_Shikari> For cases like this, you can do %ifdef PIC
[18:05:07] <BBB> maybe you should make a macro for that
[18:05:49] <Dark_Shikari> See common/x86/cabac-a.asm for an example.
[18:05:57] <Dark_Shikari> The reason there isn't one is because it's rarely needed
[18:06:02] <Dark_Shikari> almost all times you load a global constant, it's just [pw_64]
[18:06:09] <Dark_Shikari> not [pw_64+r4*8+r2+...]
[18:06:25] <Dark_Shikari> Oh, and in the new x264asm, the ff_ prefix is automatically added.
[18:06:43] <Dark_Shikari> Both to function names and to constant names from ffmpeg.
[18:06:52] <Dark_Shikari> So [pw_64] will reference [ff_pw_64]
[18:07:00] <Dark_Shikari> this is set by %define program_name ff in x86inc.asm
[18:07:01] <Dark_Shikari> See my patch.
[18:07:11] <BBB> ok
[18:08:06] <Dark_Shikari> If you want, I can apply the "update x264asm" changes right now for you.
[18:08:07] <Dark_Shikari> is that ok?
[18:08:17] <BBB> of course
[18:08:18] <Dark_Shikari> this will make the patch smaller and simplify things
[18:08:20] <BBB> that makes it easier for me
[18:08:22] <Dark_Shikari> I hope I don't step on anyone's toes
[18:08:28] <Dark_Shikari> since it does slightly involve modifying other asm
[18:09:32] <Dark_Shikari> what's the regression test command again?
[18:10:04] <_av500_> make test?
[18:10:29] <Dark_Shikari> k
[18:11:13] <mru> lol
[18:11:25] * BBB goes do real work for a little
[18:11:29] <mru> add -j$bignum if you have the cores
[18:11:39] <Dark_Shikari> what does it do when it fails
[18:11:52] <mru> stops
[18:11:57] <mru> add -k to keep going
[18:14:18] <wbs> j0sh_: yeah, a good way of browsing history is a vital part of a version control system
[18:15:29] <_av500_> wbs: enterprise sw does that inline in the file, just follow the #ifdefs
[18:15:48] <wbs> _av500_: yeah, that's quite scary
[18:16:33] <mru> Dark_Shikari: btw, I found several stride-sized vlas in snow
[18:16:38] <Dark_Shikari> ouch
[18:17:11] <Dark_Shikari> BBB: so in short, you fix 1) and 2), I'll make it cross-platform and test it.
[18:17:15] <Dark_Shikari> then we can commit.
[18:18:10] <_av500_> wbs: and old versions can always be found on that coworkers smb share... ;)
[18:18:15] <Dark_Shikari> also, I'd like you to try to make an effort to understand my functions
[18:18:18] <Dark_Shikari> and ask questions if you need to
[18:18:25] <Dark_Shikari> this is a learning experience too, and I never taught you ssse3!
[18:32:56] <wbs> _av500_: oooooh, yeah. version control through smb shares.. don't make me start crying ;P
[18:34:06] <spaam> wbs: and put it on a dropbox share. can it be better? :)
[18:37:37] <mru> ftp!
[18:37:50] <lu_zero> printouts!
[18:37:51] <mru> with 0.9-time passwords
[18:39:46] <Dark_Shikari> mru: btw, did you notice with that patch the incredible stupidity of vp8 mc?
[18:39:58] <mru> I didn't look at the details
[18:39:59] <Dark_Shikari> it's "separable mc", but you can only do it one way...
[18:40:02] <Dark_Shikari> because you do the H pass first
[18:40:07] <Dark_Shikari> then you round back to 8-bit
[18:40:11] <mru> omg
[18:40:11] <Dark_Shikari> and then you do the V-pass on the rounded data
[18:40:23] <Dark_Shikari> Yes, you round _twice_.
[18:40:28] <Dark_Shikari> Gratuitous loss of precision for no reason.
[18:40:31] <mru> twice as round
[18:40:36] <Dark_Shikari> It makes the asm a bit easier to write
[18:40:38] <Dark_Shikari> but not any faster
[18:40:46] <Dark_Shikari> and loses compression for no reason
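
A rough C sketch of the "round twice" behaviour being complained about (illustrative only, not the lavc code): the horizontal pass is rounded and clipped back to 8 bits before the vertical pass runs, instead of keeping a wider intermediate. src is assumed to point into an edge-padded buffer, with w, h <= 16:

    #include <stdint.h>

    static uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    void mc_hv_rounded(uint8_t *dst, int dst_stride,
                       const uint8_t *src, int src_stride,
                       int w, int h,
                       const int8_t *fh, const int8_t *fv) /* 6-tap filters */
    {
        uint8_t tmp[(16 + 5) * 16];                 /* 8-bit intermediate */
        for (int y = 0; y < h + 5; y++)             /* horizontal pass */
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int i = 0; i < 6; i++)
                    sum += fh[i] * src[(y - 2) * src_stride + x + i - 2];
                tmp[y * w + x] = clip8((sum + 64) >> 7);          /* rounding #1 */
            }
        for (int y = 0; y < h; y++)                 /* vertical pass */
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int i = 0; i < 6; i++)
                    sum += fv[i] * tmp[(y + i) * w + x];
                dst[y * dst_stride + x] = clip8((sum + 64) >> 7); /* rounding #2 */
            }
    }
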
[18:41:09] <lu_zero> what happens when you remove them then ^^?
[18:41:49] <Dark_Shikari> remove what
[18:42:06] <lu_zero> the unnecessary round
[18:42:28] <Dark_Shikari> it'll be wrong obviously
[18:43:11] <BBB> Dark_Shikari: I understand what you did to v4/v6, and the dc_add looks pretty straightforward also (same as the x264 one we looked at)
[18:43:25] <lu_zero> well it's vp8.1 material
[18:43:33] <BBB> I haven't looked at the sse4 one yet, and I'll ask questions from there on (ssse3 also)
[18:44:20] <BBB> Dark_Shikari: and sse2 same thing; I actually want to learn sse2, it's useful for me
[18:44:27] <BBB> ssse3 will have to wait until I have a new cpu
[18:45:12] <Dark_Shikari> sse2 is basically the same as mmx
[18:45:16] <Dark_Shikari> just bigger
[18:45:24] <Dark_Shikari> fyi, I am _NOT_ satisfied with the sse2 mc code
[18:45:40] <Dark_Shikari> way too much overhead
[18:45:57] <Dark_Shikari> also, the "shifting down" trick for your V mc does save time, but I think that if we unrolled it by 2x
[18:46:00] <Dark_Shikari> we could eliminate most of it
[18:46:07] <Dark_Shikari> shifting down == m1 -> m0, m2 -> m1, etc
[18:46:32] <Dark_Shikari> if you unroll by 2x, you can just shift up, then shift down, repeatedly.
[18:46:40] <Dark_Shikari> i.e. alternate register patterns
[18:46:47] <Dark_Shikari> But this comes later, not intending to do it now.
[18:47:45] <Dark_Shikari> and feel free to throw ideas at me -- I definitely did not think of every possible option.
[18:49:11] <BBB> I'll get brighter ideas once I get good at this :)
[18:49:21] <BBB> just need to actually do real work now ;)
[18:49:29] <Dark_Shikari> this is real work
[18:50:05] <BBB> I don't get paid to do this :-p
[18:50:07] <Dark_Shikari> btw, all that mc code is the result of what I'd call an "MC marathon"
[18:50:10] <Dark_Shikari> when you take a day and say
[18:50:14] <Dark_Shikari> "fuck it I'm writing all the MC code"
[18:50:17] <BBB> mru: does stefan gehrer have svn access? or should I apply for him?
[18:50:18] <Dark_Shikari> I did this for h264 too a while back =p
[18:50:49] <mru> he should
[18:55:02] <Honoome> mru: see, everything comes around.. last night the farmville guy made me want to kill a huge proportion of the human race; now fatelf makes me wish to kill even more
[18:55:33] <_av500_> fatelf still exists?
[18:55:34] <mru> fatelf still exits?
[18:55:37] <mru> +s
[18:55:54] <_av500_> :)
[18:56:42] <Honoome> it does... and Ryan Gordon or someone who likes him is still going on talking about it, it seems
[18:56:52] <Honoome> there's an article on LWN about a talk about Ryan Gordon's "failures"...
[18:57:09] <Honoome> and people still think that "it has its uses"
[18:57:25] <mru> well, he was mercilessly evicted from lkml
[18:58:09] <Honoome> sure... a file that will _only_ load on an OS with modified kernel and modified loader, which is _not_ going to be smaller than the sum of the files it would replace (and that would be usable on _any_ kernel and _any_ loader... with restrictions of course)...
[19:21:39] <CIA-99> ffmpeg: darkshikari * r23739 /trunk/libavcodec/x86/ (6 files):
[19:21:39] <CIA-99> ffmpeg: Update x264asm header files to latest versions.
[19:21:39] <CIA-99> ffmpeg: Modify the asm accordingly.
[19:21:39] <CIA-99> ffmpeg: GLOBAL is now no longer necessary for PIC-compliant loads.
[19:22:12] <Dark_Shikari> BBB: ^
[19:22:53] <mru> Dark_Shikari: extra funny, one of the vla[stride] in snow was unused
[19:22:58] <Dark_Shikari> lol
[19:23:07] <mru> there's tons of unused stuff there
[19:25:08] <j0sh_> lu_zero: Honoome: what bottlenecks feng? ram or network i/o?
[19:25:27] <Honoome> j0sh_: cpu.. we're mostly single-threaded for clients' handling
[19:25:45] <Honoome> plus we take a bad I/O hit from the DESCRIBE calls as we don't cache the video-on-demand results
[19:26:13] <j0sh_> you guys probe the media after each describe, right?
[19:27:03] <Dark_Shikari> well, lets see how many fate tests I broke
[19:27:15] <Dark_Shikari> how long does fate usually take to respond?
[19:29:32] <mru> grab the samples and run make fate yourself
[19:29:36] <Honoome> j0sh_: yeah, bloody slow :)
[19:29:52] <Honoome> mru: "make your own fate" :D
[19:30:01] <mru> :-)
[19:30:52] <CIA-99> ffmpeg: alexc * r23740 /trunk/libavcodec/ (8 files): aactab: Tablegenify ff_aac_pow2sf_tab.
[19:32:55] <CIA-99> ffmpeg: alexc * r23741 /trunk/libavcodec/Makefile: Fix alphabetization of the CONFIG_HARDCODED_TABLES Makefile section.
[19:43:18] <wbs> J_Darnley: I hope you noted someone replied to the libvorbis with >2 channels thread a few days ago, I don't think there's much left stopping getting that issue fixed
[19:45:52] <dgt84> how does one write e.g. a stats file in an encoder or filter if you can't use fprintf inside of libav*?
[19:46:25] <_av500_> open, write close?
[19:47:01] <mru> wrong answer
[19:47:07] <mru> right answer is you don't
[19:47:25] <mru> check how mpeg[124] does it
[19:49:05] <Dark_Shikari> something something api incompatibility something
[19:49:27] <mru> stop trolling
[19:49:29] <mru> you know it works
[19:49:32] <Dark_Shikari> oh, I know it works
[19:49:42] <Dark_Shikari> I'm just noting how it's incompatible with the apis of some other apps that do it the other way
[19:49:57] <Dark_Shikari> actually, wait, remind me.  lavc does it via writing to a pointer provided by the user, right?
[19:50:14] <mru> something like that
[19:50:16] <Dark_Shikari> how does it know the size of the memory available to write to?
[19:50:18] <mru> lavc does no file i/o
[19:58:26] <dgt84> looks like snprintf is the way to go according to libavcodec/ratecontrol.c
[19:59:34] <dgt84> then just write
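
Roughly the pattern in question, sketched from what ratecontrol.c does (buffer size and field contents here are illustrative): the encoder snprintf()s the per-frame statistics into a buffer it owns and publishes via avctx->stats_out, and the application, not libavcodec, writes that string to disk:

    #include <stdio.h>
    #include "libavcodec/avcodec.h"
    #include "libavutil/error.h"
    #include "libavutil/mem.h"

    static int put_frame_stats(AVCodecContext *avctx, int frame_num, int bits)
    {
        const size_t size = 256;
        if (!avctx->stats_out) {
            avctx->stats_out = av_mallocz(size);
            if (!avctx->stats_out)
                return AVERROR(ENOMEM);
        }
        snprintf(avctx->stats_out, size, "frame=%d bits=%d\n", frame_num, bits);
        return 0;
    }
    /* the application then does the file I/O itself, e.g.
     *     fprintf(statsfile, "%s", avctx->stats_out);
     */
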
[20:12:38] <J_Darnley> wbs: I forgot that!
[20:13:05] <J_Darnley> I should have dealt with that right away before reading the rest of my mails
[20:20:08] <Honoome> "a branch has negligible overhead if not followed" ... after claiming that fatelf would be useful only to custom vendors... vendors that exist almost entirely in embedded... where a 5% overhead in kernel size is going to make the supervisor's head explode...
[20:21:48] <_av500_> embedded does not need fatelf
[20:22:29] <Honoome> _av500_: nobody needs fatelf
[20:23:02] <Dark_Shikari> mmmmm.  fat elf.
[20:23:04] <Honoome> distributions don't need fatelf, they'd be at best reducing their package archives' size but paying a huge price in traffic (and that's _much_ worse as youtube demonstrates)
[20:23:56] <Honoome> hardware manufacturers and proprietary software producers don't need fatelf because they'd be forcing their customers to use _much_ newer systems, and that's a huge cost
[20:25:16] <Honoome> single developers won't make use of it at all
[20:25:34] <Honoome> and embedded vendors will laugh if you tell them to waste more storage space for that crap
[20:26:09] <microchip_> fat elf? fat dwarfs ftw! :p
[20:26:13] <mru> proprietary vendors force you to use rhel3 anyway
[20:26:37] <Dark_Shikari> hmm this is a good point
[20:26:38] <Dark_Shikari> if we have fat elf
[20:26:42] <Dark_Shikari> we need fat DWARF too.
[20:27:00] <Dark_Shikari> and maybe fat wizards and fat orcs.
[20:27:07] <microchip_> yep :D
[20:27:33] <Honoome> mru: they couldn't do that anymore :P
[20:28:48] <peloverde> Why are we talking about fatelf still. I thought we agreed that it doesn't solve anything
[20:29:16] <mru> peloverde: _everybody_ agreed on that
[20:29:33] <Honoome> peloverde: sorry I needed to vent, an idiot over identi.ca insists I'm being illogical at dissing it
[20:30:35] <mru> stay off such sites
[20:30:50] <peloverde> I gave up on all those social sites except linkedin and facebook
[20:31:18] <mru> I browse the topics on HN because they're fairly frequently interesting
[20:31:22] <mru> the stuff they link to
[20:31:39] <mru> and the discussions are usually a notch above reddit and the like
[20:31:53] <Honoome> mru: it links to my trouble with ruby, sometimes it's the only way I have to contact upstream :/
[20:32:15] <peloverde> have they considered a bug tracker or a mailing list?
[20:32:29] <Honoome> hahahah
[20:32:33] <mru> Honoome: and that's not a warning sign?
[20:32:41] <Honoome> half the released gems don't have a repository :/
[20:32:50] <Honoome> more than half don't have tarballs
[20:32:57] <Honoome> about a third don't have versioned tags
[20:33:16] <mru> what are you still doing with it?
[20:33:27] <Honoome> mru: it's definitely a sign that ruby and rails attracted a bunch of wannabes who shouldn't be allowed to write code for a living, just as I'm not allowed to sing for a living
[20:33:27] * peloverde never drank the ruby cool-aid
[20:33:56] <Honoome> mru: I like the language itself :/ — most of what I do, though, is standalone, such as ruby-elf
[20:34:09] <Honoome> but lately I'm still swamped in a work project I've already been paid to complete
[20:34:37] <peloverde> ruby is nice because the reference material comes bundled with porn
[20:34:45] <lu_zero> pfff
[20:34:51] <lu_zero> you meant rails
[20:35:00] <lu_zero> ruby is bundled with chunky bacon
[20:35:03] <lu_zero> and foxes
[20:41:36] <mru> if (!isnotcompressed) ...
[20:43:21] <lu_zero> uh?
[20:44:32] <CIA-99> ffmpeg: vitor * r23742 /trunk/libavcodec/mpegaudiodec.c: Remove pointless condition in #if
[20:45:06] <j0sh_> Honoome: but you have to admit, rails makes web dev pleasant. anything else is painful now
[20:45:32] <Honoome> j0sh_: trust me that if you look "under the hood", rails isn't less painful. at all
[20:46:27] <CIA-99> ffmpeg: vitor * r23743 /trunk/libavcodec/ (mpegaudiodec.c mpegaudiodec_float.c): Move float-specific function to mpegaudiodec_float.c
[20:46:32] <j0sh_> oh, i know. but when you keep the hood closed, it works well
[20:47:24] <Honoome> j0sh_: I can't do that :P
[21:01:49] <mru> Vitor1001: will you fix the warning about compute_antialias too?
[21:02:13] <Vitor1001> mru: Well, that's not really my fault.
[21:02:21] <Vitor1001> But yes, I can give a look
[21:02:21] <mru> no, but you're working on the code
[21:02:40] <mru> ifdef and/or move to _float.c
[21:03:24] <Vitor1001> I imagine. I'll try to give a look after I finish benchmarking mp3lib dct32 :p
[21:03:34] <mru> no rush
[21:03:44] <mru> at least there are no VLAs there
[21:04:17] <mru> how many patches do I need to send before someone replies?
[21:04:37] <Vitor1001> What is so bad about VLAs when they are not abused?
[21:04:44] <mru> any use is abuse
[21:04:59] <mru> they are unsafe and slow
[21:05:23] <Vitor1001> Why do they need to be unsafe and slow?
[21:05:43] <mru> what happens if the size is outrageous?
[21:05:46] <mru> you die
[21:05:49] <Vitor1001> Of course.
[21:05:55] <mru> no chance of recovery
[21:06:08] <Vitor1001> But suppose that you check somewhere that size < 128, for ex.
[21:06:19] <Vitor1001> And use int buf[size] everywhere.
[21:06:21] <mru> then you might as well allocate 128 unconditionally
[21:07:06] <mru> if the maximum size is acceptable, there's no reason to not always use it
[21:07:22] <mru> since again, you can't catch an error
[21:07:31] <mru> so there's nothing to be gained from trying to save a few bytes
[21:08:44] <Vitor1001> Hmm...
[21:08:50] <Vitor1001> No memory fragmentation?
[21:08:58] <mru> it's the bloody stack
[21:09:02] <Vitor1001> int tab[10][size];
[21:09:03] <mru> it doesn't fragment
[21:09:06] <mru> that's even worse
[21:09:18] <mru> such a vla is much slower than a fixed-sized one
[21:09:39] <mru> since when indexing, you must multiply by a variable
[21:10:36] <Vitor1001> that's a good point.
[21:10:53] <mru> also, gcc can't inline a function containing a vla
[21:10:58] <mru> and you lose one register
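
mru's point about the 2-D case, sketched side by side: with a checked upper bound the fixed-size form keeps the stack usage statically known, while the VLA form makes every tab[i][j] access multiply by a runtime value:

    #define MAX_SIZE 128

    void with_vla(int size)             /* assumes size <= MAX_SIZE was checked */
    {
        int buf[size];                  /* stack grows by a runtime amount */
        int tab[10][size];              /* tab[i][j] needs an i*size multiply */
        buf[0] = tab[0][0] = 0;
    }

    void with_fixed(int size)
    {
        int buf[MAX_SIZE];              /* bound known at compile time */
        int tab[10][MAX_SIZE];          /* constant row stride */
        (void)size;
        buf[0] = tab[0][0] = 0;
    }
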
[21:14:10] <peloverde> I wish C would relax what it considers a constant expression in regard to array sizing
[21:14:29] <mru> I used to say the same
[21:14:34] <mru> then I thought about it
[21:14:40] <mru> and realised it would be very hard to do
[21:21:15] <lu_zero> j0sh_ try turbogears2
[21:22:11] <lu_zero> if you get a growing dislike of rails' automagic/implicit/unexpected behaviour, tg2 usually is quite pleasant
[21:23:49] <CIA-99> ffmpeg: mru * r23744 /trunk/libavcodec/flacenc.c: flacenc: convert VLA to fixed size
[21:24:14] <mru> one down, many to go
[21:25:48] <lu_zero> post benchmarks too =P
[21:30:30] <Honoome> lu_zero: only if you can stand the fact that there is NO FRIGGING DOCUMENTATION :P
[21:30:51] <mru> Honoome: in that case, what are you doing here?
[21:31:15] <Honoome> mru: trust me, there's documentation here, compared to tg2
[21:32:55] <Honoome> lu_zero: I implemented FIONREAD/SIOCINQ for SCTP ...
[21:34:15] <pengvado> gcc can inline a function containing a vla, and the entirety of the calling function loses one register
[21:34:38] <lu_zero> \o/!
[21:35:26] <lu_zero> Honoome: tg2 has plenty of docs
[21:35:35] <lu_zero> scattered across 5-6 websites
[21:35:38] <lu_zero> but plenty
[21:35:39] <Honoome> in .py files
[21:36:00] <mru> pengvado: hmm, I've never seen it inline a vla
[21:36:11] <mru> and iirc I read somewhere it couldn't
[21:36:17] <mru> but I could have made that up
[21:36:44] <mru> empirically it clearly reduces inlining ability though
[21:36:45] <lu_zero> Honoome: tg2.1 actually has quite well-documented skels, jokes aside
[21:36:57] <Honoome> "finally"? :P
[21:37:54] <lu_zero> in the paster autogenerated stuff I mean
[21:38:01] <lu_zero> still
[21:38:47] <pengvado> you're right that gcc usually chooses not to inline vla functions unless you force it
[21:39:33] <mru> why did anyone ever thing they were a good idea?
[21:39:37] <mru> *think
[21:43:56] <Honoome> lu_zero: okay I sent the sctp patch to the two mailing lists, let's see if somebody responds to that
[21:44:07] <Honoome> if they do accept it, though, I'll have to work on the user code in feng :P
[21:46:19] <CIA-99> ffmpeg: stefang * r23745 /trunk/libavcodec/vp8.c: avoid conditional and division in chroma MV calculation
[23:45:38] <CIA-99> ffmpeg: mru * r23746 /trunk/libavcodec/snow.c: snow: remove unused parameter to mc_block()

