[FFmpeg-devel-irc] IRC log for 2010-06-18

irc at mansr.com
Sat Jun 19 02:00:59 CEST 2010


[05:15:58] <wbs> _av500_: in the original opencore, there sure is some arm-opts for amr, but they're not necessarily enabled for compilation in libopencore-amr
[05:20:43] <wbs> Yuvi: ping, is this what you meant? http://albin.abo.fi/~mstorsjo/0001-libvorbis-Only-drop-1-byte-packets-at-end-of-stream.patch
[05:55:15] <thresh> moroning
[05:58:15] <av500> \o/
[06:00:40] <thresh> yes, yes
[06:28:52] <av500> kshishkov: send patch to convert comments to .se?
[06:42:37] <kshishkov> jag vill gärna göra det ("I'd gladly do that")
[06:45:00] <av500> patchar välkomna ("patches welcome")
[06:45:10] <KotH> a wonderful good morning to those living in switzerland!
[06:45:38] <av500> so, mostly to german people...
[06:46:36] * KotH doesnt care much about the big canton in the north ;)
[06:46:55] <av500> i thought you are getting overcrowded with germans, no?
[06:47:28] <KotH> i didnt say i didnt care about the people coming to .ch to eat all my chocolate
[06:48:05] * elenril thinks mornings are overrated
[06:51:53] <benoit-> moin
[06:53:04] <benoit-> KotH: A spanish friend of mine told me she was going to throw away all the swiss chocolate she had :)
[06:55:10] <KotH> benoit-: must have gone bad in the climate there ;)
[06:55:36] <elenril> how can chocolate go bad?
[06:55:44] <elenril> chocolate is eternal
[06:56:35] <KotH> it can, believe me it can
[06:57:08] <KotH> (though you need to do some pretty bad ass things to it)
[07:26:42] <twnqx> shouldn't the topic be updated in both channels to 0.6 has been released?
[07:30:47] <KotH> details!
[07:32:32] * kshishkov wonders why development channel should care about releases
[07:37:45] <av500> the topic is too long anyway
[07:37:57] <av500> i would drop the release bit from -devel
[07:38:13] * KotH takes out his katana and slices the topic into mouth sized pieces
[07:38:31] * kshishkov would prefer "FFmpeg development channel. If you want to talk about anything else, you're unwelcome"
[07:38:47] <wbs> that's short and concise at least :-)
[07:39:05] <av500> kshishkov: damn, what about european railway systems?
[07:40:06] <kshishkov> av500: Ukrainian sucks, German sucks less and has comfortable expresses, Swedish is the most comfortable if not that fast
[07:40:07] <KotH> kshishkov: or strange languages?
[07:40:24] <KotH> kshishkov: or how $othersoftware sucks?
[07:40:25] <wbs> KotH: hey, swedish isn't that strange ;P
[07:40:26] <kshishkov> KotH: like Romansh?
[07:40:42] <KotH> kshishkov: rather like lojban ;)
[07:41:57] <kshishkov> wbs: Swedish is mostly German without complications and more pleasant sounding. And tendency not to make words too long
[07:46:35] * pJok makes kshishkov speak skånska
[07:46:37] <pJok> ;)
[07:55:44] <CIA-98> ffmpeg: cehoyos * r23641 /trunk/libavformat/spdif.c:
[07:55:44] <CIA-98> ffmpeg: Add IEC958 data_types for Atrac* and WMA Pro.
[07:55:44] <CIA-98> ffmpeg: Data-burst is described in IEC 61937-7 (Atrac) and IEC 61937-8 (WMA Pro).
[07:57:24] <merbzt> http://tranquillity.ath.cx/clang/2010-06-17-1/
[07:59:42] <andoma> does anyone know of a receiver that is capable of WMA over SPDIF?  or even AAC?
[07:59:57] <merbzt> pioneer
[08:00:03] <merbzt> had 3 versions
[08:00:18] <merbzt> I have a patch that was tested against one
[08:02:05] <merbzt> wmapro over spdif actually exists
[08:02:12] <merbzt> atrac not
[08:06:17] <kshishkov> are you sure?
[08:06:45] <kshishkov> maybe nobody just has that Sony equipment to verify if it exists in reality?
[08:07:12] <andoma> i'm a bit puzzled that most receivers do not support AAC ..
[08:07:38] <andoma> AFAIK it will be more and more common in HDTV broadcasts
[08:08:16] <av500> chicken/egg
[08:08:44] <merbzt> maybe cos the encoded stream is broken
[08:09:00] <merbzt> er multichannel aac is a mess
[08:10:30] <kshishkov> why? it's all nicely sorted out - many channels or channel pairs, all with the same ID if you're lucky
[08:10:47] <merbzt> :(
[08:12:16] <merbzt> it's a nice mess
[08:12:30] * kshishkov dabbled a bit in AAC
[08:12:45] <merbzt> you don't say ...
[09:37:59] <lu_zero> Fabio is here asking about the website template
[09:38:09] <lu_zero> which one you like best?
[09:38:23] <kshishkov> what, we'll get website template in addition to nice pictures?
[09:43:20] <mru> Honoome: ping
[09:43:35] <mru> or lu_zero
[09:43:51] <mru> do either of you know anything about the alsa ebuilds in gentoo?
[09:52:48] <DonDiego> bye
[09:59:17] <lu_zero> mru: hi
[09:59:25] <lu_zero> Honoome: should know a lot since he started them
[09:59:33] <lu_zero> what's up?
[09:59:51] <mru> what's the purpose of the ALSA_PCM_PLUGINS setting?
[09:59:57] <mru> if I disable anything it refuses to run
[10:00:10] <lu_zero> anything?
[10:00:13] <lu_zero> it shouldn't
[10:00:25] <mru> it spits errors about missing symbols
[10:00:56] <benoit->  /topic welcome to #gentoo-users :D
[10:01:02] <mru> I neither need nor want most of those
[10:01:38] <lu_zero> give me more details
[10:01:43] <lu_zero> I could try myself now
[10:02:31] <spaam> mru: do you have problems with gentooo? :)
[10:02:38] <mru> spaam: no, with alsa
[10:03:11] <ohsix> mru: you need to audit the configs it comes with if you're gonna be ditching some modules; a lot are used implicitly to give nice user labels for device names and functionality
[10:03:27] <mru> ohsix: stay out of this
[10:03:40] <ohsix> just saying; they'll go out of sync
[10:03:41] <lu_zero> spaam: alsa is a pain
[10:04:23] <spaam> lu_zero: it did work good for me back in 2004-2007 :)
[10:04:47] <lu_zero> spaam: once you do not start messing with hda, pulse and try to thin it down
[10:05:39] <spaam> ok. does ubuntu have this problem also with pulse? :)
[10:06:41] <mru> ALSA lib dlmisc.c:118:(snd_dlsym_verify) unable to verify version for symbol _snd_pcm_empty_open
[10:06:44] <mru> ALSA lib pcm.c:2175:(snd_pcm_open_conf) symbol _snd_pcm_empty_open is not defined inside [builtin]
[10:06:47] <mru> Playback open error: -6,No such device or address
[10:07:25] <lu_zero> spaam: ubuntu had and has problems with pulse
[10:07:29] <lu_zero> some self-inflicted
[10:07:36] <mru> speaker-test: pcm_plug.c:67: snd_pcm_plug_close: Assertion `plug->gen.slave == plug->req_slave' failed.
[10:07:40] <mru> Aborted (core dumped)
[10:07:46] <ohsix> they fixed most of the egregious things in 10.04
[10:07:48] <lu_zero> some due to lennart's ideas
[10:08:21] <ohsix> now they don't disable flat volumes and rtkit is there, and they pick patches from the stable tree like they're supposed to
[10:08:57] <mru> "pick patches ... like they're supposed to" <-- something's not right there
[10:09:12] <ohsix> was someone having a problem with pulse? i didn't follow what lu/spaam were saying
[10:09:28] <mru> ohsix: no, alsa refuses to work with useless plugins disabled
[10:09:31] <lu_zero> http://bugs.gentoo.org/show_bug.cgi?id=186365 <- mru that's related
[10:09:42] <ohsix> then why did spaam mention pulse
[10:09:43] * lu_zero is discussing the website with fabio in the meantime
[10:09:50] <mru> I only care about plain pcm playback
[10:10:08] <mru> _maybe_ with plug for the odd case that doesn't work otherwise
[10:10:17] <ohsix> also; the default configs use almost all of those useless plugins, speaker-test isn't going to work without the surround* and front, side, what have you names; those are in the config and slave pcms and stuff
[10:10:34] <mru> why is the default so fucked up then?
[10:10:54] <mru> it does in fact run with -Dhw:0 -c2
[10:11:00] <mru> apparently the hw doesn't like mono
[10:11:06] <mru> or alsa thinks it doesn't
[10:11:15] <lu_zero> sigh
[10:11:21] <lu_zero> alsa is overly complex
[10:11:26] <mru> no kidding
[10:11:30] <av500> :)
[10:11:36] <lu_zero> pulse is trying to hide those parts
[10:11:42] <mru> failing badly
[10:11:45] <lu_zero> (adding complexity and brain damage)
[10:11:50] <spaam> better to use oss? :)
[10:11:51] <av500> elephant hiding behind tree?
[10:11:56] <lu_zero> av500: no
[10:12:07] <lu_zero> hiding a forest behind an elephant
[10:12:14] <lu_zero> a pink one obviously
[10:12:17] <ohsix> pulse isn't hiding anything; just giving a uniform ui for picking devices and where streams should go, in light of devices coming and going
[10:12:26] <av500> isnt alsa doing that?
[10:12:52] <mru> that's what ohsix said yesterday
[10:13:00] <mru> but he's been known to be inconsistent before
[10:13:09] <ohsix> har har
[10:15:04] <ohsix> well i eat my own dogfood, and i don't have any problems with doing so, being able to play stuff at the same time without resorting to dmix is pretty gr8; and apps don't break with fragile fragment/buffer sizes when the runtime circumstances of the computer changes, and it minimizes latency \m/
[10:15:52] <mru> I'd like to be absolutely certain dmix is never used
[10:15:59] <mru> it's crap and I don't trust it
[10:16:11] <mru> adding ipc to random apps is never a good idea
[10:16:17] <lu_zero> mru: which device are you using?
[10:16:47] <mru> hda-intel hardware
[10:16:53] <ohsix> the configs are set up to use dmix if the device is single stream; default goes to dmix which slaves plug:hw
[10:16:54] <mru> I'd like to be able to use plughw in alsa
[10:17:02] <mru> for the rare cases when something isn't supported
[10:17:15] <av500> how do I make <random_sw> use pa?
[10:17:15] <mru> ohsix: AND I DON'T WANT THAT
[10:17:33] <ohsix> and thats fine; you'll just have to erase all the upstream configuration files heh
[10:17:39] <ohsix> i'm just telling you how it is
[10:17:57] <mru> alsa configs are worse than sendmail
[10:18:18] <kshishkov> but less flexible
[10:18:21] <ohsix> at least they don't hit m4 :]
[10:18:25] <ohsix> less?
[10:19:02] <mru> m4 is not required for sendmail
[10:19:06] <mru> I've configured it without
[10:19:21] <ohsix> i know, its just an old canard; it was a joke
[10:19:29] <spaam> mru: you dont use sendmail ? :O
[10:19:39] <mru> sendmail syntax isn't all that complex, just cryptic
[10:19:44] <mru> spaam: not anymore
[10:19:49] <mru> I did a long, long time ago
[10:20:20] <spaam> why did you change? :)
[10:20:29] <ohsix> i always figured they'd use m4 so they could insulate possibly old configs from internal changes
[10:20:31] <mru> postfix is nicer
[10:20:59] <mru> m4 is just a way to provide common templates
[10:21:16] <ohsix> mru: the default configs are quite extensive, they're in /usr/share/alsa
[10:21:25] <mru> I know where they are
[10:21:45] <mru> oh screw this
[10:21:47] <ohsix> they do things like add softvol to devices with no hw attenuators at the right places too
[10:21:49] <mru> I've better things to do
[10:21:57] <mru> I don't need softvol
[10:22:03] <lu_zero> once Honoome will appear he will be able to help better
[10:22:16] <mru> it's just a laptop anyway
[10:22:22] <KotH> mru, ohsix: are you at it again?
[10:22:30] <mru> although I'd like to configure the desktop sanely too
[10:22:34] * KotH blames spaam 
[10:22:39] <ohsix> i know, just saying, there is a _ton_ of real work they put into those configs to normalize a lot of hardware
[10:22:49] <mru> ohsix: shut up please
[10:23:06] <spaam> KotH: noo. go back and sleep :O
[10:23:17] <ohsix> software is written against those labels too; so random stuff will break if you kneecap it
[10:23:31] <mru> I write my own software
[10:23:33] <ohsix> not trying to dissuade you or anything; don't get mad
[10:24:35] <siretart> SCNR: http://www.osscc.net/en/licenses.html#compatibility
[10:25:22] <ohsix> KotH: nah i have the luxury of something to do today :] (but a toothache!)
[10:25:44] <mru> ohsix: see, that's what you get when you troll too much
[10:25:59] <ohsix> nah i've had the tooth ache longer
[10:26:04] <mru> you've been trolling a long time
[10:26:22] <av500> siretart: yes, one guy at linuxtag hinted us that this is to come
[10:26:25] <ohsix> nah
[10:26:30] <mru> clarification: too much for your skill level
[10:26:45] <ohsix> i was excited to see you poking at alsa to see what you can be satisfied with
[10:27:30] <mru> I'd be satisfied if it all went away
[10:27:49] <mru> I'd be happy if it did so with a huge boom
[10:27:52] <siretart> av500: I hope that guy wasn't shily himself!
[10:27:55] <ohsix> the matter of which plugins you use isn't about which ones you build, it's about what the configuration lumps together for whatever the software was doing
[10:28:41] <mru> if all I'm trying to do is play a single pcm stream the hw can handle directly, why doesn't it just do that?
[10:28:44] <ohsix> i think you even need a plugin to play on a single channel when you can only open a device in stereo or more channels, too; dunno the name of it though (might be plughw)
[10:28:53] <ohsix> it does do that,  if you open hw:
[10:29:10] <mru> not by default
[10:29:21] <mru> by default it pulls in all manner of madness
[10:29:24] <ohsix> and if you open default: like most software should, it might get ornery with dmix and whatnot, trying to expose a uniform interface to the software
[10:29:43] <mru> that's why I disabled dmix and other junk
[10:29:49] <ohsix> ya, it does; software that opens hw: is considered specialized, rarified even
[10:29:52] <av500> siretart: no, some debian guy
[10:30:01] <av500> siretart: scnr: http://imagebin.ca/view/g2DJYP.html
[10:30:01] <mru> av500: same thing, different colour
[10:30:24] <siretart> av500: LOL
[10:30:38] <ohsix> the "device" can even have plugins and stuff in it too; its kind of ugly (like with speaker-test, -Dfile,butt.wav will slave in the file pcm and write it to the parameter)
[10:31:12] <mru> the default should do the best it can with the available plugins
[10:31:29] <ohsix> if you want no bullshit though, open hw:; if software is going to work with other software at the same time and make use of some default labels (left, right, center; front, that sort of stuff) use default: or that label
[10:31:31] <mru> not dump core the moment something is missing
[10:31:56] <ohsix> well the configuration isn't conditional on plugins present; they have a small set that are essential and they write the configs to it
[10:32:11] <ohsix> missing a slave plugin _is_ akin to a null pointer deref when you try and connect to it
[10:32:16] <mru> dmix sure as hell ain't essential
[10:32:31] <mru> and dumping core is always evil
[10:32:58] <ohsix> nope; unless a user expects to play stuff at the same time, which a lot do, even though dmix is awful and should never need to exist (alsa devs concur)
[10:33:12] <elenril> what, yet another alsa flame?
[10:33:31] <mru> if dmix isn't enabled, the first to open the device should get it, later ones would fail gracefully
[10:33:45] <mru> now even the _first_ one dumps core
[10:33:48] <ohsix> i h8 dmix; you can't even pick parameters that would work without fail or without huge latency, its very not real time :[
[10:34:00] <mru> I don't know if that's due to dmix or something else
[10:34:11] <ohsix> mru: dmix isn't "disabled" by not building it though, the config files still slave it
[10:34:23] <mru> and that's the flaw
[10:34:33] <lu_zero> indeed...
[10:34:51] <ohsix> you should be able to move pcm/dmix.conf out of the way and it will work as if dmix wasn't there
[10:35:16] <ohsix> its not really a flaw; the configs go with the software, they're for integrators or not to be touched :<
[10:35:51] <mru> why do we need some mystical integrator to fix everything?
[10:36:00] <ohsix> you don't have one without the other, and you don't have all the fancy labels that software uses without them at all, you just have hw: (which you could open in software regardless of their presence)
[10:36:01] <mru> can't it just be made properly to begin with?
[10:36:10] <ohsix> they don't, and it is properly "made" from upstream
[10:36:21] <mru> and I don't need any "fancy labels"
[10:36:28] <ohsix> they do if they want the default behaviour to be different
[10:36:48] <ohsix> you may not need it but software already uses them, your software might not, but they aren't for your software
[10:37:10] <mru> I want to build the bare minimum that works with *my* software
[10:37:19] <mru> my software opens whatever I tell it to
[10:37:31] <ohsix> then just open hw:, speaker-test uses the labels
[10:38:03] <ohsix> some plughw stuff is used implicitly but it shouldn't be crashing in your own software if you're using hw: and none of the extra configuration labels
[10:38:10] <mru> but drop this, I have more important things to do
[10:38:28] <mru> speaker-test should _never_ dump core
[10:38:31] <mru> but it does
[10:38:38] <ohsix> tell the alsa developers
[10:38:45] <mru> I doubt they care
[10:39:27] <ohsix> if you remove software it depends on what do you expect? (they're slaved in asound based on the config, it could check the functors beyond just warning about them; but you'd need to contribute that, or tell the alsa developers to add it)
[10:39:56] <mru> I shouldn't need to tell them that software shouldn't dump core
[10:40:01] <ohsix> if they feel a normalized config comes with all the plugins enabled i doubt they're going to cater to each one of them possibly being not present
[10:40:04] <mru> what are they, java coders?
[10:40:31] <ohsix> thats a canard :[ it shouldn't dump core, but it also shouldn't be run in an incomplete manner; its not top trumps
[10:40:35] <mru> any configuration that can be built must run without crashing
[10:40:46] <mru> failing with a sensible error message is of course ok
[10:40:56] <mru> or even failing with a weird error message
[10:41:01] <mru> BUT NOT DUMPING CORE
[10:41:06] <ohsix> can it be built? you were setting an internal variable weren't you?
[10:41:18] <mru> of course it can be built
[10:41:23] <mru> how the fuck did you think I got it?
[10:41:40] <mru> just some --disable flags to configure
[10:42:05] <ohsix> well thats nice, you still have the matter of the config files
[10:42:22] <mru> which are flawed
[10:42:37] <ohsix> how? not working how you expect doesn't mean its inherently flawed
[10:43:10] <kshishkov> but not working at all does
[10:43:28] <ohsix> i can jam a screwdriver into my motherboard but that doesn't mean i'm not the one responsible for doing something silly
[10:44:03] <ohsix> well we're at an impasse again; you think its dumb or wrong but won't effect any change regarding it
[10:44:08] <mru> but if changing a bios setting makes it blow up in flames, I'd call it a flaw
[10:44:34] <ohsix> flames are relative; if i change the voltage on my cpu it isn't going to be happy, but its right there in the bios
[10:44:52] <ohsix> i get your point though
[10:45:11] <mru> if a particular configuration can never work, the build system shouldn't offer it
[10:45:14] <mru> simple as that
[10:45:35] <ohsix> well what you're building does work; but it is not complete without the kernel, the configs, and the software using it
[10:45:44] <mru> it's as if ffmpeg dumped core if built without the bink decoder
[10:45:52] <mru> even if only decoding mpeg2
[10:45:57] <ohsix> not really
[10:46:18] <elenril> \o/
[10:46:27] <mru> I'll remove the ban in a few hours
[10:47:00] <elenril> what do you use for mixing then if not dmix?
[10:47:05] <mru> I don't
[10:47:08] <mru> I play one thing at a time
[10:47:19] <elenril> :/
[10:47:57] <kshishkov> and silly sound notifications from programs are better to be turned off anyway
[10:48:32] * elenril doesn't use silly notifications
[10:48:43] <elenril> but e.g. flash likes to grab the soundcard for itself
[10:48:51] <elenril> inb4 don't use flash then
[10:49:29] <kshishkov> that's obvious
[10:50:01] * thresh uses flash
[10:51:49] * kshishkov takes a pity on thresh
[10:52:08] <av500> in .ru flash uses you
[10:52:22] <mru> I thought that was everywhere
[10:52:47] <kshishkov> mru: maybe except Adobe HQ
[11:31:54] <wbs> kshishkov: I have a patch for your reviewal. :-)
[11:32:46] <kshishkov> ok
[11:40:16] <lu_zero> mru: the bink decoder is a key component of ffmpeg
[11:40:43] <lu_zero> everybody wants to play any kind of video by transcoding to bink and then playing it
[11:41:29] <av500> i would vote to make that default
[11:43:52] <kshishkov> not on my watch
[11:44:04] <lu_zero> instead of vp8?
[11:44:07] * kshishkov dislikes Bink DCT
[11:44:30] * Compn is the last person waiting for vivo support
[11:44:36] <Compn> ehe
[11:47:03] <kshishkov> Compn: troll mru to get it supported or be trolled out of that stupid idea
[11:58:16] <thresh> OT: is there a way to force youtube not to recode your HD video ?
[11:58:29] <thresh> other than buying 50%+1 of its stock
[11:58:43] <mru> bribe someone
[12:01:32] <KotH> thresh: at least it's being reencoded with ffmpeg
[12:02:03] <Compn> i cant even find a contact at youtube to ask about potential samples
[12:02:23] <Compn> my google contacts havent been having good luck communicating with them either :\
[12:03:03] * KotH isnt surprised
[12:03:53] <CIA-98> ffmpeg: mstorsjo * r23642 /trunk/libavformat/rtmpproto.c:
[12:03:53] <CIA-98> ffmpeg: RTMP: Return from rtmp_read as soon as some data is available
[12:03:53] <CIA-98> ffmpeg: Earlier, the function only returned when the enough data to fill the
[12:03:53] <CIA-98> ffmpeg: requested buffer was available. This lead to high latency when receiving
[12:03:53] <CIA-98> ffmpeg: low-bandwidth streams.
[12:04:50] <thresh> someone should inject a bytecode sequence that will trigger ffmpeg to do vcodec/acodec copy, and then produce videos with that sequence
[12:08:15] <av500> \\\ooo///
[12:10:39] <av500> thresh: otoh, not recoding means to output potentially malicious stream 1:1 to millions of users...
[12:10:56] <mru> yay!
[12:11:01] <thresh> av500: so, win-win.
[12:11:08] <mru> ffbotnet
[12:11:11] <thresh> ffmpeg world domination task accomplished.
[12:11:45] <av500> mru: for next LT we should put tiny parts of BBB rendering into each ffmpeg run...
[12:11:59] <mru> hehe
[12:48:32] <av500> gee, Koleszar found a new use case for the "invisible" bit....
[12:48:53] <av500> MKV edit lists
[12:49:04] <av500> mark all frames outside of range as invisible...
[12:54:29] <lu_zero> uhm?
[12:54:47] * lu_zero wonders what's the exact problem there
[12:55:12] <lu_zero> still I like the ARF name
[13:43:50] <Tjoppen> whoa. ffplay seeks to a percentage of where along the width of the window you press? never noticed that before
[13:46:44] <kshishkov> it does so since the beginning, I think
[13:54:20] <av500> Tjoppen: yeah, I found that out only recently too
[13:54:28] <Tjoppen> ok. I always wondered what the logic behind its seeking was
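The click-to-seek behaviour Tjoppen describes boils down to mapping the click's x coordinate to a fraction of the stream duration. The following is a minimal sketch of that idea, not ffplay's actual code: the function name, its parameters and the clamping are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a mouse click at horizontal position x
 * (in pixels, 0 .. window_width) to a seek target.
 * start and duration share one time unit, e.g. microseconds. */
static int64_t click_to_seek_target(int x, int window_width,
                                    int64_t start, int64_t duration)
{
    assert(window_width > 0);
    /* clamp clicks that land outside the window */
    if (x < 0)
        x = 0;
    if (x > window_width)
        x = window_width;
    /* fraction of the window width, applied to the stream duration */
    return start + (int64_t)((double)x / window_width * (double)duration);
}
```

Clicking in the middle of a 640 pixel wide window on a 60 second stream would, under these assumptions, land at the 30 second mark.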
[14:00:17] <mru> you guys don't read the source?
[14:00:34] <mru> shocking
[14:16:05] <Tjoppen> hehe :)
[14:16:16] <Tjoppen> hemma \o/
[14:43:58] <BBB> Dark_Shikari: is pshufw particularly slow?
[14:48:08] <BBB> hm, I guess I mismeasured
[14:48:18] <BBB> I have another 10% speedup by doing crazy pshufw magic
[14:48:28] <BBB> also useful for the sixtap
[14:57:26] <lu_zero> wonderful
[14:57:36] <lu_zero> somebody wrote me in chinese about feng
[14:57:53] <lu_zero> I just managed to take the text and thunderbird ate the email...
[15:03:12] <wbs> BBB: time to give some opinion on the rtsp/http tunnel auth thread? mainly, is it ok for ff_url_join() to behave the same if auth is NULL and auth is ""?
[15:04:57] <lu_zero> wbs: uh?
[15:05:04] <lu_zero> why that?
[15:05:31] <wbs> otherwise we'd have to do ff_url_join(... auth[0] ? auth : NULL, ...) in the rtsp code
[15:05:54] <wbs> skipping it perhaps would be ok, too, but then we'd pass a http://@server/ url to the http protocol, which looks funny
[15:06:35] <lu_zero> looks like I'm missing something
[15:07:20] <lu_zero> are you sure it makes sense to put the auth in the http tunnel?
[15:07:46] <wbs> yes, I've been testing it with the private urls that stas oskin provided me with
[15:08:01] <wbs> the http protocol doesn't do anything with the auth unless the server actually responds with 403
[15:08:35] <lu_zero> so we have to move the auth stuff back and forth rtsp and http
[15:08:43] <wbs> umm, no
[15:09:00] <wbs> if the user specified auth for the rtsp url, we're not sure if it will be needed at the rtsp level or on the http tunnel level
[15:09:05] <wbs> so we just add it to the http tunnel urls
[15:09:29] <lu_zero> ok
[15:09:37] <wbs> then _if_ the http tunnel requests get a 403, the http protocol handler will retry using the auth credentials found in the url
[15:10:05] <wbs> likewise, if any of the rtsp requests (tunneled or not) get a 403, we retry using the auth that we were provided. if not, we never send the auth credentials out
[15:10:48] <lu_zero> ok
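The retry scheme wbs outlines (send nothing first, use the credentials only once the server asks for them) can be sketched roughly as below. This is an illustration under assumptions, not the FFmpeg code: request_with_lazy_auth, send_request_fn and demo_server are hypothetical names, and the status code that counts as a challenge is passed in as a parameter. The log mentions 403; the demo server here answers with 401, the usual HTTP "Unauthorized" code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request callback: returns an HTTP-style status code.
 * auth is NULL when no credentials are sent with the request. */
typedef int (*send_request_fn)(const char *auth, void *opaque);

/* Send without credentials first; retry once with them only if the
 * server challenges us and the user actually supplied some. */
static int request_with_lazy_auth(send_request_fn do_request, void *opaque,
                                  const char *auth, int challenge_status)
{
    int status;

    assert(do_request != NULL);
    status = do_request(NULL, opaque);
    if (status == challenge_status && auth != NULL && auth[0] != '\0')
        status = do_request(auth, opaque);
    return status;
}

/* Stand-in server for the example: counts requests in *opaque and
 * accepts only the made-up credentials "user:pass". */
static int demo_server(const char *auth, void *opaque)
{
    int *calls = opaque;

    ++*calls;
    if (auth != NULL && strcmp(auth, "user:pass") == 0)
        return 200;
    return 401;
}
```

With this shape the credentials never leave the client unless a request was actually rejected, which matches the "if not, we never send the auth credentials out" remark.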
[15:11:19] <wbs> ..., so, if no rtsp auth was provided, the auth[42] buffer in the rtsp code will be just an empty string
[15:12:52] <lu_zero> hi BBB
[15:13:04] <wbs> so when creating the http tunnel url, we'd pass this auth buffer to ff_url_join(), but either add code to ff_url_join() to omit the auth part if the string is a non-null, but empty string. or make the ff_url_join() call contain ..., auth[0] ? auth : NULL, ...
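The first option wbs describes, treating an empty auth string like a missing one inside the join helper, might look like the sketch below. url_join_sketch is a hypothetical, simplified stand-in written for this example; it is not ff_url_join() and does not share its signature.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified URL join: emits "auth@" only when auth is
 * neither NULL nor the empty string, so both cases behave the same.
 * Returns what snprintf returns (length the full URL needs). */
static int url_join_sketch(char *buf, size_t size, const char *proto,
                           const char *auth, const char *host,
                           const char *path)
{
    assert(buf != NULL && size > 0);
    assert(proto != NULL && host != NULL && path != NULL);

    if (auth != NULL && auth[0] != '\0')
        return snprintf(buf, size, "%s://%s@%s%s", proto, auth, host, path);
    return snprintf(buf, size, "%s://%s%s", proto, host, path);
}
```

With the check inside the helper, a caller holding an empty auth buffer can pass it straight through and still get http://server/ rather than http://@server/.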
[15:13:06] <BBB> hello
[15:24:16] <BBB> my sixtap is bitexact for half of my samples, but not for the other half
[15:24:20] <BBB> that's a little frustrating
[15:33:21] <BBB> ah, found it
[15:34:25] <BBB> Dark_Shikari: what do you think of http://ffmpeg.pastebin.com/eGkAPF8R as 6tap filter? it's 4x as fast as C and about 5 less instructions (plus 3 instead of 6 memory accesses inside the loop) compared to the one from libvpx
[15:35:33] <BBB> I'm still wondering if I can prevent the memory access to [ff_pw_64] by saving a reg somehow... ideas welcome :)
[16:01:46] <lu_zero> ff_pw_64?
[16:05:30] <lu_zero> libavcodec/x86/dsputil_mmx.c:DECLARE_ALIGNED(16, const xmm_reg,  ff_pw_64 ) = {0x0040004000400040ULL, 0x0040004000400040ULL};
[16:05:33] <lu_zero> ok
[16:05:33] <lu_zero> uhm
[16:05:50] <lu_zero> I guess you have just 8 regs
[16:05:54] <BBB> I'm doing crazy stuff and I know little about it ;)
[16:06:22] <BBB> I want 1 zero reg, 3 filter constant regs and one reg for the ff_pw_64, so I have 3 regs to calculate, I don't think that's enough
[16:06:35] <BBB> so I free one by accessing ff_pw_64 directly, but that's likely slower
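For readers following the register discussion, here is a scalar C sketch of what a horizontal six-tap filter computes per pixel. It assumes VP8-style taps that sum to 128, so the rounding term 64 (the value held in each 16-bit lane of ff_pw_64) is added before a shift by 7. The function names are made up for this example, and this is not the code from the pastebin link.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference: filter one row of `width` pixels.
 * src needs 2 valid pixels to the left of the row and 3 to the right.
 * taps are assumed to sum to 128, so +64 then >>7 rounds to nearest. */
static void sixtap_h_row(uint8_t *dst, const uint8_t *src, int width,
                         const int taps[6])
{
    assert(width > 0);
    for (int x = 0; x < width; x++) {
        int sum = 64;                    /* rounding term, cf. ff_pw_64 */
        for (int k = 0; k < 6; k++)
            sum += taps[k] * src[x + k - 2];
        if (sum < 0)                     /* avoid shifting a negative value */
            sum = 0;
        dst[x] = clip_uint8(sum >> 7);
    }
}
```

The SIMD version does the same arithmetic on several pixels at once, which is why it wants registers for zero, the filter constants and the rounding term all at the same time.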
[16:07:14] <lu_zero> isn't xmm0 0 by default in the operations where you need a 0?
[16:07:22] * lu_zero wonders since he doesn't know
[16:07:35] * lu_zero points that altivec has some more regs available...
[16:07:36] <lu_zero> hmm
[16:07:40] <BBB> I use mm%d regs, not xmm%d
[16:07:48] <BBB> it's mmx only, for now
[16:07:52] <lu_zero> mm0 then
[16:07:52] <BBB> I learn baby-steps
[16:08:01] <BBB> I use mm0 for calculations :-p
[16:08:38] <Honoome> lu_zero: yeah ppc has more regs, but its code is encrypted by default
[16:09:04] <lu_zero> Honoome: pff
[16:09:37] <lu_zero> http://ffmpeg.pastebin.com/eGkAPF8R <- that wouldn't be any harder to read translated in vmx+ppc asm
[16:10:21] <Honoome> lu_zero: depends.. I prefer "one symbol one meaning" to "a combination of two to four symbols half a meaning"
[16:10:28] <Vitor1001> BBB: I'm a really asm noob, but is mm7 always 0?
[16:10:46] <BBB> Vitor1001: I pxor it in the beginning, and then use it as my zero constant
[16:10:56] <BBB> I need that for unsigned byte->word conversions by abusing punpck
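The zero-register trick BBB mentions works because punpcklbw interleaves the low four bytes of its two operands; when the second operand is all zeros, each source byte ends up zero-extended into a 16-bit word. A portable scalar emulation on a 64-bit value (the function name is made up, and this only models the instruction rather than using it) could be:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of MMX punpcklbw: interleave the low 4 bytes of dst
 * and src as d0,s0,d1,s1,d2,s2,d3,s3 (least significant byte first). */
static uint64_t punpcklbw_emul(uint64_t dst, uint64_t src)
{
    uint64_t out = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t d = (dst >> (8 * i)) & 0xff;
        uint64_t s = (src >> (8 * i)) & 0xff;

        out |= d << (16 * i);        /* even byte of word i */
        out |= s << (16 * i + 8);    /* odd byte of word i  */
    }
    return out;
}
```

Passing 0 as the second operand turns the bytes 01 02 03 04 into the words 0001 0002 0003 0004, which is the unsigned byte to word conversion the filter needs before multiplying.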
[16:11:18] <BBB> and I'm probably more asm n00b than you :-p
[16:11:21] <Vitor1001> I understand, but maybe you can somehow trade an extra register for an extra xor.
[16:11:37] <Vitor1001> I mean, you use mm7 as a temp reg and clear it afterwards
[16:11:45] <BBB> got it, that might work
[16:14:49] <BBB> would that be faster?
[16:15:04] <BBB> one extra pxor, one memory access replaced by a register access
[16:15:25] <BBB> my measurements have a lot of noise so it's hard to show convincingly
[16:15:57] <lu_zero> BBB: uhm
[16:16:07] <lu_zero> check which is the average latency of a load
[16:16:31] <lu_zero> and check how many arith ops you issue at the same time
[16:16:31] <BBB> I don't even know what that means
[16:16:52] <lu_zero> if you have a load it will take X cycles before it is ready
[16:17:12] <lu_zero> but you can do something while it is loading
[16:17:16] <lu_zero> e.g
[16:17:28] <lu_zero> load r1 memory
[16:17:38] <lu_zero> add r2 r3 r4
[16:17:56] <lu_zero> mul r4 r5 r6
[16:18:41] <lu_zero> add r7 r1 2
[16:19:01] <lu_zero> if the load takes less than the time of your add and the mul
[16:19:01] <Vitor1001> BBB: Even with START_TIMER / STOP_TIMER() you have a lot of noise?
[16:19:14] <BBB> a little, like 5% or so between runs
[16:19:15] <BBB> ues
[16:19:16] <BBB> yes
[16:19:25] <BBB> maybe it's because I'm watching worldcup matches at the same time :-p
[16:19:29] <Vitor1001> ;)
[16:19:36] <lu_zero> your load won't cost as much as having the cpu waiting for the r1 value
[16:19:56] <lu_zero> obviously you have to do the same for every operation
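lu_zero's point about hiding load latency can be made concrete with a toy timing model. The sketch below assumes a strictly in-order, single-issue machine and made-up latencies (3 cycles for the load, 1 for everything else); real CPUs are far more complicated, so treat the numbers as illustrative only. All names are hypothetical.

```c
#include <assert.h>

#define NUM_REGS 8

/* One instruction of the toy model: writes dst, reads src1 and src2
 * (-1 means "no register"); the result is usable `latency` cycles
 * after the instruction issues. */
struct insn {
    int dst, src1, src2, latency;
};

/* Returns the cycle in which the last instruction issues. An
 * instruction issues one cycle after the previous one at the
 * earliest, and not before its source registers are ready. */
static int last_issue_cycle(const struct insn *prog, int n)
{
    int ready[NUM_REGS] = {0};   /* cycle at which each register is valid */
    int cycle = -1;              /* issue cycle of the previous instruction */

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int issue = cycle + 1;

        if (prog[i].src1 >= 0 && ready[prog[i].src1] > issue)
            issue = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > issue)
            issue = ready[prog[i].src2];
        ready[prog[i].dst] = issue + prog[i].latency;
        cycle = issue;
    }
    return cycle;
}

/* The order from the log: independent work sits between the load
 * and the instruction that consumes r1. */
static const struct insn interleaved[] = {
    {1, -1, -1, 3},   /* load r1, [memory] */
    {2,  3,  4, 1},   /* add  r2, r3, r4   */
    {4,  5,  6, 1},   /* mul  r4, r5, r6   */
    {7,  1, -1, 1},   /* add  r7, r1, 2    */
};

/* Same work, but the consumer of r1 directly follows the load. */
static const struct insn back_to_back[] = {
    {1, -1, -1, 3},   /* load r1, [memory] */
    {7,  1, -1, 1},   /* add  r7, r1, 2    */
    {2,  3,  4, 1},   /* add  r2, r3, r4   */
    {4,  5,  6, 1},   /* mul  r4, r5, r6   */
};
```

In this model the interleaved order issues its last instruction in cycle 3, while the back-to-back order stalls on r1 and only gets there in cycle 5.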
[16:20:17] <BBB> lu_zero: I'll play with instruction order, I'm hoping to get rid of all loads though
[16:20:51] <lu_zero> there are profilers helping spotting stalls
[16:27:52] <BBB> Vitor1001: I tried, it works, but it's really about the same speed...
[16:29:24] <BBB> I moved a load and now it's a lot faster (from ~2850 to ~2750 cycles for the whole thing)
[16:29:29] <BBB> not bad
[16:30:44] <BBB> that's still 4x faster than the C code :-)
[16:31:09] <lu_zero> you are crunching 4x the number of bytes
[16:31:22] <lu_zero> so it's pretty much what you'd expect
[16:32:15] <BBB> http://ffmpeg.pastebin.com/3ZkJBGqX
[16:32:30] <BBB> had to rearrange a few variables for the register-save to work
[16:33:13] <lu_zero> # paddd     mm0, mm3                     ; add to 2nd 2px cache
[16:33:13] <lu_zero> # pxor      mm3, mm3
[16:33:13] <lu_zero> # punpcklbw mm2, mm3                     ; byte->word FGHI
[16:33:25] <lu_zero> doesn't look nice
[16:33:29] <BBB> ?
[16:33:42] <BBB> mm3 is my zero variable, I need to clear it before reusing it as such
[16:35:35] <lu_zero> probably if you use a different register and move the pxor an instruction farther from punpcklbw
[16:35:44] <lu_zero> it might get faster
[16:35:50] <BBB> yeah, I see what you mean, but that's hard
[16:35:54] <BBB> because I'm reg-starved
[16:36:10] <BBB> 4/5/6/7 cannot be touched
[16:36:14] <BBB> 2 is taken for the load
[16:36:19] <BBB> 1 is the final result
[16:36:26] <BBB> 1/3 need to be added as intermediate products
[16:36:34] <BBB> so I need to use 3 and need it as a zero right after
[16:36:51] <BBB> 1/3 = 0/3 of course
[16:37:46] <BBB> if you see an obvious way to do it I'd of course try :)
[16:41:31] <Vitor1001> BBB: How many times does the loop run?
[16:41:36] <BBB> 4
[16:42:01] <Vitor1001> Ok, because I think you can get the dec r3 out of the main loop...
[16:42:15] <BBB> ?
[16:42:46] <Vitor1001> you could start with a negative value of r1
[16:42:53] <Vitor1001> and increase it until it reaches 0
[16:43:11] <Vitor1001> Ow, scrap that, that's the pointer ;)
[16:43:20] <BBB> :-p
[16:43:55] <BBB> r0 is dest-src, r1=src
[16:44:59] <BBB> I could change r4 into stride*(h-1)
[16:45:10] <BBB> and then loop backwards instead of forward
[16:45:16] <BBB> but I doubt that'd be faster
[16:45:23] <Vitor1001> I see...
[16:45:25] <twnqx> a loop of length 4 sounds almost like "unroll me" if it saves a register
[16:45:35] <BBB> (and then access r0+r4 and r1+r4 instead of the current way)
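BBB's backwards-loop idea uses a single offset, starting at stride*(h-1) and stepping down by stride, as both the row address and the loop condition, so no separate row counter is needed. A C sketch of that loop shape follows; the row operation is a plain copy standing in for the real filter, and the function name and parameters are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: walk the rows from last to first using
 * one byte offset that also serves as the loop counter. */
static void copy_rows_backwards(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride, int width, int h)
{
    assert(h > 0 && width > 0 && stride >= width);
    for (ptrdiff_t off = stride * (h - 1); off >= 0; off -= stride)
        memcpy(dst + off, src + off, (size_t)width);  /* filter goes here */
}
```

Whether this beats a forward loop with a separate counter depends on the CPU, and as BBB says, it may well make no measurable difference.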
[16:46:10] <lu_zero> BBB: try unrolling
[16:46:19] <BBB> Dark_Shikari told me not to :-p
[16:46:27] <lu_zero> unroll+remap should let you avoid stalls
[16:46:37] <lu_zero> uhmm
[16:46:38] <BBB> I can try the macro way I guess
[16:46:54] <lu_zero> cache boundaries?
[16:46:57] <BBB> doesn't it quadruple codesize?
[16:47:03] <twnqx> yes.
[16:47:39] <lu_zero> so if you have a cache miss it might get way slower
[16:47:51] <kierank> BBB: you can join #x264dev and ask holger if you want more asm help if Dark_Shikari's not around
[16:47:58] <lu_zero> worth trying just for educational purpose
[16:48:11] <BBB> ok
[16:48:16] <BBB> kierank: who's he?
[16:48:24] <BBB> isn't pengvado also an asm god?
[16:48:41] <kierank> holger wrote some of x264's asm magic
[16:48:54] <lu_zero> siretart: is ffmpeg 0.6 already in ubuntu 10.04?
[16:49:03] <kierank> with ridiculous speedups
[16:49:25] * lu_zero is baking a dummy-proof box
[16:50:41] <BBB> nah, same speed
[16:51:23] <BBB> thanks for the idea though :)
[16:51:39] <lu_zero> let me see the code
[16:52:03] <BBB> http://ffmpeg.pastebin.com/3ZkJBGqX
[16:52:35] <pengvado> pxor is latency 1. you don't really need to hoist it.
[16:53:02] <lu_zero> # movq      mm1, mm2                     ; byte ABCD..
[16:53:04] <lu_zero> uhm?
[16:53:11] <BBB> I need mm2 later
[16:53:15] <BBB> so I need a copy
[16:53:32] <BBB> (for CDEF/EFGH)
[16:53:40] <BBB> for mm1, I only care about ABCD
[16:53:44] <BBB> maybe I can make it a movd
[16:53:50] <BBB> is that faster?
[16:54:00] <lu_zero> so you have mov+mov
[16:54:11] <lu_zero> pengvado: which is the load latency?
[16:54:55] <BBB> compiler complains that I can't use mov/movd on two mm registers
[16:55:02] <BBB> unfortunate :-(
[16:56:52] <BBB> let me try the vertical 4-tap function
[16:56:57] <BBB> that should be easy also
[16:57:13] <pengvado> load latency is 3 cycles from L1, 15 from L2, or 300 from main memory.
[16:57:19] <BBB> Yuvi: can I commit this or would you like to merge the plain vp8 decoder without asm first?
[16:58:32] <lu_zero> ah
[16:59:21] <lu_zero> what about paddd ?
[16:59:41] <pengvado> 1/.5
[17:20:13] <enkidu> hello
[17:20:57] <enkidu> Dark_Shikari: what did you do in ffmpeg 0.6 x264 decoder, that it is able to play smoothly 720p on Atom processors?
[17:21:49] <BBB> there is no x264 decoder
[17:21:59] <enkidu> h264
[17:22:01] <elenril> x264 decoder? in my ffmpeg?
[17:22:03] <enkidu> sorry
[17:22:31] * enkidu is after 12 hours of real life...
[17:22:32] <elenril> and it wasn't D_S who did it
[17:22:49] <enkidu> so who did these opts?
[17:22:53] <janneg> it was Michael
[17:23:02] <kierank> atom sucks
[17:23:16] <kierank> use the hardware acceleration that's probably present
[17:24:00] <enkidu> VO: [xv] 1280x720 => 1280x720 Planar YV12  [zoom]
[17:24:20] <av500> enkidu: 1080p plays fine on atom
[17:24:31] <av500> if you use the HW decoder in the chipset... :)
[17:24:36] <enkidu> yeah...
[17:24:46] <mru> av500: that's not playing on atom
[17:25:04] <enkidu> but mine probably doesn't have a hw decoder
[17:25:15] <kierank> go and find out
[17:25:15] <av500> enkidu: but it was cheap! :)
[17:25:27] * mru doesn't believe in netbooks
[17:26:35] <Dark_Shikari> BBB: unrolling doesn't help speed unless it lets you save instructions
[17:26:42] <Dark_Shikari> Not on an OOE arch, that is.
[17:27:18] <BBB> I reduced the number of calls by another 2 or 3 on the function we did yesterday by doing pshufw, and limited it to 1 load per loop iteration
[17:27:28] <BBB> I'll let you see in a bit
[17:28:10] <Dark_Shikari> keep in mind pshufw is mmxext, so mark the function as _mmxext instead of mmx
[17:28:38] <BBB> oh :-(
[17:28:48] <BBB> does it matter?
[17:28:54] <BBB> should I make a plain mmx version also?
[17:29:03] <Dark_Shikari> Not unless you care about pentium 2
[17:29:07] <Dark_Shikari> or amd k6
[17:29:11] <BBB> see, this is so silly if the manual doesn't tell me what instruction set a function belongs to
[17:29:50] <Dark_Shikari> we don't care about it in x264.
[17:31:26] <BBB> ok
[17:33:15] <BBB> Dark_Shikari: http://ffmpeg.pastebin.com/raSzU6vh <- my current versions
[17:33:36] <BBB> pshufw is amazing by the way
[17:33:59] <Dark_Shikari> Is it faster?
[17:34:36] <BBB> it was 10% faster than yesterday's version with 1 load, which was 10% faster than the one we looked at (with 4 loads)
[17:34:53] <BBB> the sixtap one is 4x as fast as the C version
[17:35:00] <Dark_Shikari> You should pipeline things a bit more.
[17:35:03] <Dark_Shikari> i.e.
[17:35:05] <BBB> and has less instructions and less memloads (3 vs 6) compared to libvpx
[17:35:07] <Dark_Shikari> pshufw/pshufw/punpck/punpck
[17:35:12] <Dark_Shikari> you have a bit too much linear dependency there
[17:35:15] <Dark_Shikari> won't hurt on OOE but it's just ugly
[17:35:29] <Dark_Shikari> pshufw mm0/punpck mm0/pshufw mm3/punpck mm3
[17:35:31] <Dark_Shikari> should be
[17:35:37] <Dark_Shikari> pshufw mm0/pshufw mm3/punpck mm0/punpck mm3
[17:35:50] <Dark_Shikari> and yes pshufw is amazing.
[17:36:23] <Dark_Shikari> You should be consistent with your syntax
[17:36:27] <Dark_Shikari> 0x94 and 9 as pshufw arguments?
[17:36:29] <Dark_Shikari> use 0x for both
[17:36:35] <BBB> oh right
[17:36:41] <Dark_Shikari> Other things to note
[17:36:43] <Dark_Shikari> movq mm1, mm2
[17:36:45] <Dark_Shikari> punpcklbw mm1, mm6
[17:36:54] <Dark_Shikari> Either:
[17:36:59] <Dark_Shikari> a) move the thing that uses mm2 to right after movq
[17:37:08] <Dark_Shikari> b) swap mm1 and mm2 for all instructions after movq
[17:37:20] <Dark_Shikari> this decreases the instruction chain length by 1
[17:37:26] <BBB> ?
[17:37:28] <Dark_Shikari> i.e. if you do a mov from a to b, use _a_ immediately after
[17:37:29] <Dark_Shikari> not b
[17:37:40] <Dark_Shikari> because b hasn't been written yet
[17:37:43] <BBB> really?
[17:37:46] <BBB> ok
[17:37:50] <Dark_Shikari> well obviously, the mov takes 1 cycle
[17:37:59] <BBB> I'll just invert the calls before that, less effort
[17:37:59] <Dark_Shikari> so a will be available one cycle before b.
[17:38:10] <Dark_Shikari> Note: this isn't meaningful on fancy CPUs.
[17:38:16] <Dark_Shikari> But, say, an atom might care a lot.
[17:38:43] * BBB wonders if he cares
[17:38:48] <Dark_Shikari> It's good form
[17:38:50] <BBB> I probably should :-p
[17:38:58] <Dark_Shikari> it doesn't take any extra code, doesn't make things uglier
[17:39:15] <enkidu> anyways
[17:40:02] <enkidu> do you remember first infos about h264? "to decode H264 10ghz processor will be needed"
[17:40:11] <BBB> mov access order changed as suggested
[17:40:18] <BBB> let me look at the pipelining a bit more
[17:40:20] <enkidu> as in one of articles from 2005
[17:40:22] <mru> enkidu: said who?
[17:40:28] <mru> certainly nobody with a clue
[17:40:32] <Dark_Shikari> Probably divx
[17:40:40] <Dark_Shikari> they were convinced that h264 was too complicated to ever implement in hardware
[17:40:43] <mru> h264 was designed to be usable
[17:40:48] <Dark_Shikari> There's some particular irony in that
[17:41:03] <mru> h264 was specifically designed to be implemented in hardware
[17:41:07] <Dark_Shikari> exactly
[17:41:20] <enkidu> mru: dunno, it was old article. most ppl were using first p4 then
[17:41:20] <mru> that's where the big money is
[17:41:45] <mru> and yet a 300MHz hw decoder does it without breaking a sweat
[17:42:14] <Dark_Shikari> and even a 3ghz p4 can do 720p h264
[17:42:59] <enkidu> the article was from the age of beating frequency boundaries
[17:43:14] <enkidu> when Intel was on increasing-clock line
[17:44:35] <mru> this i5 laptop plays it nicely at the lowest speed setting
[17:44:59] <BBB> ok, changed the order for pipelining also
[17:45:01] <Dark_Shikari> this i7 laptop plays 4K fine =p
[17:45:12] <BBB> now I need to work on the vertical one
[17:45:19] <av500> mru: new laptop?
[17:45:23] <mru> yeah
[17:45:30] <BBB> or maybe I should do the 8x8/16x16 horizontal-only ones
[17:45:33] <Dark_Shikari> BBB: fyi, for vertical, it's a bit different
[17:45:35] <av500> mru: the sony?
[17:45:38] <mru> yep
[17:45:38] <BBB> Dark_Shikari: I noticed
[17:45:45] <BBB> Dark_Shikari: I was going to look at how x264 does it ;)
[17:45:52] <Dark_Shikari> BBB: the relevant code is the hpel code
[17:45:56] <Dark_Shikari> that's the most similar thing in x264
[17:46:18] <av500> mru: model?
[17:46:25] <av500> (i forgot)
[17:46:26] <mru> z something
[17:46:32] <Dark_Shikari> BBB: mc-a2.asm, lines 144-162
[17:46:41] <Dark_Shikari> Yes, that's an entire row done in 6 multiplies ;)
[17:46:43] <mru> i5, 8GB
[17:46:45] <Dark_Shikari> Of course you're doing mmx.
[17:47:15] <mru> runs linux nicely
[17:47:16] <Dark_Shikari> BBB: x264 actually uses shift/add for non-ssse3 v filter
[17:47:18] <Dark_Shikari> so you can't do that.
[17:47:28] <Dark_Shikari> You'll probably find it most efficient to repeat your original H algorithm
[17:47:31] <Dark_Shikari> Since you won't need the pshufws.
[17:47:53] <Dark_Shikari> I still suggest you glance at the ssse3 one to see how awesome pmaddubsw is.
[17:48:04] <BBB> hehehe :)
[17:48:24] <Dark_Shikari> SBUTTERFLY, fyi, is ABCDEFGH, IJKLMNOP -> AIBJCKDL and EMFNGOHP.
[17:48:37] <Dark_Shikari> aka interleave bottom halves, interleave top halves
[17:48:59] <BBB> let me guess, ssse3 has some awesome instruction for that
[17:49:30] <Dark_Shikari> no
[17:49:35] <Dark_Shikari> it's just mova, punpcklbw, punpckhbw
[17:49:44] <BBB> oh, ok, that's what I do too
[17:49:49] <BBB> just without the macro
[17:49:52] <Dark_Shikari> SBUTTERFLY just does the swap for you
[17:49:57] <Dark_Shikari> so you don't have to track the registers
[17:50:04] <Dark_Shikari> i.e. it outputs to its inputs
[17:50:17] <BBB> but it needs a temp reg right?
[17:50:17] <Dark_Shikari> this gets very important in say a transpose
[17:50:25] <Dark_Shikari> which ends up with a dozen or two dozen butterflies
[17:50:31] <Dark_Shikari> yes, the third argument is the temp reg
[17:50:35] <BBB> ah, of course
[17:50:36] <Dark_Shikari>     SBUTTERFLY bw, 1, 4, 7
[17:50:36] <Dark_Shikari>     SBUTTERFLY bw, 2, 5, 7
[17:50:36] <Dark_Shikari>     SBUTTERFLY bw, 3, 6, 7
[17:50:40] <Dark_Shikari> "bw" is the size.
[17:50:49] <BBB> got it
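The SBUTTERFLY behaviour described above (copy, punpcklbw, punpckhbw, with results landing back in the input registers) can be modelled in a few lines of Python. This is only a sketch: the function names are hypothetical stand-ins for the instructions, lists of bytes stand in for mm registers, and "low half" is taken to mean the first bytes in memory order.

```python
def punpcklbw(a, b):
    """Interleave the low halves of two equal-length byte lists."""
    half = len(a) // 2
    return [x for pair in zip(a[:half], b[:half]) for x in pair]


def punpckhbw(a, b):
    """Interleave the high halves of two equal-length byte lists."""
    half = len(a) // 2
    return [x for pair in zip(a[half:], b[half:]) for x in pair]


def sbutterfly(a, b):
    """Model of the macro: returns the new contents of (a, b)."""
    tmp = list(a)            # mova      tmp, a   (the temp register)
    lo = punpcklbw(a, b)     # punpcklbw a, b
    hi = punpckhbw(tmp, b)   # punpckhbw tmp, b
    return lo, hi            # the macro then renames tmp to take b's place
```

Calling `sbutterfly(list("ABCDEFGH"), list("IJKLMNOP"))` gives the two interleaved rows quoted in the log, which is why a transpose can be written as a sequence of butterflies without tracking which register holds which row.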
[17:51:00] <BBB> I'm not really using macros yet
[17:51:15] <BBB> yesterday I rewrote my function (after decreasing loads to 1) to a macro for first 2px and second 2px
[17:51:16] <Dark_Shikari> x264 has an x86util.asm file
[17:51:23] <Dark_Shikari> which contains macros you can use
[17:51:29] <BBB> then I moved to using pshufw
[17:51:31] <Dark_Shikari> this is in ffmpeg too iirc
[17:51:34] <BBB> it is
[17:51:38] <BBB> SBUTTERFLY is there?
[17:51:41] <Dark_Shikari> Yes, I think so.
[17:51:45] <Dark_Shikari> it's used for the transposes.
[17:52:34] <Dark_Shikari> see lines 68-79 of x86util for why sbutterfly is kinda important
[17:52:48] <Dark_Shikari> in x264, at least.  the ffmpeg one might be a bit older.
[17:53:36] <BBB> I see it, hard to keep track
[17:53:54] <BBB> I don't really have a clear butterfly "pattern" here, so I won't use it for now, but I'll keep it in mind
[17:55:11] <BBB> btw that mc-a2.asm thing, pmaddubsw is awesome but I don't have it ;)
[17:55:38] <Dark_Shikari> This is why ssse3 is so great for MC
[17:55:46] <CIA-98> ffmpeg: mstorsjo * r23643 /trunk/libavformat/rtsp.c:
[17:55:46] <CIA-98> ffmpeg: RTSP: Clean up rtsp_hd on failure
[17:55:46] <CIA-98> ffmpeg: Since rtsp_hd isn't assigned to rt->rtsp_hd until after the setup phase,
[17:55:46] <CIA-98> ffmpeg: the initialized URLContext could be leaked on failures.
[17:55:51] <BBB> you'll have to buy me a new cpu for that
[17:56:44] <Dark_Shikari> We can do that.
[17:56:50] <Dark_Shikari> Or just give you an SSH connection to someone who has one.
[17:57:58] <BBB> it has to scratch my itch
[17:58:04] <BBB> just doing it for someone else isn't very useful
[17:58:05] <BBB> :-p
[17:58:19] <lu_zero> eh eh
[17:58:33] <BBB> http://store.apple.com/us/browse/home/shop_mac/family/macbook_pro?mco=MTAyNTQzMzk <- does that one have ssse3?
[17:58:43] <BBB> if so, please buy me one with some extra features
[17:58:48] <av500> it has itunes!
[17:59:10] <Honoome> I think the c2d has ssse3 yeah
[17:59:31] <lu_zero> pfff
[17:59:33] * Honoome found that out when trying to run lu_zero's mplayer static binary
[17:59:45] <lu_zero> sorry...
[17:59:46] <Honoome> on a system that lacks ssse3 that is
[18:00:04] <mru> maybe I should set up the old c2q as ffmpeg dev system
[18:00:09] <mru> along with the g4
[18:00:12] <Honoome> lu_zero: don't worry, I'll give you an sse4.2-compiled feng next time ;)
[18:00:36] <lu_zero> not a problem if qemu or valgrind could run it
[18:00:49] * lu_zero is thinking about updating his laptop anyway
[18:00:58] <Honoome> valgrind has trouble with sse4.1
[18:01:10] <lu_zero> sigh
[18:01:15] <BBB> I like the luxury 15" one, it's only $2200
[18:01:15] <mru> lu_zero: the sony z is nice
[18:01:15] <Honoome> I would hope I'll never have to use valgrind on _this_ system
[18:01:20] <BBB> with some extra features probably $2500
[18:01:22] <BBB> not too bad
[18:01:28] <mru> 13" 1920x1080
[18:01:35] <Dark_Shikari> c2ds are cheap
[18:01:37] <Dark_Shikari> they're like $100
[18:01:55] * Honoome feels quite at home with the dell e6510
[18:02:17] <Honoome> beside the touchpad, and the fact I forgot _again_ to drop the governor to powersave when running battery
[18:02:26] <Honoome> I'm a lousy laptop user
[18:03:35] <BBB> hmm....
[18:03:38] <BBB> $2800
[18:03:39] <BBB> that's a lot
[18:03:46] <BBB> but it's ok, I figured it'd be up to $3k
[18:04:05] <BBB> anyone wanna buy me one?
[18:04:11] <siretart> lu_zero: source yes, but it hasn't built yet
[18:04:28] <Dark_Shikari> BBB: how about a $100 core 2
[18:04:38] <BBB> what do I do with it?
[18:04:39] <Dark_Shikari> as opposed to a $3k one
[18:04:44] <Dark_Shikari> you put it in a $50 board
[18:04:49] <BBB> I don't have a $50 board
[18:04:55] <Dark_Shikari> you buy one
[18:05:02] <BBB> and do what with it? :-p
[18:05:06] <mru> Dark_Shikari: I think he wants a laptop
[18:05:09] <Dark_Shikari> plug it into your power supply
[18:05:09] <BBB> you don't get it, I don't have a desktop
[18:05:12] <Dark_Shikari> mru: I want a pony
[18:05:15] <Dark_Shikari> BBB: you buy a $30 power supply
[18:05:16] <Dark_Shikari> a $20 case
[18:05:31] <mru> Dark_Shikari: then you curse the thing for as long as it runs
[18:05:33] <BBB> where do I put it? I live in a friggin' manhattan-style shoebox-size apartment
[18:05:37] <BBB> I have no space for a desktop
[18:05:40] <BBB> my wife will kill me
[18:05:49] <Dark_Shikari> get a mini-atx
[18:05:54] <BBB> plus I can't carry it around with me
[18:05:54] <lu_zero> siretart: updating from 9. to 10.4 is sloooow...
[18:06:18] <enkidu> BBB: you can try barebone with lcd
[18:06:20] <Dark_Shikari> mru: and who cares?
[18:06:27] <Dark_Shikari> one day of BBB's time is worth more than a core 2 box
[18:06:31] <Honoome> lu_zero: please tell me we're not going to use ubuntu next week...
[18:06:31] <mru> I do, if I'm the one cursing
[18:08:24] <BBB> they're about the same
[18:08:34] <BBB> if you work 8 hrs for $250/hr, that's $2k/day
[18:08:43] <BBB> isn't that what a desktop costs nowadays?
[18:08:49] <enkidu> it is
[18:08:50] * BBB has no clue about desktop prices
[18:09:02] <BBB> don't forget taxes on income
[18:09:12] <enkidu> I bought my netbook for $150
[18:09:15] <Dark_Shikari> you're not worth $250/hr
[18:09:27] <BBB> probably not
[18:09:47] <BBB> but for the few things I do, I get close to that
[18:10:05] <Honoome> more like €30/hr :|
[18:10:35] <BBB> actually that's not true, you're right, I get less
[18:10:43] <BBB> anyway
[18:10:49] <Vitor1001> BBB: I saw sixtap_filter is symmetric. Can you replace a load by a shuffle?
[18:11:21] <BBB> Vitor1001: unfortunately no
[18:11:37] <BBB> Vitor1001: it's symmetric across "mx boundary", not within
[18:11:51] <BBB> mx is constant within one function call
[18:12:09] <siretart> lu_zero: depends on the hardware, but in general, yes
[18:12:22] <Vitor1001> Ok, I see.
[18:15:05] * lu_zero murmurs something about having gentoo on the same hw taking the same time...
[18:15:45] <lu_zero> hopefully the 10.10 will be leaner
[18:18:52] <mru> no, but by then computers will be even faster
[18:30:11] <lu_zero> uff
[18:30:22] <lu_zero> is _still_ updating...
[18:31:12] <mru> what are you doing?
[18:57:58] <Dark_Shikari> pengvado: crazy idea for a lossless format.
[18:58:07] <Dark_Shikari> Every single code is 1 byte.
[18:58:14] <Dark_Shikari> Each byte code maps to a variable number of _pixels_
[18:58:33] <Dark_Shikari> This number of pixels is <= WORD_SIZE, so they can be branchlessly written in one unaligned store.
[18:59:03] <Dark_Shikari> so you write a 32-bit code to the bitstream, and increment the pointer by a variable amount from your code table.  No branches.
[18:59:40] <Dark_Shikari> Escape codes are simple: a table lookup results in a zero number of bytes written and the next byte is a raw pixel.
[18:59:47] <Dark_Shikari> (next byte from the bitstream)
[19:00:00] <Dark_Shikari> 2-byte codes would allow eliminating escapes, but it'd exceed L1 cache.
[19:00:16] <Dark_Shikari> This system would allow 100% branchless decoding with the exception of escapes.
[19:00:41] <Dark_Shikari> my question is how you optimize such a code table.
[19:00:57] <mru> sounds like some kind of VQ
[19:01:04] <Dark_Shikari> normally you choose code lengths to match probabilities -- but here you want to choose pixel lengths to make probabilities equal
[19:01:19] <Dark_Shikari> huffyuv already does VQ -- two pixels per code iirc
[19:01:29] <Dark_Shikari> this is constant-code-length VQ
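A minimal sketch of the decoder loop for the fixed-length-code idea above, assuming one byte per pixel and WORD_SIZE = 4. The table contents and every name here are hypothetical; the log does not specify a table. Each code always stores a full word and then advances the output pointer by the entry's real length, so the only branch is the escape (length 0), where the next stream byte is a raw pixel.

```python
WORD_SIZE = 4  # assumed pixels per store, one byte per pixel


def make_entry(pixels):
    """Pad a run of pixels to a full word and remember its real length."""
    return bytes(pixels).ljust(WORD_SIZE, b"\0"), len(pixels)


# Hypothetical toy table: every code not listed below is an escape.
TABLE = [make_entry(())] * 256
TABLE[1] = make_entry((0, 0, 0, 0))
TABLE[2] = make_entry((0, 0))
TABLE[3] = make_entry((0,))


def decode(stream, npixels, table=TABLE):
    # Padding lets the last full-word store run past the real end.
    out = bytearray(npixels + WORD_SIZE)
    src = dst = 0
    while dst < npixels:
        word, count = table[stream[src]]
        src += 1
        if count == 0:
            # Escape: the next byte of the stream is a raw pixel.
            out[dst] = stream[src]
            src += 1
            dst += 1
        else:
            # Always store a whole word; later stores overwrite the padding.
            out[dst:dst + WORD_SIZE] = word
            dst += count
    return bytes(out[:npixels])
```

For example, `decode(bytes([1, 2, 0, 100, 3]), 8)` expands five input bytes into eight pixels, with the `0, 100` pair being an escaped raw pixel.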
[19:06:44] <BBB> is there like a "word-splat" instruction for mmx/mmxext? to take one (or just the lowest) word of a mm register and splat it over a target register?
[19:06:55] <astrange> huffyuv uses one vlc for every pixel channel
[19:07:12] <astrange> the ffmpeg decoder just uses a joint vlc table so it can read more than one at once
[19:10:34] <Dark_Shikari> BBB: pshufw
[19:10:39] <BBB> oh of course
[19:10:41] <BBB> duh
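For reference, pshufw picks each of the four destination words using two bits of its immediate, so a splat of the lowest word is an immediate of 0x00. A small Python model of that selection (the function is a stand-in for the instruction, with a 4-element list standing in for an mm register):

```python
def pshufw(src, imm8):
    """Destination word i is source word number (imm8 >> 2*i) & 3."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


words = [10, 20, 30, 40]           # word 0 is the lowest word
splat_low = pshufw(words, 0x00)    # every selector is 0
identity = pshufw(words, 0xE4)     # selectors 0, 1, 2, 3
overlap = pshufw(words, 0x94)      # selectors 0, 1, 1, 2
```

With 0x94 the result is words 0, 1, 1, 2 of the source, which is the kind of overlapping rearrangement that makes a single load reusable across several filter taps.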
[19:30:25] <BBB> Dark_Shikari: http://ffmpeg.pastebin.com/7dxa6qM2 is that any good?
[19:30:49] <BBB> actually the 4tap v4 was easy
[19:31:37] <Dark_Shikari> yeah, it's supposed to be
[19:31:47] <Dark_Shikari> Oh, you do that trick to avoid repeating row loads
[19:31:48] <Dark_Shikari> good idea
[19:32:05] <Dark_Shikari> wait what's with the splatting of coefs
[19:32:23] <Dark_Shikari> you're repeating that in every row
[19:32:24] <Dark_Shikari> what a waste
[19:32:44] <Dark_Shikari> you would be better pmullw'ing with memory
[19:33:12] <mru> can sse multiply by one element from a vector?
[19:33:17] <Dark_Shikari> no
[19:33:37] <mru> I find that useful
[19:33:51] <Dark_Shikari> BBB: keep in mind in x86, memory loads are FREE as long as the memory unit is not saturated and there's no risk of a cache miss
[19:33:56] <mru> keeping coeffs in a single reg
[19:34:09] <BBB> removing the loads makes it faster though
[19:34:11] <BBB> so it's not free
[19:34:13] <Dark_Shikari> thus, it's better to pmullw against memory than to add actual new ops
[19:34:16] <BBB> (I tested that for the h4, not the v4)
[19:34:17] <Dark_Shikari> Which loads?
[19:34:20] <mru> Dark_Shikari: same is true on cortex-a8
[19:34:22] <Dark_Shikari> The pixel loads?
[19:34:25] <BBB> yes
[19:34:27] <mru> but usually you end up being memory-bound
[19:34:29] <BBB> or you mean the coeff load?
[19:34:31] <Dark_Shikari> Yes, that's because those can be cache misses
[19:34:34] <Dark_Shikari> Because those can cross a cacheline
[19:34:46] <Dark_Shikari> An aligned load off a global constant will never cross a cache line
[19:35:01] <Dark_Shikari> it is better to pmullw against memory than to pshufw to create the multiplication factors
[19:35:09] <mru> no aligned load will cross a cache line
[19:35:11] <mru> for obvious reasons
[19:35:12] <Dark_Shikari> exactly
[19:35:20] <Dark_Shikari> Also, I strongly suggest you pipeline things
[19:35:25] <Dark_Shikari> that is, place all the pmullws next to each other
[19:35:27] <Dark_Shikari> this is better for OOE
[19:35:42] <mru> better than mixing with load/store?
[19:35:43] <Dark_Shikari> and readability
[19:35:44] <BBB> I will pipeline after the function itself is satisfactory ;)
[19:36:03] <Dark_Shikari> mru: load/store doesn't use arithmetic units, so it doesn't matter
[19:36:14] <Dark_Shikari> what matters is getting the cpu to use p1 for multiply first
[19:36:22] <Dark_Shikari> modern intel chips have three alus, p0 p1 and p5
[19:36:33] <Dark_Shikari> adds can use all three, shuffles can use p0 (p0 and p5 on nehalem)
[19:36:35] <mru> separate address gen unit?
[19:36:36] <Dark_Shikari> multiply can only use p1
[19:36:38] <Dark_Shikari> yes
[19:36:47] <Dark_Shikari> p1 also does float stuff
[19:36:54] <Dark_Shikari> p1 is generally _horribly_ underused in most integer code
[19:37:02] <Dark_Shikari> because it can only be used by moves, shifts, and multiplies, iirc
[19:37:15] <Dark_Shikari> But, when OOE is selecting which execution unit to use for, say, an add
[19:37:17] <Dark_Shikari> it isn't smart
[19:37:20] <Dark_Shikari> it will just pick the first available one
[19:37:23] <pengvado> Dark_Shikari: int8 codes for int32 pixel-blocks puts an upper bound of 4 on the compression ratio
[19:37:26] <pengvado> this is very bad
[19:37:28] <Dark_Shikari> So you want to get p1 used up by multiply as soon as possible
[19:37:32] <Dark_Shikari> pengvado: I meant for something huffyuv-like.
[19:37:38] <Dark_Shikari> Not for, say, ffv2.
[19:37:54] <pengvado> in huffyuv, more than 1/2 of all samples are 0
[19:38:06] <Dark_Shikari> huffyuv rarely gets more than 2-2.5x compression
[19:38:33] <Dark_Shikari> even on easy stuff like anime
[19:38:42] <pengvado> that doesn't mean it doesn't suffer when you double the bitrate of the low residual sections
[19:38:49] <BBB> Dark_Shikari: so how do I multiply by a mem constant? you think I should create a RODATA with the 4x repeated coeffs?
[19:38:54] <BBB> that sounds wasteful
[19:38:56] <Dark_Shikari> pengvado: what's the cap on huffyuv?
[19:39:01] <Dark_Shikari> BBB: yes
[19:39:06] <pengvado> 8x. 1 bit per sample.
[19:39:18] <Dark_Shikari> pengvado: so if we used WORD_SIZE=4, that would be the same limit as huffyuv
[19:39:21] <Dark_Shikari> er, =8
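The ratio caps being compared here are simple arithmetic. A sketch, assuming 8-bit samples and that the best case for each scheme is its shortest code covering as many samples as it can (function names are hypothetical):

```python
def max_ratio_fixed_code(code_bytes, word_size_pixels, bytes_per_pixel=1):
    """Best case: every code expands to a full word of pixels."""
    return word_size_pixels * bytes_per_pixel / code_bytes


def max_ratio_vlc(min_code_bits, bits_per_sample=8):
    """Best case: every sample is coded with the shortest codeword."""
    return bits_per_sample / min_code_bits


one_byte_code_word4 = max_ratio_fixed_code(1, 4)   # 4.0
one_byte_code_word8 = max_ratio_fixed_code(1, 8)   # 8.0
huffyuv_like = max_ratio_vlc(1)                    # 8.0
```

So a 1-byte code over 4-pixel words caps at 4x, and widening the word to 8 pixels matches the 8x cap of a 1-bit-per-sample VLC.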
[19:40:07] <mru> Dark_Shikari: your constant-code-length vq should be very fast to decode
[19:40:12] <Dark_Shikari> mru: that's the idea
[19:40:20] <mru> did you intend it to be lossless or lossy?
[19:40:23] <Dark_Shikari> lossless
[19:40:30] <mru> why not make a lossy version?
[19:40:36] <Dark_Shikari> that would be interesting
[19:40:48] <pengvado> lossy version is called CYUV
[19:40:52] <Dark_Shikari> CYUV?
[19:41:01] <pengvado> though that's not VQ
[19:41:08] <pengvado> just ADPCM for video
[19:41:18] <Dark_Shikari> ah lol
[19:41:22] <Dark_Shikari> Actually -- if it was lossy, you could completely eliminate the escape codes.
[19:41:34] <Dark_Shikari> you could allocate, say, half the table for common combinations of pixels
[19:41:38] <Dark_Shikari> and the other half for _quantized_ single pixels
[19:41:51] <Dark_Shikari> or whatever combination is RD-wise the best
[19:41:58] <Dark_Shikari> then you could have the entire decoder 100% branchless like cyuv
[19:42:09] <pengvado> thing is, I suspect it would be slower to decode than JPEG since it would require much higher bitrate per quality
[19:42:25] <Dark_Shikari> But decoding would be vastly simpler
[19:42:52] <Dark_Shikari> But hmm.  Might be right on that.
[19:42:54] <Dark_Shikari> though idct is slow
[19:43:00] <pengvado> so use hadamard instead
[19:43:17] <Dark_Shikari> hadamard works for real compression?
[19:43:29] <pengvado> or ihct
[19:43:33] <Dark_Shikari> yeah.
[19:43:43] <mru> whatever h264 uses is fast
[19:43:47] <mru> the transform
[19:44:01] <Dark_Shikari> hct
[19:44:04] <Dark_Shikari> h264 cosine transform
[19:44:10] <Dark_Shikari> anyways I think this would be more interesting for lossless
[19:45:02] <Dark_Shikari> I just don't know how to optimize such a table, that's the problem
[19:45:15] <Dark_Shikari> it seems like it shouldn't be too bad, it's the inverse of huffman
[19:46:06] <pengvado> CABGT is the inverse of huffman
[19:46:56] <Dark_Shikari> why's that?
[19:47:23] <Dark_Shikari> I would think the opposite of variable-length codes containing constant amounts of information is constant-length codes containing variable amounts of information
[19:47:57] <pengvado> CABGT literally uses a reverse huffman coder. i.e. a vlc reader in the encoder and a vlc writer in the decoder.
[19:48:07] <Dark_Shikari> lol
[19:48:15] <Dark_Shikari> well so that's another way of having a "reverse"
[19:49:11] <Dark_Shikari> BBB: fyi, probably the best way to do the hv positions is to generate h data and v-filter it
[19:49:24] <pengvado> problem is that the optimal fixed length code containing a variable amont of information must assign one and only one token to the prefix of any data stream
[19:49:30] <BBB> you mean for h&&v subpel?
[19:49:32] <Dark_Shikari> yes
[19:49:46] <Dark_Shikari> pengvado: explain?
[19:49:47] <pengvado> (which corresponds to the constraint that huffman must uniquely decode any bitstream)
[19:49:51] <BBB> I think I was just going to write a quick wrapper that calls my hxvy functions ;)
[19:50:03] <Dark_Shikari> BBB: ?
[19:50:43] <BBB> just place a temp buffer of 9x4 pixels (?) on the stack and use it by calling the v-only and h-only functions
[19:50:58] <BBB> is that bad?
[19:51:16] <Dark_Shikari> oh you'll call H with a height of whatever
[19:51:18] <Dark_Shikari> and then V-filter it
[19:51:22] <Dark_Shikari> ok, that works.
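A rough Python model of that wrapper: run the horizontal filter over height+5 rows into a temporary buffer (9 rows of 4 pixels for a 4x4 block with a 6-tap filter), then run the vertical filter over the buffer. The tap values, the rounding (+64, shift by 7) and the clipping between passes are assumptions made for illustration, not taken from the pasted asm.

```python
TAPS = (3, -16, 77, 77, -16, 3)  # illustrative 6-tap kernel, sums to 128


def clip8(v):
    return max(0, min(255, v))


def filter_h(src, width, height):
    """Each input row needs width + 5 samples (2 left, 3 right of the block)."""
    return [[clip8((sum(t * row[x + k] for k, t in enumerate(TAPS)) + 64) >> 7)
             for x in range(width)]
            for row in src[:height]]


def filter_v(src, width, height):
    """Input needs height + 5 rows (2 above, 3 below the block)."""
    return [[clip8((sum(t * src[y + k][x] for k, t in enumerate(TAPS)) + 64) >> 7)
             for x in range(width)]
            for y in range(height)]


def filter_hv(src, width=4, height=4):
    tmp = filter_h(src, width, height + 5)  # the 9x4 temp buffer for 4x4
    return filter_v(tmp, width, height)
```

On a flat 9x9 input the output is the same flat value, which is a quick sanity check that the taps and rounding are consistent.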
[19:51:33] <Dark_Shikari> when I did this for h264 I wrote it all in asm.
[19:51:44] <BBB> hmm... yes but you are hardcore
[19:51:54] <Dark_Shikari> It also sucked my soul out.
[19:51:58] <pengvado> if you have 256 codes, and some of them are multiple pixels, you can't handle all possible pairs of pixels (let alone larger tuples). and most ways of handling subsets of possible pixel pairs leaves redundancy in the bitstream unless you have a DFA switching between lots of tables.
[19:52:04] * mru points at neon qpel code
[19:52:11] <Dark_Shikari> yeah, mru did it too
[19:52:16] <Dark_Shikari> and pengvado
[19:52:30] <BBB> mru: you may write the neon function to be totally awesome
[19:52:39] <BBB> in fact, maybe you can teach me neon and I'll test it on my iphone
[19:52:40] <mru> that's the most monstrous piece of asm I've ever written
[19:52:44] <Dark_Shikari> mru: same here
[19:52:51] <mru> ~1000 lines of intertwined functions
[19:52:52] <Dark_Shikari> the x86 version was the most monstrous for me
[19:52:58] <Dark_Shikari> of anything
[19:53:04] <Dark_Shikari> pengvado: wait, explain why it can't be optimal?
[19:53:18] <Dark_Shikari> Oh, you mean the fact that
[19:53:21] <Dark_Shikari> suppose I have "0 100"
[19:53:26] <Dark_Shikari> I won't have a code, so I need a code for "0"
[19:53:32] <Dark_Shikari> But I'll also have a code for "0 0 0"
[19:53:35] <BBB> yeah see, I'm not looking forward to writing 1000 lines of asm code just for fun while I just wrote my first asm like yesterday
[19:53:54] <Dark_Shikari> BBB: this is a reasonable approach
[19:53:58] <Dark_Shikari> anyways, after this, do 8x8
[19:54:00] <Dark_Shikari> or sse
[19:54:01] <BBB> it'll most likely not work and I'll pull my hair out figuring out why the h#ll ;)
[19:54:09] <Dark_Shikari> imo let's do 4x4hv first (your wrapper)
[19:54:13] <pengvado> right, so after coding "0", you either switch to another table that doesn't support anything starting with "0", or you waste bits.
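The redundancy in the "0" versus "0 0 0" example can be made concrete by counting how many different code sequences produce the same pixels. A small sketch, where the table is a hypothetical list of the pixel runs that have a code:

```python
def count_encodings(pixels, table):
    """Number of distinct code sequences that decode to `pixels`."""
    ways = [1] + [0] * len(pixels)  # ways[i]: encodings of pixels[:i]
    for i in range(1, len(pixels) + 1):
        for entry in table:
            n = len(entry)
            if n <= i and tuple(pixels[i - n:i]) == tuple(entry):
                ways[i] += ways[i - n]
    return ways[-1]


table = [(0,), (100,), (0, 0, 0)]
single = count_encodings((0, 100), table)     # only "0" then "100"
double = count_encodings((0, 0, 0), table)    # "0 0 0" or "0","0","0"
```

Any pixel run with more than one encoding means some code sequences are wasted, unless the decoder switches tables after each code so that the redundant sequences become impossible.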
[19:54:15] <Dark_Shikari> then do 4x4 sse (so you can get the hang of that)
[19:54:19] <BBB> I'll finish 4x4 first
[19:54:30] <BBB> didn't you say 8x8 was just a wrapper around 4x 4x4?
[19:54:32] <Dark_Shikari> pengvado: ouch
[19:54:40] <Dark_Shikari> BBB: not optimally
[19:54:49] <mchinen> wow you guys are hardcore
[19:54:51] <BBB> suboptimally :-p
[19:54:57] <mchinen> does everyone here write demuxers in asm?
[19:55:00] <Dark_Shikari> BBB: I would say the optimal way to do it is
[19:55:06] <mru> mchinen: no, we don't do demuxers in asm
[19:55:06] <BBB> mchinen: no, that's a waste of time
[19:55:10] <Dark_Shikari> 1) mmx is width 4.  w8 and w16 call it.
[19:55:14] <Dark_Shikari> 2) sse is width 8.  w16 calls it.
[19:55:21] <Dark_Shikari> 3) ssse3 is width 8 and width 16.  no wrappers.
[19:55:26] <mru> demux doesn't even show up on profile charts
[19:55:38] <Dark_Shikari> unless it's ogg?
[19:55:42] <mru> not even that
[19:55:57] <BBB> mchinen: did you talk to baptiste already?
[19:56:06] <mchinen> BBB: no, not yet
[19:56:15] <BBB> hmm...
[19:56:19] <BBB> did you ping him?
[19:56:34] <Dark_Shikari> BBB: do you get, now that you've written it, why width8 would probably not be worth writing in mmx?
[19:57:03] <mchinen> BBB: no, i just mailed him
[19:57:07] <BBB> yeah, it would just be a double-version of what I just had, because there's probably not a very much more optimal way to write it
[19:57:16] <Dark_Shikari> Yeah
[19:57:25] <BBB> although I guess I could write the final 8 bytes/row all at once
[19:57:28] <BBB> but that would save one call
[19:57:48] <BBB> but I'd lose a register holding the first 4bytes
[19:57:52] <BBB> so it would suck anyway
[19:57:53] <Dark_Shikari> yeah
[19:58:04] <BBB> hmk
[19:58:34] <BBB> I'll finish the 4x4 v modes, look briefly into making hv mix functions and then I'll go for sse1/2/whatever in 4x4 and 8x8
[19:58:45] <BBB> did I mention pshufw is awesome?
[19:58:51] <Dark_Shikari> Wait until you get to play with pshufb.
[19:59:06] <Dark_Shikari> It's almost unfun
[19:59:07] <Dark_Shikari> easy mode
[19:59:16] <BBB> pshufb is... sse2? or ssse3?
[19:59:25] <Dark_Shikari> ssse3
[19:59:35] <BBB> yeah not gonna happen, my crappy cpu doesn't love me
[19:59:46] <BBB> I'm getting my new laptop in a couple of months
[20:00:07] <Dark_Shikari> when what happens?  it's not like there's new tech coming out in 3 months
[20:01:00] <BBB> present :-p
[20:02:04] <BBB> I've got a 1/3rd gift cert, my work is paying 1/3rd and the last 1/3rd I'll get from my parents as a graduation gift once my PhD is done, = shiny new laptop that I just pointed out
[20:02:50] <Dark_Shikari> why not use ssh in the meantime
[20:07:52] <lu_zero> mru: preparing a foolproof setup
[20:08:34] <CIA-98> ffmpeg: fenrir * r23644 /trunk/ (4 files in 2 dirs):
[20:08:34] <CIA-98> ffmpeg: MPEG-2 DXVA2 implementation
[20:08:34] <CIA-98> ffmpeg:  It allows VLD MPEG-2 decoding using DXVA2 (GPU assisted decoding API under
[20:08:34] <CIA-98> ffmpeg: VISTA and Windows 7).
[20:08:34] <CIA-98> ffmpeg:  It is implemented by using AVHWAccel API.
[20:08:35] <lu_zero> BBB: yet another section for the foundation site
[20:09:59] <lu_zero> "feed us with hw"
[20:10:23] * _av500_ feeds lu_zero with obsolete TI EVMs
[20:13:06] <BBB> lu_zero: mplayerhq has that :-p
[20:13:21] <j-b> \o/ DxVA2 mpeg2
[20:13:28] <BBB> Dark_Shikari: to be able to more loudly make the point that I need hw :-p
[20:17:37] <mru> lu_zero: for every foolproof setup there is a new and improved fool
[20:20:22] <Dark_Shikari> GAH
[20:20:28] <Dark_Shikari> I hate it when I write an asm function that's 10 instructions shorter
[20:20:30] <Dark_Shikari> and is somehow not any faster
[20:21:22] <hyc> lol... that's OOE chips for you
[20:21:50] <hyc> and/or, you've hit a memory bandwidth limit
[20:23:06] <Dark_Shikari> or pinsrd just sucks
[20:28:15] <mru> or you suck :-)
[20:28:35] <Dark_Shikari> I think it's that pinsrd just sucks.
[20:36:53] <lu_zero> mru: I know
[20:37:21] <lu_zero> pinsrd?
[20:38:31] <Dark_Shikari> insert doubleword
[20:38:40] <Dark_Shikari> aka load 4 bytes into one of the four positions in a register
[20:38:57] <lu_zero> vector you mean
[20:39:04] <mru> same thing
[20:39:08] <Dark_Shikari> vector register :)
[20:39:17] <lu_zero> that  =)
[20:39:21] <lu_zero> uhmm
[20:39:30] <Dark_Shikari> e.g. this kind of code that keeps showing up
[20:39:31] <Dark_Shikari>     movd       xmm4, [r1+FDEC_STRIDE*0-4]
[20:39:31] <Dark_Shikari>     pinsrd     xmm4, [r1+FDEC_STRIDE*1-4], 1
[20:39:31] <Dark_Shikari>     pinsrd     xmm4, [r1+FDEC_STRIDE*2-4], 2
[20:39:32] <Dark_Shikari>     pinsrd     xmm4, [r1+FDEC_STRIDE*3-4], 3
[20:39:50] <Dark_Shikari> aka "load 4 rows of 4 bytes each from a strided array into this 16-byte register"
[20:39:55] <Dark_Shikari> aka "where is my scatter-gather load!!!"
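What the movd plus three pinsrd sequence computes can be written out as a plain gather. A sketch, with a byte string standing in for memory and a hypothetical helper name; the stride of 32 matches the value given for this array later in the log:

```python
def gather_rows(buf, base, stride, rows=4, width=4):
    """Collect `rows` runs of `width` bytes, one per strided row."""
    out = bytearray()
    for r in range(rows):           # one movd/pinsrd per row in the asm
        off = base + r * stride
        out += buf[off:off + width]
    return bytes(out)


memory = bytes(range(256))
# Rows start 4 bytes before r1 + stride*row, as in the snippet (r1 = 32 here).
packed = gather_rows(memory, 32 - 4, 32)
```

Each row costs its own load, which is the one-load-per-cycle limit being complained about; a hardware gather would have to fetch all four rows at once.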
[20:40:06] <lu_zero> ugh
[20:40:35] <mru> I already told you why scatter-load is hard
[20:40:44] <Dark_Shikari> I know -- you need more L1 load units
[20:40:46] <lu_zero> that could be done using load+permute
[20:40:51] <Dark_Shikari> lu_zero: stride of the array is 32
[20:40:56] <mru> you could also end up with multiple tlb misses
[20:41:02] <lu_zero> meh
[20:41:04] <Dark_Shikari> mru: that's not the problem
[20:41:11] <Dark_Shikari> TLB misses are fine--all you have to do is serialize it whenever one occurs.
[20:41:17] <lu_zero> spu load+permute
[20:41:22] <mru> Dark_Shikari: requires more hardware
[20:41:27] <Dark_Shikari> Any "hard problem" caused by having scatter/gather load can be solved by serializing if the hard problem occurs
[20:41:34] <Dark_Shikari> it doesn't require more hardware to not do something.
[20:41:54] <mru> it requires hardware to detect it and issue the sequence of ops
[20:41:55] <ohsix> does in a cpu
[20:42:15] <Dark_Shikari> well of course, but the point is I think we could probably get some special-cased improved L1 bandwidth in at least some cases.
[20:42:20] <Dark_Shikari> even if "special-cased" means
[20:42:26] <lu_zero> meh...
[20:42:30] <Dark_Shikari> "only if it is in L1, no TLB miss, doesn't cross cachelines"
[20:42:36] <Dark_Shikari> "and is aligned"
[20:42:56] <Dark_Shikari> the current one-load-per-cycle kinda sucks
[20:43:20] <Dark_Shikari> btw, what's with stuff like DSPs that do have scatter/gather load?
[20:43:27] <mru> they don't
[20:43:30] <mru> never heard of one
[20:43:38] <Dark_Shikari> then what has it?
[20:43:41] <mru> nothing
[20:43:44] <Dark_Shikari> or is it just a theoretical capability that doesn't exist?
[20:43:57] <mru> it's something everybody wants but nobody has
[20:43:57] <ohsix> dsps have loads with strides and offsets for packing stuff up, don't they?
[20:44:18] <mru> although a dsp typically has builtin L1 sram
[20:44:20] <mru> non-cache
[20:44:29] <Dark_Shikari> Wikipedia says many DMA engines have it
[20:44:33] <Dark_Shikari> e.g. for Cell SPUs
[20:44:34] <mru> so part of the problem does go away there
[20:44:42] <lu_zero> Dark_Shikari: uhm
[20:44:43] <Dark_Shikari> but that's a bit different.
[20:44:50] <mru> dma engines operate sequentially
[20:44:55] <lu_zero> cell spu has explicit manipulation
[20:45:03] <lu_zero> but isn't the same thing
[20:45:28] <Dark_Shikari> mru: oh wow
[20:45:29] <Dark_Shikari> http://www.patents.com/Microprocessor-high-speed-memory-integrated-loadstore-unit-efficiently-perform-scatter-gather-operat-7707393.html
[20:45:34] <Dark_Shikari> issued just a few weeks ago
[20:45:35] <Dark_Shikari> lol
[20:45:59] <Dark_Shikari> Broadcom
[20:46:08] <lu_zero> next mips for your pleasure
[20:46:22] <mru> not mips
[20:46:24] <Dark_Shikari> So it seems _someone_ does care.
[20:46:29] <mru> some other part of a bcm chip
[20:46:32] <Dark_Shikari> Cares enough to patent it.
[20:47:14] <lu_zero> Dark_Shikari: btw what did you want to achieve there?
[20:47:44] <lu_zero> you just need that part or the further ones would be needed as well?
[20:49:01] <lu_zero> I wonder if scatter-gather ops aren't considered much just because it's easier to add more registers or enlarge them
[20:49:54] <Dark_Shikari> adding more registers doesn't solve the problem of slow loads
[20:50:12] <Dark_Shikari> Hmm.  This would make for an interesting CISC machine
[20:50:22] <Dark_Shikari> an instruction that does strided gather loading -- but internally, maps to a normal load unit.
[20:50:43] <Dark_Shikari> to decrease code size
[20:52:07] <lu_zero> uhmm
[20:52:12] <iive> Dark_Shikari: in that example above... are you sure the bottleneck is not that you are using same register?
[20:52:30] <Dark_Shikari> iive: I interleaved two of them
[20:52:45] <Dark_Shikari> I omitted the second for brevity
[20:52:50] <Dark_Shikari> And, here's some irony for you
[20:52:51] <lu_zero> you are still doing loads
[20:52:52] <Dark_Shikari> I just deinterleaved them
[20:52:54] <Dark_Shikari> and it got faster
[20:53:24] * lu_zero wonders why
[20:53:46] <Dark_Shikari> because of ordering
[20:53:56] <Dark_Shikari> I'm guessing core i7 tracks dependencies internally so there's no cost to deinterleaving
[20:53:57] <lu_zero> those internally _must_ be load+mask or load+perm
[20:54:05] <Dark_Shikari> so deinterleaving allowed us to reorder the loads
[20:54:08] <Dark_Shikari> and get one of the registers finished faster
[20:54:11] <Dark_Shikari> and start arith ops faster
[20:54:15] <Dark_Shikari> because I was loading two registers, xmm1 and xmm4
[20:54:21] <Dark_Shikari> one of which was being used immediately (xmm4)
[20:54:26] <Dark_Shikari> the other of which wasn't used until about 10 instructions later
[20:54:34] <Dark_Shikari> so by letting it postpone the latter loads, it could start doing work faster.
[20:55:00] <lu_zero> basically the i7 is doing even more work behind our backs
[20:55:22] <lu_zero> that's what I hate about x86
[20:55:57] <lu_zero> ops should be a _bit_ more predictable, even dumber
[20:56:32] <Dark_Shikari> well imo if they can make the cpu smart without increasing cycle time
[20:56:34] <Dark_Shikari> they should feel free.
[20:57:07] <lu_zero> Dark_Shikari: and that makes your code slower since some assumptions get broken
[20:57:19] <Dark_Shikari> not really.
[20:57:45] <lu_zero> that's probably fine with exotic stuff like this load+mask/perm hybrid
[20:58:04] <lu_zero> but for plain loads would be quite depressing
[21:16:12] <BBB> mchinen: if he doesn't respond by tonight, ping the email, I'll see if I can get to him this weekend
[21:17:12] * BBB is confused because his vertical sixtap filter takes less cycles than his fourtap filter
[21:18:03] <Dark_Shikari> BBB: that's because it requires less munging
[21:18:07] <Dark_Shikari> it's normal for vert to be faster
[21:18:15] <BBB> ehm
[21:18:16] <BBB> no
[21:18:20] <BBB> both are vertical
[21:18:23] <Dark_Shikari> oh
[21:18:25] <Dark_Shikari> sixtap vs WHAT?
[21:18:29] <BBB> the vertical sixtap is faster than the vertical fourtap
[21:18:32] <Dark_Shikari> er... how are you timing it?
[21:18:39] <BBB> START/STOP_TIMER
[21:18:44] <Dark_Shikari> What if the 4-tap is used on chroma only
[21:18:47] <Dark_Shikari> which is more likely to have cache misses?
[21:18:53] <Dark_Shikari> You have to time them doing the same thing
[21:18:58] <BBB> hm...
[21:18:58] <Dark_Shikari> i.e. make the 4-tap call the 6-tap instead
[21:18:59] <BBB> good point
[21:19:02] <BBB> ok
[21:19:06] <Dark_Shikari> Of course, that's not really an issue
[21:19:07] <BBB> will do that after I make it bitexact
[21:19:08] <Dark_Shikari> you don't have to do that
[21:19:10] <BBB> it doesn't work yet ;)
[21:19:12] <Dark_Shikari> you _know_ the 4-tap is faster than 6-tap
[21:19:25] <Dark_Shikari> so you don't need to compare two different functions
[21:19:26] <BBB> well yeah it has less instructions and less mem accesses
[21:19:30] <Dark_Shikari> you compare different versions of the same function
[21:25:30] <Dark_Shikari> http://www.linuxfordevices.com/c/a/News/Avalue-EPIQM57/?kc=rss   hmm this is rather cool
[21:25:39] <Dark_Shikari> 18-watt TDP for a whole core i7 system
[21:26:03] <kierank> not bad at all
[21:27:59] <iive> 18W is just the cpu
[21:28:12] <Dark_Shikari> oh, true.  it seems they aren't counting the board
[21:28:14] <Dark_Shikari> but it's a small board.
[21:28:24] <Dark_Shikari> still, 18 watt tdp is really low
[21:29:40] <iive> they also don't say if it is idle or under load... "typical" is way too broad a term.
[21:30:12] <mru> which i7 is that?
[21:30:22] <Dark_Shikari> mru: one of the low power ones
[21:30:24] <iive> some mobiles.
[21:30:25] <Dark_Shikari> dual core i7
[21:30:38] <mru> the 9xx are >100W TDP...
[21:30:44] <Dark_Shikari> runs at 1ghz or so with turbo boost to 2ghz
[21:30:50] <iive> 620UE
[21:30:51] <Dark_Shikari> i.e. 2 cores at 1ghz or 1 core at 2ghz
[21:31:04] <Dark_Shikari> or something like that
[21:31:24] <BBB> vertical sixtap, almost 5x faster
[21:31:25] <BBB> \o/
[21:31:36] <Dark_Shikari> :)
[21:31:38] <Dark_Shikari> and it works?
[21:31:41] <BBB> yeah
[21:31:52] <Dark_Shikari> pastebin?
[21:32:22] <BBB> http://ffmpeg.pastebin.com/LYhUKGUi
[21:32:31] <BBB> the start is a little ugly
[21:33:15] <BBB> but saving 5 pixels + 1 cache + 1 for the coeffs leaves little arith space
[21:33:55] <BBB> hm, the comment for the last tap is wrong
[21:34:01] <Dark_Shikari> I thought you would get rid of the splats
[21:34:23] <BBB> oh yeah I didn't do that yet :-p
[21:34:45] <Dark_Shikari> and the redundant pxor -- that should be a globally kept zero
[21:34:54] <Dark_Shikari> imul r4,3 --> no, use an lea
[21:35:02] <Dark_Shikari> lea r4, [r4*3]
[21:35:16] <Dark_Shikari> I don't see the point of line 5.
[21:35:44] <Dark_Shikari> that can all go into the addressing.
[21:37:06] <iive> r4*2+r4 ?
[21:37:15] <Dark_Shikari> r4*3 is fine in yasm syntax
[21:37:22] <iive> or the macro takes care of that?
[21:37:32] <Dark_Shikari> yasm takes care of it
[21:37:39] <iive> oh, yasm itself.
[21:39:47] <BBB> Dark_Shikari: it's r4*6
[21:39:52] <BBB> Dark_Shikari: yasm didn't eat it when I tried
[21:40:03] <BBB> I think I tried r4*12 though
[21:40:12] <BBB> I thought you could only do 1, 2, 4 or 8
[21:40:51] <iive> BBB: yasm turns r4*3 into op containing r4*2+r4, that's what i asked above.
[21:41:19] <BBB> libavcodec/x86/vp8dsp.asm:206: error: invalid effective address
[21:41:21] <BBB> for r4*6
[21:41:30] <iive> of course...
[21:41:53] <iive> you can do 5, with (r4*4+r4)
[21:42:08] <Dark_Shikari> BBB: I said r4*3
[21:42:10] <Dark_Shikari> not r4*6
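The rule behind the "invalid effective address" error can be written down as a small C sketch. An x86 address is base + index*scale, with scale limited to 1, 2, 4 or 8, so using the same register as base and index also gives 3, 5 and 9, but never 6 or 12. The helper names lea_multiplier and times6 are hypothetical and exist only to illustrate the rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if reg*k fits in a single effective address, and reports
 * the scale plus whether the register is also needed as the base.
 * k = 3 becomes [reg + reg*2], k = 5 becomes [reg + reg*4], and so on.
 */
static bool lea_multiplier(unsigned k, unsigned *scale, bool *use_base)
{
    static const unsigned scales[4] = { 1, 2, 4, 8 };

    for (int i = 0; i < 4; i++) {
        if (k == scales[i]) {
            *scale = scales[i];
            *use_base = false;
            return true;
        }
        if (scales[i] > 1 && k == scales[i] + 1) {
            *scale = scales[i];
            *use_base = true;
            return true;
        }
    }
    return false;
}

/*
 * r4*6 needs two steps; one possible sequence is
 *     lea r4, [r4*3]
 *     add r4, r4
 * modelled here in C.
 */
static uintptr_t times6(uintptr_t r4)
{
    uintptr_t t = r4 + r4 * 2; /* lea r4, [r4*3] */
    return t + t;              /* add r4, r4     */
}
```

With this model, lea_multiplier accepts 3 and 5 and rejects 6 and 12, which matches what yasm reported.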
[21:42:11] <BBB> Dark_Shikari: I'm out of registers for the global zero, or do you mean a regular r%d register?
[21:42:20] <Dark_Shikari> BBB: no, you have one extra, because you saved mm7
[21:42:28] <BBB> mm7 is the coeffs
[21:42:32] <Dark_Shikari>  Which you're saving.
[21:42:36] <Dark_Shikari> Because you're turning it into memory.
[21:42:37] <BBB> I'll try :-p
[21:42:42] <Dark_Shikari> to eliminate pshufw
[21:42:44] <BBB> yes sir!
[21:43:05] <Dark_Shikari> remember it can do one load per cycle
[21:43:16] <Dark_Shikari> so it won't cost anything.
[21:43:21] <BBB> movd = one load?
[21:43:22] <Dark_Shikari> well, or at least less than the pshufw.
[21:43:25] <BBB> or a byte is one load?
[21:43:30] <Dark_Shikari> movd is one load
[21:43:31] <Dark_Shikari> mov is one load
[21:43:33] <Dark_Shikari> movq is one loa
[21:43:34] <Dark_Shikari> *load
[21:43:37] <Dark_Shikari> movdqa is one load
[21:43:41] <Dark_Shikari> blah X, [mem] is one load
[21:43:58] <BBB> ok, when I tested it was slower, but I'll test again
[21:44:48] * BBB goes home for now
[21:44:50] <BBB> this is fun
[21:44:52] <Dark_Shikari> pastebin it when you test it
[21:44:56] <Dark_Shikari> so I know you're doing it right
[21:45:05] <BBB> in the weekend, I will
[21:45:11] <BBB> and then we'll do sse/sse2
[21:45:22] <BBB> I don't need to care about sse right? just sse2?
[21:45:25] <BBB> or is there sse-only cpus?
[21:45:38] <Compn> a ton of athlons are sse (no sse2)
[21:45:53] <BBB> amd isn't paying me, so screw them
[21:45:54] <Dark_Shikari> sse1 is float only
[21:45:57] <Dark_Shikari> you don't care about sse1
[21:45:59] <BBB> ok
[21:46:01] <Dark_Shikari> pentium 3 is sse-only
[21:46:04] <BBB> sse2 it is then
[21:46:06] <Compn> like athlon 600mhz - 1.5 ghz or so, i think
[21:46:12] <Compn> oh yeah pentiums
[21:46:34] <iive> of course pentiums, they practically invented it :P
[21:47:28] <iive> and I don't think there is an athlon 600MHz that has sse1, they have mmx-ext (mmx2) but athlon XP was the first to have sse, and that was way above 1ghz
[21:50:25] <Compn> ah
[23:18:11] <CIA-98> ffmpeg: michael * r23645 /trunk/libavformat/raw.c:
[23:18:11] <CIA-98> ffmpeg: Improve h263_probe()
[23:18:11] <CIA-98> ffmpeg: Fixes issue2015

