[FFmpeg-devel] a64 encoder 7th round

Michael Niedermayer michaelni
Wed Jan 28 22:28:23 CET 2009


On Tue, Jan 27, 2009 at 10:04:47PM +0100, Bitbreaker/METALVOTZE wrote:
> 
> > so you claim that copying 256 chars is faster than copying none?
> > if not, a system that chooses per frame whether it reuses the last chars or
> > loads new ones would be better; a fixed 4-frame pattern is hardly optimal.
> >   
> Okay, seems like i have to explain more in detail :-)
> The multicolor displayer raises an interrupt every 2 frames and switches
> nothing more than a bit in a single register (a kind of charmap pointer of
> the video chip) and thus advances to a different charmap (0x400 in size).
> After 4 charmaps have been displayed this way, the interrupt routine unlocks
> the loader to load the next charset + 4 screens into the current buffer
> (in 0x400-byte chunks/packets via network), but switches to the previously
> loaded charset and screens beforehand. So it is more or less double
> buffering with 4 preloaded frames.
> In your suggested scenario i have to take care of the worst-case
> scenario/crossover point, have to know beforehand how many bytes i need to
> load, and have to take care of frame sizes. This all sounds trivial, but:
> Odd frame sizes bloat the loader, as do varying frame sizes, because i need
> additional checks. Loading can easily get 50% slower in that case (5
> additional cycles on top of 11 cycles per byte, or on top of just 8 cycles
> if using generated speedcode), so the crossover point ends up rather low,
> as handling a charset delta consumes even more cycles. As the worst case is
> as slow as loading plain frames, there is no gain: framerate and quality do
> not improve, but it adds a lot of complexity to the displayer (which is
> already rather big, as it needs to drive the network chip and
> handle packets). So all i'd do is save disk space, but a 500MB mpeg
> file shrinks to a ~50MB .a64 file at the moment, so not too much of a
> waste :-)
> The 6502 is just a very scarce platform, offering only 56 different
> instructions, not all of them even orthogonal. I have only three 8-bit
> registers, as well as only an 8-bit data bus + a 16-bit address bus.
> There is no multiply or divide instruction. So concepts that work out
> fine on today's machines often have to be done in a completely
> different way on such machines, often by keeping it plain and simple,
> or by doing some fake that appears to do the same ;-) I invested quite
> some time in finding the appropriate display methods, i did the first
> conversion prototypes years ago, and discussed a lot with other
> c64 scene members to work out the modes i have implemented so far. As
> for doing things on a c64, i can look back to the year 1988, when i made
> my first attempts on that machine. So things on the c64 side should already
> be rather optimal, but of course the codecs themselves may still have lots
> of potential for (speed/quality) improvements. Saving size so far does
> not bring any improvement, except when i can reduce the frame size in every case.

This is somewhat similar to MPEG's VBV.
MPEG decoders also have strict limits on what buffer size and bitrate they can
handle.
You can compare pure I-frame MPEG-1 against I+P-frame MPEG-1 with the same
bitrate and buffer constraints.
Surely the addition of P frames does not help in every case; one can always
just encode unrelated frames. But where it does help, the smaller P frames
free up space and bitrate that can be used by something else to improve
quality ...
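
To make the VBV comparison concrete, here is a minimal sketch in C of a
leaky-bucket check. It is a simplified model and not the exact MPEG
definition: the buffer gains a fixed number of bytes per frame period, each
frame must be completely buffered when it is due, and the buffer never holds
more than its size. The name vbv_first_underflow and all numbers in the note
below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified VBV-style check (an illustration, not the MPEG spec).
 * Returns the index of the first frame that is not fully buffered
 * when it is due, or -1 if the whole sequence fits.
 */
int vbv_first_underflow(const int *frame_bytes, size_t n_frames,
                        int bytes_per_frame_period, int buffer_size,
                        int initial_fullness)
{
    int fullness = initial_fullness;

    assert(buffer_size > 0 && bytes_per_frame_period >= 0);
    assert(initial_fullness >= 0 && initial_fullness <= buffer_size);

    for (size_t i = 0; i < n_frames; i++) {
        if (frame_bytes[i] > fullness)
            return (int)i;                  /* frame not loaded in time */
        fullness -= frame_bytes[i];         /* frame is consumed */
        fullness += bytes_per_frame_period; /* loader keeps filling */
        if (fullness > buffer_size)
            fullness = buffer_size;         /* loader has to wait */
    }
    return -1;
}
```

With a 2000-byte buffer that starts full and gains 1000 bytes per frame
period, a run of 1500-byte frames fails at the third frame, while alternating
1500- and 500-byte frames passes: the small frames pay for the big ones.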

[...]

> > with this limitation a pure multicolor encoder should do the following:
> >    for each frame, try all fixed color triplets out of 16; that is
> >    560 full-frame encodes, which isn't going to be terribly fast, but it
> >    should be easy to skip some of these triplets.
> >    for each block, try all of the 8 colors and then, from the 4 chosen
> >    colors, select per-pixel colors with error diffusion dither, choosing
> >    the best block by sum of absolute differences in the DCT domain.
> I have that special table color_mixes that tells my code (not
> multicolor) which colors are a good idea to mix (no matter if by
> interlacing or dithering) and which colors are definitely a no-go. There
> are quite a lot of ugly combinations, and some colors really clash
> terribly in PAL.

Whichever way it looks, it's not going to emit gamma radiation but rather a
picture made of colors.
If this picture were the input, then the use of those specific colors would be
optimal, because that is how it is supposed to look.
This very strongly points towards something being wrong with how you
choose the colors.
You simply have to simulate how something will look and compare that against
the input picture with a good comparison function.
If PAL's filtering and modulation are significant, they should be considered
as well.
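
A minimal sketch of "simulate and compare" in C, applied to the search over
fixed color triplets. Several things are assumptions for illustration only:
the comparison is plain squared RGB error (a stand-in for a better function),
every pixel simply gets the nearest allowed color (no dithering, no per-block
fourth color, no PAL filtering), and the names Rgb, simulate_error and
best_triplet are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b; } Rgb;

/* Plain squared RGB distance; a stand-in for a better comparison function. */
static int64_t dist2(Rgb a, Rgb b)
{
    int64_t dr = (int64_t)a.r - b.r;
    int64_t dg = (int64_t)a.g - b.g;
    int64_t db = (int64_t)a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

/* Error of the picture when every pixel is shown with the nearest of the
 * three allowed palette entries (the "simulate and compare" step). */
static int64_t simulate_error(const Rgb *px, size_t n, const Rgb *pal,
                              const int allowed[3])
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t best = INT64_MAX;
        for (int k = 0; k < 3; k++) {
            int64_t d = dist2(px[i], pal[allowed[k]]);
            if (d < best)
                best = d;
        }
        total += best;
    }
    return total;
}

/* Tries all triplets a<b<c out of a 16-entry palette, stores the best one
 * in best[] and returns how many triplets were tried. */
int best_triplet(const Rgb *px, size_t n, const Rgb *pal, int best[3])
{
    int64_t best_err = INT64_MAX;
    int tried = 0;

    assert(px && pal && best);
    for (int a = 0; a < 16; a++)
        for (int b = a + 1; b < 16; b++)
            for (int c = b + 1; c < 16; c++) {
                const int cand[3] = { a, b, c };
                int64_t err = simulate_error(px, n, pal, cand);
                if (err < best_err) {
                    best_err = err;
                    best[0] = a;
                    best[1] = b;
                    best[2] = c;
                }
                tried++;
            }
    return tried;
}
```

The return value is the number of full encodes, which is where the 560 comes
from. Skipping triplets early would mean leaving the inner loops as soon as a
partial error already exceeds best_err.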


> Also having one color being changed each block (while 
> all others stay the same) leads to a blocky result, i mean it, as in, i 
> know it, as in i tried it, not only in that case, but also with several 
> converters for plain graphics for the c64. There are quite some tricks 
> to avoid that, either by doing certain dithering tricks or by relying
> more on the luminance of a color than on its chrominance.
> By the way i am doing a kind of similar thing in the ecmh mode as you 
> described above, more or less a bruteforce attempt with some exclusions 
> to speed up things to a reasonable time. I find out the best 
> backgroundcolors by adding them incrementally, then find the best 2 
> backgroundcolors + colorram for each  8x8 block, as well as the best two 
> chars for that.

> Oh, and as for dithering: Pixels look really big, the 320x200 are
> displayed on a 14" monitor with a composite (FBAS)/S-Video input. Error
> diffusion is not the method of choice, except in some very rare cases, like
> when you display 320x200 with interlaced colors. What you use instead is
> ordered dither with certain patterns, plus some antialiasing techniques to
> improve quality. See here for an example:
> http://noname.c64.org/csdb/release/viewpic.php?id=11585&zoom=1
> Even doing a kind of dithering with certain shapes is common, as
> the clouds in this pic show:
> http://noname.c64.org/csdb/release/viewpic.php?id=25333&zoom=1

I am not sure what these links are supposed to prove; they aren't comparing
error diffusion against ordered dither.
Rather, you could look at:
http://www.acs.brockport.edu/~dgusev/Halftone/
which does.
I've also scaled it up so that the pixels are huge; error diffusion still looks
vastly better than ordered dither.

If it does not on the C64, that might be due to not considering cross-block errors.
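
For reference, a minimal sketch of error diffusion that does carry the error
across block borders: the error rows span the whole frame, so nothing is
reset at an 8x8 boundary. This is plain Floyd-Steinberg on an 8-bit gray
picture with a black/white output, which is an assumption made to keep the
example short; the C64 palette and the per-block color limits are not
modelled, and diffuse_frame is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Floyd-Steinberg error diffusion of an 8-bit gray picture to black/white
 * (0 or 255). Returns 0 on success, -1 if memory runs out.
 */
int diffuse_frame(const uint8_t *src, uint8_t *dst, int w, int h)
{
    int *cur, *next;

    assert(src && dst && w > 0 && h > 0);
    /* One error row for the current line, one for the line below;
     * index x + 1 belongs to pixel x, so both ends have a spare slot. */
    cur  = calloc((size_t)w + 2, sizeof *cur);
    next = calloc((size_t)w + 2, sizeof *next);
    if (!cur || !next) {
        free(cur);
        free(next);
        return -1;
    }

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int want = src[y * w + x] + cur[x + 1];
            int out  = want >= 128 ? 255 : 0;
            int err  = want - out;
            int e7 = err * 7 / 16, e3 = err * 3 / 16, e5 = err * 5 / 16;

            dst[y * w + x] = (uint8_t)out;
            cur[x + 2]  += e7;                  /* right */
            next[x]     += e3;                  /* below left */
            next[x + 1] += e5;                  /* below */
            next[x + 2] += err - e7 - e3 - e5;  /* below right, the rest */
        }
        /* The line below becomes the current line. */
        int *tmp = cur;
        cur = next;
        next = tmp;
        memset(next, 0, ((size_t)w + 2) * sizeof *next);
    }
    free(cur);
    free(next);
    return 0;
}
```

A per-block variant would clear cur and next at every block border, and that
is exactly the case where the error of one block can no longer be compensated
in its neighbour.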


> 
> >> Making things 
> >> colorful gets really hard then and usually the result looks very blocky.
> >>     
> >
> > did you try above? :)
> >   
> I know that even with fewer restrictions it already looks ugly; that is
> how you can easily tell hand-drawn/retouched pics from plain
> converted pics :-) And having even less color choice won't be very
> helpful either. Also, the colors 0..7 are not the colors you need
> most. E.g. brown, orange, pink and the gray tones are all in the
> upper range 8..15. So it is hard to get skin tones done, or to do a
> proper gray fade, without them. If you want to hurt your eyes i can
> calculate some pics using the lower range only, the limitation is
> easily done :-)
> So look again at http://www.metalvotze.de/content/videomodes2.php and
> look closer at the result of the ecmh mode: you can already see some of
> those blocky artifacts at the shoulder, which result from a lack of colors
> per block.
> > uint8_t color[3];
> > int count;
> > /* read_byte() and read_3bytes() stand for the stream readers */
> > for(;;){
> >     count= read_byte();
> >     read_3bytes(color);
> >     while(count--){
> >         *dst++= read_byte();
> >         *dst++= color[0];
> >         *dst++= color[1];
> >         *dst++= color[2];
> >     }
> > }
> >
> > this would be too slow?
> >   
> To read a byte from the network chip's packet buffer and store it to the
> correct position, where it is directly displayable by the video chip, i
> need 3 instructions if being lazy (there is of course some overhead for
> loop handling and fetching a new packet).
> The above code might look as follows in 6502 assembly (just hacked up quickly, not tested):
> 
> ldx $de00 ;count
> lda $de01 ;byte1
> sta buf
> lda $de00 ;byte2
> sta buf+1
> lda $de01 ;byte3
> sta buf+2
> ldy #$00
> ;27 cycles used till here
> more
> lda $de00 ;data is offered 16 bit wide from network chip
> sta dest,y
> iny
> lda buf
> sta dest,y
> iny
> lda buf+1
> sta dest,y
> iny
> lda buf+2
> sta dest,y
> iny
> ;41 more cycles
> dex
> beq out ; need to check for odd value of x
> ;4 more cycles if no branch
> lda $de01 ;fetch next byte (network chip offers next byte automatically
>           ;when both bytes were read, we are happy to have that feature)
> sta dest,y
> iny
> lda buf
> sta dest,y
> iny
> lda buf+1
> sta dest,y
> iny
> lda buf+2
> sta dest,y
> iny
> ;+41 cycles
> dex
> bne more ; need to check for even value of x
> ;5 more cycles, as we branch hopefully a few times.
> 
> = 118 cycles to load 6 bytes (will get a bit less of course if the loop 
> loops a few times, and i assumed already the buffer being in zeropage, 
> where we can save one cycle when doing lda/sta).
> 
> i can do:
> lda $de00
> sta dest
> lda $de01
> sta dest
> ...
> 
> that is 48 cycles for 6 bytes.
> 
> But more likely i'll do (as it is easier, and still fast enough):
> 
>    ldx #>dest
>    stx a1+2 ;set highbyte of dest in code
>    stx a2+2
>    stx a3+2
>    stx a4+2
>    ldx #$00 ;index is lowbyte of dest
> loop
>    lda $de00
> a1 sta $0000,x
>    inx
>    lda $de01
> a2 sta $0000,x
>    inx
>    lda $de00
> a3 sta $0000,x
>    inx
>    lda $de01
> a4 sta $0000,x
>    inx
>    bne loop
> 
> 47 cycles per loop + 22 cycles for setup
> 
> So overall, i tried many things, even RLE and such; it did not
> bring any improvement, not compared with the speed i can achieve with
> loading that simple.
> Convinced now? :-)

no, not at all, first you could use loops like:
(i hope my wild guesses for the ASM are understandable)

    ldx #>dest
    stx a1+2 ;set highbyte of dest in code
    stx a2+2
    stx a3+2
    stx a4+2
    ldx $de01
 loop
    lda $de00
 a1 sta $0000,x
 a2 sta $0004,x
 a3 sta $0008,x
 a4 sta $000C,x
    ldx $de01
    bne loop

to write 4 equal bytes, and a similar loop to write 4 different ones
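
In C terms, the loop above would behave roughly like the sketch below.
Assumptions here: the two chip registers are modelled as one flat byte stream
(index, value, index, value, ...), a zero index following a pair ends the
run, and page stands for the 256-byte page at dest plus 12 spare bytes for
the +4/+8/+12 stores. fill4_decode is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * C model of the "4 equal bytes" loop. src holds the bytes in the order
 * the 6502 would read them; page must have room for 256 + 12 bytes.
 * Returns the number of stream bytes consumed.
 */
size_t fill4_decode(const uint8_t *src, size_t len, uint8_t *page)
{
    size_t pos = 0;
    uint8_t x;

    assert(src && page && len >= 3);
    x = src[pos++];                 /*    ldx $de01      */
    do {
        uint8_t v;

        assert(pos + 2 <= len);     /* value and next index must exist */
        v = src[pos++];             /*    lda $de00      */
        page[x]      = v;           /* a1 sta $0000,x    */
        page[x + 4]  = v;           /* a2 sta $0004,x    */
        page[x + 8]  = v;           /* a3 sta $0008,x    */
        page[x + 12] = v;           /* a4 sta $000C,x    */
        x = src[pos++];             /*    ldx $de01      */
    } while (x != 0);               /*    bne loop       */
    return pos;
}
```

Each pass costs two stream bytes and produces four bytes on screen, which is
the point of the exercise: fewer bytes have to come over the network for the
same amount of displayed data.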

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If a bugfix only changes things apparently unrelated to the bug with no
further explanation, that is a good sign that the bugfix is wrong.