[FFmpeg-devel] a64 encoder 7th round

Michael Niedermayer michaelni
Tue Feb 3 22:51:31 CET 2009


On Tue, Feb 03, 2009 at 08:28:10PM +0100, Bitbreaker/METALVOTZE wrote:
> Michael Niedermayer schrieb:
> > On Tue, Feb 03, 2009 at 03:22:21PM +0100, Bitbreaker/METALVOTZE wrote:
> >   
> >>> ldx $de00
> >>> lda lut+0,x
> >>> sta dest,y
> >>> lda lut+1,x
> >>> sta dest+64,y
> >>> lda lut+2,x
> >>> sta dest+128,y
> >>> lda lut+3,x
> >>> sta dest+192,y
> >>> iny
> >>>       
> >> Just tested something similar to reconstruct the "compressed" colorram. 
> >> However it spoils of course my option of linear writing and thus things 
> >> need to happen even faster as i write at the lower and upper end of the 
> >> colorram area at the same time. It works out tightly however when i 
> >> start 44 lines before screen ends. Writing endures until i enter the 
> >> upper area again, but ends luckily fast enough (4 lines) before the last 
> >> line of the first 0x100 block of colorram is displayed. So i have to 
> >> take care that i cross no 0x100 border codewise and indexwise, as that 
> >> would add extra cycles and thus trash display. I could however place the 
> >> LUT into zeropage where no extra cycles apply on those conditions. Would 
> >> make things more stable, but wastes 64 nice favourite places to store 
> >> values when running out of registers ;-)
> >
> > 19, not 64 (see my previous reply for the actual table);
> > with overlapping entries you need just 2^n + n - 1 bytes, not 2^n * n
> >   
> What I do on the codec side is pack 4 bits together, one from each $0100 block.
> 
> On the C64 I copy the 64-byte LUT to $0100 (this is the stack) because I
> was fed up with wasting so many cycles just reading a table. Then I can
> suddenly do:
> 
> ldy #$00
> tsx
> stx $40 ;save stack pointer
> ldx $de00
> txs
> pla
> sta $d800,y
> pla
> sta $d900,y
> pla
> sta $da00,y
> pla
> sta $db00,y
> iny
> 
> ldx $de01
> ...

Is this a fully unrolled loop?
Because if it is, I would expect the following to be faster (the index is
folded into the store address, so the ldy/iny and the indexed stores go away):

tsx
stx $40 ;save stack pointer
ldx $de00
txs
pla
sta $d800
pla
sta $d900
pla
sta $da00
pla
sta $db00

ldx $de01
txs
pla
sta $d801
pla
sta $d901
pla
sta $da01
pla
sta $db01
...


or the more obvious:

tsx
stx $40 ;save stack pointer
ldx $de00
txs
pla
sta $d800
pla
sta $d801
pla
sta $d802
pla
sta $d803

ldx $de01
txs
pla
sta $d804
pla
sta $d805
pla
sta $d806
pla
sta $d807
...
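
A rough tally of why dropping the index should help, counting only the
instructions that differ between the indexed version and the two unrolled
ones. The timings are the standard NMOS 6502 ones (sta abs,y = 5 cycles,
sta abs = 4, iny = 2). The Python names are made up for illustration, and
stalls caused by the VIC are ignored, so take it as a sketch:

```python
# Cycle counts for the instructions that differ between the variants.
# Assumption: plain NMOS 6502 timings, no VIC badline/DMA stalls counted.
STA_ABS_Y = 5  # sta $d800,y (an indexed store always takes the extra cycle)
STA_ABS = 4    # sta $d800
INY = 2        # iny


def saved_per_group(stores=4):
    """Cycles saved per group of `stores` writes by using absolute stores."""
    indexed = stores * STA_ABS_Y + INY   # sta abs,y ... iny
    unrolled = stores * STA_ABS          # sta abs, no index update needed
    return indexed - unrolled


print(saved_per_group())  # 6
```

Both store forms are 3 bytes long, so in code that is unrolled anyway this
costs no extra space.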

> 
> when finished, restore stack pointer
> 
> this allows me to save 2 more cycles per 4 byte lookup as i can just pull 
> data from stack within 3 cycles and even get the stackpointer 
> incremented for free by that. Too bad that stack area is fixed, else i'd 
> do it the other way round by pushing bytes on the stack.
> This is on the one hand dirty, but works fine, the real stackpointer and 
> data is far away from my LUT i placed in the stack, so no collisions 
> expected. My lookup consists of 16 entries each 4 bytes. The size does 
> not hurt, 6502 code grew anyway by 0x500 bytes by the latest 
> optimizations (twice the size now) ;-)
> 
> LUT looks like:
> 
> 8,8,8,8
> 8,8,8,9
> 8,8,9,8
> 8,8,9,9
> ...

8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8

19 elements.
Why is this better? 64 bytes don't matter ...
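
To check that the 19 elements really cover all 16 entries, here is a small
sketch that slides a 4-wide window over the table and records where each
bit pattern starts. It assumes 9 stands for a set bit and 8 for a clear
bit, most significant bit first. That bit order and the names are
assumptions for illustration only:

```python
# The 19-element overlapping table from above.
TABLE = [8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8]
N = 4  # bits per lookup


def window_offsets(table, n):
    """Map each n-bit pattern to the offset where its n entries start.

    Assumption: entry 9 means bit 1, entry 8 means bit 0, MSB first.
    """
    offsets = {}
    for start in range(len(table) - n + 1):
        bits = 0
        for value in table[start:start + n]:
            bits = (bits << 1) | (value - 8)
        offsets.setdefault(bits, start)
    return offsets


offsets = window_offsets(TABLE, N)
print(len(offsets))     # 16, every 4-bit pattern occurs
print(offsets[0b0011])  # 5, i.e. 8,8,9,9 starts at offset 5
```

The offsets are not the patterns themselves, so the stream would have to
carry the offset (0..15) instead of the raw 4 bits; that remapping can
happen at encode time.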

Why not do it with 5 bits instead of 4 (36 vs. 160 bytes)?
8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8

Or 7 (134 vs. 896). You don't have 896 bytes free, if I understood you
correctly, but the bigger table should be faster ...
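
The sizes can be checked with a short sketch that builds such an
overlapping table for any n. It uses the textbook Lyndon-word construction
of a binary de Bruijn sequence and appends the first n - 1 entries again so
the table can be read linearly. This is one possible way to generate it,
not necessarily how the tables above were made, and the function names are
hypothetical:

```python
def de_bruijn_bits(n):
    """Binary de Bruijn sequence of order n (standard Lyndon-word method)."""
    a = [0] * (n + 1)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def overlapping_table(n, zero=8, one=9):
    """Linear table of 2^n + n - 1 entries containing every n-bit window."""
    bits = de_bruijn_bits(n)
    bits = bits + bits[:n - 1]  # unroll the cyclic sequence
    return [one if b else zero for b in bits]


for n in (4, 5, 7):
    table = overlapping_table(n)
    windows = {tuple(table[i:i + n]) for i in range(len(table) - n + 1)}
    # overlapping size, plain size, number of distinct windows
    print(n, len(table), n * 2 ** n, len(windows))
```

For n = 4, 5 and 7 this gives 19, 36 and 134 entries against 64, 160 and
896 for the plain tables, with 16, 32 and 128 distinct windows.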

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato 