As Jester suggests, as a first optimisation just repeat the LDA, AND, STA and DEY eight times. Eliminate the CPY and BNE. That'll save 103 cycles immediately. Even if you want to keep the formal loop, notice that DEY sets the zero flag so you don't need the CPY.
As a second optimisation, consider a compiled sprite. Instead of performing the read from sprite, x, you'd have those values coded directly into your routine as immediate operands, making a distinct routine for each sprite. An immediate AND takes 2 cycles against 4 for the indexed one, so over eight rows that'd cut another 16 cycles.
That being said, your LDA would be 4 cycles in an aligned table, not 3, so there are 8 cycles you haven't accounted for. That means unrolled plus specialised to your sprite comes to 102 cycles (having omitted the final DEY).
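The 102-cycle figure can be reproduced with a small build-time generator. This is only a sketch: it assumes the tables are indexed with y and that y is decremented between rows, the sprite bytes are made-up illustration values, and the cycle counts are the documented 6502 timings with no page crossing on the indexed read.

```python
# Build-time sketch: emit an unrolled, sprite-specific ("compiled") routine.
# Assumption: TILESET and SUPERIMPOSED are indexed with y, stepping y down
# between rows. Sprite bytes are listed in the order the routine visits them.

# Documented 6502 cycle counts (indexed read without a page crossing).
CYCLES = {"LDA abs,y": 4, "AND #imm": 2, "STA abs,y": 5, "DEY": 2}


def compile_sprite(sprite_bytes):
    """Return (assembly_lines, total_cycles) for one sprite's bytes."""
    lines, cycles = [], 0
    last = len(sprite_bytes) - 1
    for row, value in enumerate(sprite_bytes):
        lines += [
            "    LDA TILESET, y",
            f"    AND #${value:02X}",      # sprite byte baked in as an immediate
            "    STA SUPERIMPOSED, y",
        ]
        cycles += CYCLES["LDA abs,y"] + CYCLES["AND #imm"] + CYCLES["STA abs,y"]
        if row != last:                    # the final DEY is not needed
            lines.append("    DEY")
            cycles += CYCLES["DEY"]
    return lines, cycles


# Hypothetical 8-row sprite, purely for illustration.
example_sprite = [0xFF, 0xE7, 0xC3, 0x81, 0x81, 0xC3, 0xE7, 0xFF]
asm, total = compile_sprite(example_sprite)
print("\n".join(asm))
print("; cycles:", total)
```

Running it once per sprite at build time gives one routine per sprite; the tally is eight rows at 11 cycles plus seven DEYs at 2 cycles.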
Without knowing the C64 architecture and/or what the rest of your code does: if whoever ingests SUPERIMPOSED can do so from the stack page, consider writing output to the stack rather than via indexed addressing. Just load s with an appropriate seed value and store new results via PHA. That'll save two cycles per store at the cost of 12 additional cycles of setup and restore.
Following on from that thought, if you have freedom in how these tables look then consider switching their format: instead of one table that holds all eight bytes of each TILESET entry, use eight tables, each of which holds one of those bytes. That'd remove the need to adjust y in the loop; just use a different target table in each unrolled iteration.
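If the existing data is stored tile by tile, the conversion to eight tables could be done once at build time. A minimal sketch, assuming a tile-major layout of eight consecutive bytes per tile; the function name and the sample data are hypothetical:

```python
def split_into_row_tables(tileset, rows=8):
    """Turn tile-major data (tile 0 rows 0..7, tile 1 rows 0..7, ...)
    into one table per row."""
    if len(tileset) % rows:
        raise ValueError("tileset length must be a multiple of the row count")
    # Table r holds byte r of every tile, so table r indexed by the tile
    # number gives row r of that tile.
    return [bytes(tileset[r::rows]) for r in range(rows)]


# Hypothetical data: 3 tiles whose bytes encode (tile, row) for easy checking.
tiles = bytes(t * 16 + r for t in range(3) for r in range(8))
tables = split_into_row_tables(tiles)
for r, table in enumerate(tables):
    print(f"TILESET{r + 1}:", " ".join(f"${b:02X}" for b in table))
```

Each printed line corresponds to one of the eight tables, and the tile number alone serves as the index into all of them.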
Supposing both TILESET and SUPERIMPOSED can be eight tables, that gets you down to:
LDA TILESET1, x
AND #<value>
STA SUPERIMPOSED1, x ; * 8
[... LDA TILESET2, x ...]
... which is a total of 88 cycles. If SUPERIMPOSED is linear but in the stack page then:
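TSX
TXA ; a = original stack pointer
LDX #newdest
TXS ; s = top of the output area (PHA fills it downwards)
TAX ; x = original stack pointer; setup adds 10
LDA TILESET1, y
AND #<value>
PHA ; * 8
[... LDA TILESET2, y ...]
TXS ; restore s; adds 2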
... which is 84 cycles.
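Both totals can be checked mechanically. The sketch below only sums documented 6502 cycle counts for the two listings above, assuming no page crossings on the indexed reads; the list and function names are illustrative.

```python
# Documented 6502 cycle counts; indexed reads assume no page crossing.
CYCLES = {
    "LDA abs,x": 4, "LDA abs,y": 4, "AND #imm": 2, "STA abs,x": 5,
    "PHA": 3, "TSX": 2, "TXA": 2, "LDX #imm": 2, "TXS": 2, "TAX": 2,
}


def total(instructions):
    """Sum the cycle cost of a straight-line instruction sequence."""
    return sum(CYCLES[name] for name in instructions)


ROWS = 8

# Eight-table version: LDA / AND / STA repeated once per row.
eight_tables = ["LDA abs,x", "AND #imm", "STA abs,x"] * ROWS

# Stack-page version: setup, LDA / AND / PHA per row, then restore s.
stack_setup = ["TSX", "TXA", "LDX #imm", "TXS", "TAX"]
stack_body = ["LDA abs,y", "AND #imm", "PHA"] * ROWS
stack_restore = ["TXS"]

print("eight tables:", total(eight_tables))
print("stack page:  ", total(stack_setup + stack_body + stack_restore))
```

The stack version pays 12 cycles of setup and restore but saves 2 cycles on each of the eight stores, which is where the net gain of 4 cycles comes from.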
Late addition:
If you're willing to premultiply the index in x by 8, effectively reducing your indexable range to 32 tiles, then you can proceed filling a linear output array without adjusting y, as per:
LDA TILESET, x
AND #<value1>
STA SUPERIMPOSED, x
LDA TILESET+1, x
AND #<value2>
STA SUPERIMPOSED+1, x
... etc ...
So you'd still need eight copies of that routine, each with different table base addresses, to be able to hit 256 output tiles. Supposing you have 20 sprites, that makes a total of 20*8 = 160 copies of your sprite plotting routine, each of which is likely to be of the order of 100 bytes, so you're spending about 16 KB.
If your game is much heavier on one kind of sprite than on others — e.g. it's usually two or three spaceships shooting thousands of bullets at each other — then obviously you can optimise very selectively and keep that total footprint down.
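To make the premultiplied scheme concrete, here is a sketch of how a tile number might map to one of the eight routine copies plus a value for x, together with the footprint estimate. The names are hypothetical, and the 100 bytes per routine is the rough figure quoted above rather than a measured size.

```python
TILE_BYTES = 8                          # bytes per tile
TILES_PER_COPY = 256 // TILE_BYTES      # 32 tiles reachable with an 8-bit x


def locate(tile):
    """Return (copy, x) for a tile number 0..255.

    copy selects which routine variant to call (assumed table base:
    TILESET + copy * 256); x is the premultiplied index to load.
    """
    if not 0 <= tile <= 255:
        raise ValueError("tile number must fit in one byte")
    copy, slot = divmod(tile, TILES_PER_COPY)
    return copy, slot * TILE_BYTES


def footprint(sprites, bytes_per_routine=100):
    """Return (routine_count, total_bytes) for fully specialised routines."""
    copies = 256 // TILES_PER_COPY      # 8 table-base variants per sprite
    routines = sprites * copies
    return routines, routines * bytes_per_routine


for tile in (0, 31, 32, 255):
    print(tile, locate(tile))
print(footprint(20))
```

Because x never exceeds 248, the +7 offset in the last row stays inside the same 256-byte page.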
x for indexing but y for loop? – Charissa