Talk:CRC-32
From BeebWiki
Contents |
Optimisation of 6502 code
JGH version
J.G.Harston wrote this code in June 2011 after rewriting the CRC code in BBCZip. This takes 79 bytes and typically takes 0.56s to check 2.5K of data.
.crc32
LDY #0 :\ Point to first byte
.bytelp
LDA (addr),Y:EOR crc:STA crc :\ EOR byte into CRC bottom byte
LDX #8 :\ Prepare to rotate CRC 8 bits
CLC :\ Having CLC here instead of inside the
:\ loop means it is only executed once per
:\ byte instead of once per bit.
.rotlp
ROR crc+3:ROR crc+2 :\ Rotate CRC
ROR crc+1:ROR crc+0
BCC clear :\ b0 was zero
LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP ploynomic
LDA crc+2:EOR #&B8:STA crc+2
LDA crc+1:EOR #&83:STA crc+1
LDA crc+0:EOR #&20:STA crc+0
CLC :\ CLC ready for next bit
.clear
DEX:BNE rotlp :\ Loop for 8 bits
INY:BNE next:INC ad+1 :\ Step to next byte
.next
LDA num+0:SBC #0:STA num+0 :\ num=num-1, taking advantage of Cy being 0
LDA num+1:SBC #0:STA num+1
ORA num:BNE bytelp :\ Loop until num=0
RTS
Removed a CLC
MikeGreg Cook noticed that using LSR instead of ROR at the start of rotlp means that the loop doesn't have to be entered with Carry cleared, so the CLC just before .rotlp and .clear are not needed. The CLC at .clear is still needed, but moved after .clear for the SBC to subtract by one. However, the other CLC can be removed, saving one byte and speeding up from typically 0.56s to 0.55s to check 2.5K of data in 78 bytes of code.
.crc32 LDY #0 :\ Point to first byte .bytelp LDA (addr),Y:EOR crc:STA crc :\ EOR byte into CRC bottom byte LDX #8 :\ Prepare to rotate CRC 8 bits .rotlp LSR crc+3:ROR crc+2 :\ Rotate CRC clearing bit 31 ROR crc+1:ROR crc+0 BCC clear :\ b0 was zero LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP ploynomic LDA crc+2:EOR #&B8:STA crc+2 LDA crc+1:EOR #&83:STA crc+1 LDA crc+0:EOR #&20:STA crc+0 .clear DEX:BNE rotlp :\ Loop for 8 bits INY:BNE next:INC addr+1 :\ Step to next byte .next CLC :\ Clear Cy to borrow LDA num+0:SBC #0:STA num+0 :\ num=num-1 LDA num+1:SBC #0:STA num+1 ORA num:BNE bytelp :\ Loop until num=0 RTS
Hold part of CRC in A
Mike'sGreg's next optimisation was to realise that using the little-used (zp,X) addressing mode to read the bytes from memory releases the Y register and, with some jiggling, the A register. So, part of the CRC can be kept in the A register most of the time. This reduced the code to 65 bytes and speeds it up to typically 0.51s to check 2.5K of data.
.crc32 .bytelp LDA crc+0 :\ Load CRC bottom byte LDX #8 :\ Prepare to rotate CRC 8 bits EOR (addr-8 AND &FF,X) :\ EOR byte into CRC bottom byte .rotlp LSR crc+3:ROR crc+2 :\ Rotate CRC clearing bit 31 ROR crc+1:ROR A BCC clear :\ b0 was zero TAY :\ Hold CRC low byte in Y for a moment LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP polynomic LDA crc+2:EOR #&B8:STA crc+2 LDA crc+1:EOR #&83:STA crc+1 TYA:EOR #&20 :\ Getting CRC low byte back from Y .clear DEX:BNE rotlp :\ Loop for 8 bits INC addr:BNE next:INC addr+1 :\ Step to next byte .next STA crc+0 :\ Store CRC bottom byte LDA num+0:SBC #0:STA num+0 :\ num=num-1 LDA num+1:SBC #0:STA num+1 ORA num:BNE bytelp :\ Loop until num=0 RTS
DEC instead of SBC for counter
MikeGreg suggested the following code that uses DECs instead of SBCs for the loop counter. However, I'm fairly sure that DEC num+0:BNE bytelp:DEC num+1:BNE bytelp does not work. It's not a reflection of INC low/INC high as you're wrapping when you go from 1 to 0, not when you go from 0 to -1. Testing with CRC32 BASIC test program confirms this.
.crc32 LDA num:BNE load :\ Increment bytes of num INC num+1 :\ so that Z=1 from DEC means borrow .load LDA crc+0 :\ Load CRC bottom byte .bytelp LDX #8 :\ Prepare to rotate CRC 8 bits EOR (addr-8 AND &FF,X) :\ EOR byte into CRC bottom byte .rotlp LSR crc+3:ROR crc+2 :\ Rotate CRC clearing bit 31 ROR crc+1:ROR A BCC clear :\ b0 was zero TAY LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP polynomic LDA crc+2:EOR #&B8:STA crc+2 LDA crc+1:EOR #&83:STA crc+1 TYA:EOR #&20 .clear DEX:BNE rotlp :\ Loop for 8 bits INC addr:BNE next:INC addr+1 :\ Step to next byte .next DEC num+0:BNE bytelp :\ Loop until num=0 DEC num+1:BNE bytelp STA crc+0 :\ Store CRC bottom byte RTS
Obelisk 6502 Algorithms show how to do a 16-bit decrement, which gives the following:
.next
STA crc+0 :\ Store CRC bottom byte
:\ Now do a 16-dit decrement
LDA num+0:BNE nextlo :\ If num.lo<>0, now wrap, jump to decrement it
DEC num+1 :\ Wrapping from &xx00 to &xxFF, so decrement high byte
.nextlo
DEC num+0:BNE bytelp :\ Decrement num.lo, loop if not zero
LDA num+1:BNE bytelp :\ Loop until num=0
RTS
This decreases typical speed down to 0.49s in 63 bytes of code.
- In 1983, Mike Cook was churning out "Body Building" articles for the Micro user among other electronic wizardry. Meanwhile Greg Cook, the author of the above optimisations, was far too young to wield a soldering iron or for that matter, grow facial hair.
- Thanks for your kind annotation. I was unaware of the resources at Obelisk. – beardo 17:33, 20 June 2011 (UTC)
- Edit: I've now tested the original optimisation (first snippet in this section) and it works after the following modification: for
.crc32 LDA num:BNE load
- read
.crc32 LDA num+0:BEQ load
- Where the value of num is not used in the loop (as here), we are free to transpose the byte values as we wish. In a loop of the WHILE...ENDWHILE type, we would like DEC to indicate a 'borrow' on the original byte value, and a cascade of DECs to flag passage through zero so that, as the decrement precedes the test, an initial count of zero causes an immediate exit. By pre-incrementing each byte of num, this can be achieved.
- Here though, crc32 performs a FOR...NEXT-type loop where passage through 1 terminates the loop (and num = 0 results in the greatest number of iterations.) In this case the low byte of num is left as-is, and the high byte incremented unless the low byte is zero. In my submitted optimisation the correction is achieved by the first three instructions, corrected to the following:
.crc32 LDA num+0:BEQ load :\ Increment bytes of num INC num+1 :\ so that Z=1 from DEC means borrow .load [...]
- – beardo 22:01, 20 June 2011 (UTC)

