Talk:CRC-32

From BeebWiki

Jump to: navigation, search

Contents

Optimisation of 6502 code

JGH version

J.G.Harston wrote this code in June 2011 after rewriting the CRC code in BBCZip. This takes 79 bytes and typically takes 0.56s to check 2.5K of data.

.crc32
LDY #0                       :\ Point to first byte
.bytelp
LDA (addr),Y:EOR crc:STA crc :\ EOR byte into CRC bottom byte
LDX #8                       :\ Prepare to rotate CRC 8 bits
CLC                          :\ Having CLC here instead of inside the
                             :\ loop means it is only executed once per
                             :\ byte instead of once per bit.
.rotlp
ROR crc+3:ROR crc+2          :\ Rotate CRC
ROR crc+1:ROR crc+0
BCC clear                    :\ b0 was zero
LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP ploynomic
LDA crc+2:EOR #&B8:STA crc+2
LDA crc+1:EOR #&83:STA crc+1
LDA crc+0:EOR #&20:STA crc+0
CLC                          :\ CLC ready for next bit
.clear
DEX:BNE rotlp                :\ Loop for 8 bits
INY:BNE next:INC ad+1        :\ Step to next byte
.next
LDA num+0:SBC #0:STA num+0   :\ num=num-1, taking advantage of Cy being 0
LDA num+1:SBC #0:STA num+1
ORA num:BNE bytelp           :\ Loop until num=0
RTS

Removed a CLC

MikeGreg Cook noticed that using LSR instead of ROR at the start of rotlp means that the loop doesn't have to be entered with Carry cleared, so the CLC just before .rotlp and .clear are not needed. The CLC at .clear is still needed, but moved after .clear for the SBC to subtract by one. However, the other CLC can be removed, saving one byte and speeding up from typically 0.56s to 0.55s to check 2.5K of data in 78 bytes of code.

.crc32
LDY #0                       :\ Point to first byte
.bytelp
LDA (addr),Y:EOR crc:STA crc :\ EOR byte into CRC bottom byte
LDX #8                       :\ Prepare to rotate CRC 8 bits
.rotlp
LSR crc+3:ROR crc+2          :\ Rotate CRC clearing bit 31
ROR crc+1:ROR crc+0
BCC clear                    :\ b0 was zero
LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP ploynomic
LDA crc+2:EOR #&B8:STA crc+2
LDA crc+1:EOR #&83:STA crc+1
LDA crc+0:EOR #&20:STA crc+0
.clear
DEX:BNE rotlp                :\ Loop for 8 bits
INY:BNE next:INC addr+1      :\ Step to next byte
.next
CLC                          :\ Clear Cy to borrow
LDA num+0:SBC #0:STA num+0   :\ num=num-1
LDA num+1:SBC #0:STA num+1
ORA num:BNE bytelp           :\ Loop until num=0
RTS

Hold part of CRC in A

Mike'sGreg's next optimisation was to realise that using the little-used (zp,X) addressing mode to read the bytes from memory releases the Y register and, with some jiggling, the A register. So, part of the CRC can be kept in the A register most of the time. This reduced the code to 65 bytes and speeds it up to typically 0.51s to check 2.5K of data.

.crc32
.bytelp
LDA crc+0                    :\ Load CRC bottom byte
LDX #8                       :\ Prepare to rotate CRC 8 bits
EOR (addr-8 AND &FF,X)       :\ EOR byte into CRC bottom byte
.rotlp
LSR crc+3:ROR crc+2          :\ Rotate CRC clearing bit 31
ROR crc+1:ROR A
BCC clear                    :\ b0 was zero
TAY                          :\ Hold CRC low byte in Y for a moment
LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP polynomic
LDA crc+2:EOR #&B8:STA crc+2
LDA crc+1:EOR #&83:STA crc+1
TYA:EOR #&20                 :\ Getting CRC low byte back from Y
.clear
DEX:BNE rotlp                :\ Loop for 8 bits
INC addr:BNE next:INC addr+1 :\ Step to next byte
.next
STA crc+0                    :\ Store CRC bottom byte
LDA num+0:SBC #0:STA num+0   :\ num=num-1
LDA num+1:SBC #0:STA num+1
ORA num:BNE bytelp           :\ Loop until num=0
RTS

DEC instead of SBC for counter

MikeGreg suggested the following code that uses DECs instead of SBCs for the loop counter. However, I'm fairly sure that DEC num+0:BNE bytelp:DEC num+1:BNE bytelp does not work. It's not a reflection of INC low/INC high as you're wrapping when you go from 1 to 0, not when you go from 0 to -1. Testing with CRC32 BASIC test program confirms this.

.crc32
LDA num:BNE load             :\ Increment bytes of num
INC num+1                    :\ so that Z=1 from DEC means borrow
.load
LDA crc+0                    :\ Load CRC bottom byte
.bytelp
LDX #8                       :\ Prepare to rotate CRC 8 bits
EOR (addr-8 AND &FF,X)       :\ EOR byte into CRC bottom byte
.rotlp
LSR crc+3:ROR crc+2          :\ Rotate CRC clearing bit 31
ROR crc+1:ROR A
BCC clear                    :\ b0 was zero
TAY
LDA crc+3:EOR #&ED:STA crc+3 :\ CRC=CRC EOR &EDB88320, ZIP polynomic
LDA crc+2:EOR #&B8:STA crc+2
LDA crc+1:EOR #&83:STA crc+1
TYA:EOR #&20
.clear
DEX:BNE rotlp                :\ Loop for 8 bits
INC addr:BNE next:INC addr+1 :\ Step to next byte
.next
DEC num+0:BNE bytelp         :\ Loop until num=0
DEC num+1:BNE bytelp
STA crc+0                    :\ Store CRC bottom byte
RTS

Obelisk 6502 Algorithms show how to do a 16-bit decrement, which gives the following:

.next
STA crc+0                    :\ Store CRC bottom byte
                             :\ Now do a 16-dit decrement
LDA num+0:BNE nextlo         :\ If num.lo<>0, now wrap, jump to decrement it
DEC num+1                    :\ Wrapping from &xx00 to &xxFF, so decrement high byte
.nextlo
DEC num+0:BNE bytelp         :\ Decrement num.lo, loop if not zero
LDA num+1:BNE bytelp         :\ Loop until num=0
RTS

This decreases typical speed down to 0.49s in 63 bytes of code.

In 1983, Mike Cook was churning out "Body Building" articles for the Micro user among other electronic wizardry. Meanwhile Greg Cook, the author of the above optimisations, was far too young to wield a soldering iron or for that matter, grow facial hair.
Thanks for your kind annotation. I was unaware of the resources at Obelisk. – beardo 17:33, 20 June 2011 (UTC)
Edit: I've now tested the original optimisation (first snippet in this section) and it works after the following modification: for
.crc32
LDA num:BNE load
read
.crc32
LDA num+0:BEQ load
Where the value of num is not used in the loop (as here), we are free to transpose the byte values as we wish. In a loop of the WHILE...ENDWHILE type, we would like DEC to indicate a 'borrow' on the original byte value, and a cascade of DECs to flag passage through zero so that, as the decrement precedes the test, an initial count of zero causes an immediate exit. By pre-incrementing each byte of num, this can be achieved.
Here though, crc32 performs a FOR...NEXT-type loop where passage through 1 terminates the loop (and num = 0 results in the greatest number of iterations.) In this case the low byte of num is left as-is, and the high byte incremented unless the low byte is zero. In my submitted optimisation the correction is achieved by the first three instructions, corrected to the following:
.crc32
LDA num+0:BEQ load           :\ Increment bytes of num
INC num+1                    :\ so that Z=1 from DEC means borrow
.load
[...]
beardo 22:01, 20 June 2011 (UTC)