Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).

Most of these tricks come from [Jeff's GB Assembly Code Tips v1.0](http://www.devrs.com/gb/files/asmtips.txt), [WikiTI's Z80 Optimization page](http://wikiti.brandonw.net/index.php?title=Z80_Optimization), and [z80 Heaven's optimization tutorial](http://z80-heaven.wikidot.com/optimization). (Note that Z80 assembly is *not* the same as GBZ80; it has more registers and some different instructions.)

WikiTI's advice fully applies here:

> Note that the following tricks act much like a [peephole optimizer](https://en.wikipedia.org/wiki/Peephole_optimization) and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
>
> Also note that nearly every trick turns the code less understandable and documenting them is a good idea. You can easily forgot after a while without reading parts of the code.
>
> Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.
>
> There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an [instruction set summary](https://rednex.github.io/rgbds/gbz80.7.html) helps to visualize what you can do during coding.

Cycle counts on this page are in machine cycles (M-cycles). Each M-cycle is four ticks of the CPU clock.
## Contents

- [8-bit registers](#8-bit-registers)
  - [Set `a` to 0](#set-a-to-0)
  - [Increment or decrement `a`](#increment-or-decrement-a)
  - [Invert the bits of `a`](#invert-the-bits-of-a)
  - [Rotate the bits of `a`](#rotate-the-bits-of-a)
  - [Set `a` to some constant minus `a`](#set-a-to-some-constant-minus-a)
  - [Set `a` to one constant or another depending on the carry flag](#set-a-to-one-constant-or-another-depending-on-the-carry-flag)
  - [Shift `a` right by 3 bits](#shift-a-right-by-3-bits)
  - [Set `a` to some value plus carry](#set-a-to-some-value-plus-carry)
  - [Load from HRAM to `a` or from `a` to HRAM](#load-from-hram-to-a-or-from-a-to-hram)
- [16-bit registers](#16-bit-registers)
  - [Multiply `hl` by 2](#multiply-hl-by-2)
  - [Add `a` to a 16-bit register](#add-a-to-a-16-bit-register)
  - [Add `a` to an address](#add-a-to-an-address)
  - [Increment or decrement a 16-bit register](#increment-or-decrement-a-16-bit-register)
  - [Load from an address to `hl`](#load-from-an-address-to-hl)
  - [Exchange two 16-bit registers](#exchange-two-16-bit-registers)
  - [Load two constants into a register pair](#load-two-constants-into-a-register-pair)
  - [Load a constant into `[hl]`](#load-a-constant-into-hl)
  - [Increment or decrement `[hl]`](#increment-or-decrement-hl)
  - [Load a constant into `[hl]` and increment or decrement `hl`](#load-a-constant-into-hl-and-increment-or-decrement-hl)
- [Branching (control flow)](#branching-control-flow)
  - [Relative jumps](#relative-jumps)
  - [Compare `a` to 0](#compare-a-to-0)
  - [Compare `a` to 1](#compare-a-to-1)
  - [Compare `a` to 255](#compare-a-to-255)
  - [Compare `a` to 0 after masking it](#compare-a-to-0-after-masking-it)
- [Subroutines (functions)](#subroutines-functions)
  - [Tail call optimization](#tail-call-optimization)
  - [Call `hl`](#call-hl)
  - [Inlining](#inlining)
  - [Fallthrough](#fallthrough)
  - [Conditional fallthrough](#conditional-fallthrough)
  - [Call `rst $38` depending on a flag](#call-rst-38-depending-on-a-flag)
- [Jump and lookup tables](#jump-and-lookup-tables)
  - [Chain comparisons](#chain-comparisons)

## 8-bit registers

### Set `a` to 0

Don't do:

```asm
	ld a, 0 ; 2 bytes, 2 cycles, no changes to flags
```

But do:

```asm
	xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

Or do:

```asm
	sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

Don't use the optimized versions if you need to preserve flags. For example, `ld a, 0` must be left intact in the code below:

```asm
	ld a, [wIsTrainerBattle]
	and a ; sets flag Z if [wIsTrainerBattle] is 0
	ld a, 0 ; sets a to 0 without affecting Z
	jr nz, .is_trainer_battle
```

### Increment or decrement `a`

When possible, avoid doing:

```asm
	add 1 ; 2 bytes, 2 cycles; sets carry when a overflows from $ff to 0
```

```asm
	sub 1 ; 2 bytes, 2 cycles; sets carry when a underflows from 0 to $ff
```

If you don't need the carry flag, then do:

```asm
	inc a ; 1 byte, 1 cycle
```

```asm
	dec a ; 1 byte, 1 cycle
```

(`inc a` and `dec a` still set the zero flag, but leave the carry flag unchanged.)

### Invert the bits of `a`

Don't do:

```asm
	xor $ff ; 2 bytes, 2 cycles
```

But do:

```asm
	cpl ; 1 byte, 1 cycle
```

(Unlike `xor $ff`, `cpl` does not update the zero flag.)

### Rotate the bits of `a`

Don't do:

```asm
	rl a ; 2 bytes, 2 cycles
```

```asm
	rlc a ; 2 bytes, 2 cycles
```

```asm
	rr a ; 2 bytes, 2 cycles
```

```asm
	rrc a ; 2 bytes, 2 cycles
```

But do:

```asm
	rla ; 1 byte, 1 cycle
```

```asm
	rlca ; 1 byte, 1 cycle
```

```asm
	rra ; 1 byte, 1 cycle
```

```asm
	rrca ; 1 byte, 1 cycle
```

The exception is when you need the zero flag to show whether the result in `a` is 0. The two-byte operations set `z` accordingly. The one-byte operations always reset it.

### Set `a` to some constant minus `a`

Don't do:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, FOOBAR
	sub b
```

But do:

```asm
	; 3 bytes, 3 cycles
	cpl
	add FOOBAR + 1
```

("What's [foobar](https://en.wikipedia.org/wiki/Foobar)?")

### Set `a` to one constant or another depending on the carry flag

(The example sets `a` to `FOO` if the carry flag is set (`c`), or to `BAR` if the carry flag is not set (`nc`).)
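Every branchless version in this section starts with `sbc a`, which subtracts `a` and then the carry flag C from `a`. The two copies of `a` cancel, so the result depends only on C. It is a mask of `$ff` (255) or 0. For the `and FOO - BAR` / `add BAR` version, the remaining steps work out as follows (all arithmetic on `a` is modulo 256):

$$
\texttt{sbc a}:\quad a - a - C \equiv -C \equiv \begin{cases} 255 & \text{if } C = 1 \\ 0 & \text{if } C = 0 \end{cases} \pmod{256}
$$

$$
C = 1:\ \bigl(255 \mathbin{\&} (\mathrm{FOO} - \mathrm{BAR})\bigr) + \mathrm{BAR} \equiv \mathrm{FOO}
\qquad
C = 0:\ \bigl(0 \mathbin{\&} (\mathrm{FOO} - \mathrm{BAR})\bigr) + \mathrm{BAR} = \mathrm{BAR}
$$

The `xor` version follows the same pattern, because `FOO ^ BAR ^ BAR` is `FOO`.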
Don't do:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, FOO
	jr c, .carry
	ld a, BAR
.carry
```

And don't do:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, BAR
	jr nc, .no_carry
	ld a, FOO
.no_carry
```

And if either is 0, don't do:

```asm
	; 5 bytes, 5 cycles
	ld a, FOO ; ld a, BAR if FOO is 0
	jr c, .carry ; jr nc if FOO is 0
	xor a
.carry
```

But do:

```asm
	; 5 bytes, 5 cycles
	sbc a ; if carry, then $ff, else 0
	and FOO - BAR ; $ff becomes FOO - BAR, 0 stays 0
	add BAR ; FOO - BAR becomes FOO, 0 becomes BAR
```

Or do:

```asm
	; 5 bytes, 5 cycles
	sbc a ; if carry, then $ff, else 0
	and FOO ^ BAR ; $ff becomes FOO ^ BAR, 0 stays 0
	xor BAR ; FOO ^ BAR becomes FOO, 0 becomes BAR
```

If `FOO` is 0 (i.e. set `a` to 0 if carry), then do:

```asm
	; 4 bytes, 4 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff
	and BAR ; 0 stays 0, $ff becomes BAR
```

If `BAR` is 0 (i.e. set `a` to 0 if not carry), then do:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff, else 0
	and FOO ; $ff becomes FOO, 0 stays 0
```

If `FOO` equals `BAR - 1`, then do:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	add BAR ; -1 becomes BAR - 1 aka FOO, 0 becomes BAR
```

If `FOO` equals `BAR - 2`, then do:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff aka -1, else 0; doesn't change the carry flag
	sbc -BAR ; -1 becomes BAR - 2 aka FOO, 0 becomes BAR
```

If `FOO` is 0 and `BAR` is 1 (i.e. set `a` to 0 if carry, or 1 if not carry), then do:

```asm
	; 2 bytes, 2 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	inc a ; -1 becomes 0, 0 becomes 1
```

### Shift `a` right by 3 bits

Don't do:

```asm
	; 6 bytes, 6 cycles
	srl a
	srl a
	srl a
```

But do:

```asm
	; 5 bytes, 5 cycles
	rrca
	rrca
	rrca
	and %00011111
```

### Set `a` to some value plus carry

(The example uses `b` and `c`, but any registers besides `a` would also work, including `[hl]`.)
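The shortest versions in this section depend on `b` already holding a copy of the original `a`. Write C for the carry flag and work modulo 256. Then `adc c` followed by `sub b` cancels that copy:

$$
\underbrace{a + c + C}_{\texttt{adc c}} \; - \underbrace{b}_{\texttt{sub b}} \;\equiv\; c + C \pmod{256} \qquad \text{since } b = a
$$

Replacing `c` with a constant `N` gives N + C in the same way. Only the value in `a` matches the longer versions. The carry flag left by `sub b` can differ from the one left by `adc 0`.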
Don't do:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	adc 0
```

And don't do:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	adc c
```

But do:

```asm
	; 3 bytes, 3 cycles
	ld b, a
	adc c
	sub b
```

Also, don't do:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	adc 0
```

And don't do:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	adc N
```

But do:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	adc N
	sub b
```

### Load from HRAM to `a` or from `a` to HRAM

Don't do:

```asm
	ld a, [hFoo] ; 3 bytes, 4 cycles
```

```asm
	ld [hFoo], a ; 3 bytes, 4 cycles
```

But do:

```asm
	ldh a, [hFoo] ; 2 bytes, 3 cycles
```

```asm
	ldh [hFoo], a ; 2 bytes, 3 cycles
```

(`ldh` only reaches addresses `$ff00` to `$ffff`, which covers HRAM and the hardware registers.)

## 16-bit registers

### Multiply `hl` by 2

Don't do:

```asm
	; 4 bytes, 4 cycles
	sla l
	rl h
```

But do:

```asm
	add hl, hl ; 1 byte, 2 cycles
```

### Add `a` to a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, 0
	adc h
	ld h, a
```

And don't do:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, h
	adc 0
	ld h, a
```

But do:

```asm
	; 5 bytes, 5 cycles
	add l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry:
```

Or, to avoid the branch and the extra label, do:

```asm
	; 5 bytes, 5 cycles
	add l
	ld l, a
	adc h
	sub l
	ld h, a
```

Or, if you can spare another 16-bit register, save a byte:

```asm
	; 4 bytes, 5 cycles
	ld d, 0
	ld e, a
	add hl, de
```

### Add `a` to an address

(The example uses `hl`, but `bc` or `de` would also work.)
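The version that doesn't use `de` relies on the same cancellation as the `adc h` / `sub l` trick. Write C for the carry out of the first `add` (0 or 1), and `a` for the value before that `add`. Then, byte by byte:

$$
l = \bigl(a + \mathrm{LOW}(\mathrm{Address})\bigr) - 256\,C
$$

$$
h \equiv \bigl(l + \mathrm{HIGH}(\mathrm{Address}) + C\bigr) - l = \mathrm{HIGH}(\mathrm{Address}) + C \pmod{256}
$$

$$
256\,h + l \equiv 256\,\mathrm{HIGH}(\mathrm{Address}) + \mathrm{LOW}(\mathrm{Address}) + a = \mathrm{Address} + a \pmod{2^{16}}
$$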
Don't do:

```asm
	; 7 bytes, 8 cycles; uses another 16-bit register
	ld e, a
	ld d, 0
	ld hl, Address
	add hl, de
```

But do:

```asm
	; 7 bytes, 7 cycles
	add LOW(Address)
	ld l, a
	adc HIGH(Address)
	sub l
	ld h, a
```

### Increment or decrement a 16-bit register

When possible, avoid doing:

```asm
	inc hl ; 1 byte, 2 cycles
```

```asm
	dec hl ; 1 byte, 2 cycles
```

If the low byte *definitely* won't overflow (or underflow), then do:

```asm
	inc l ; 1 byte, 1 cycle
```

```asm
	dec l ; 1 byte, 1 cycle
```

(Unlike `inc hl` and `dec hl`, these also update the Z, N, and H flags.)

This applies, for instance, when you read a data table via `hl` one byte at a time, the table is no more than 256 bytes long, and it sits in its own `SECTION` aligned to a 256-byte boundary (`ALIGN[8]`). It's unlikely to apply to pokecrystal's existing systems.

### Load from an address to `hl`

Don't do:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address]
	ld l, a
	ld a, [Address+1]
	ld h, a
```

But do:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address
	ld a, [hli]
	ld h, [hl]
	ld l, a
```

If the value is stored big-endian (high byte first), don't do:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address]
	ld h, a
	ld a, [Address+1]
	ld l, a
```

But do:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address + 1
	ld a, [hld]
	ld h, [hl]
	ld l, a
```

### Exchange two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)

If you care about speed (and can spare `a`):

```asm
	; 6 bytes, 6 cycles
	ld a, d
	ld d, h
	ld h, a
	ld a, e
	ld e, l
	ld l, a
```

If you care about size:

```asm
	; 4 bytes, 9 cycles
	push de
	ld d, h
	ld e, l
	pop hl
```

### Load two constants into a register pair

(The example uses `bc`, but `hl` or `de` would also work.)
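`ld bc, n16` puts the high byte of `n16` in `b` and the low byte in `c`. `(FOO << 8) | BAR` places `FOO` in the high byte and `BAR` in the low byte, so packing works as long as each constant is between 0 and 255:

$$
256 \cdot \mathrm{FOO} + \mathrm{BAR} \;\Rightarrow\; b = \mathrm{FOO},\quad c = \mathrm{BAR} \qquad (0 \le \mathrm{FOO}, \mathrm{BAR} \le 255)
$$

For example, `FOO` = 3 and `BAR` = 7 give `$0307`.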
Don't do:

```asm
	; 4 bytes, 4 cycles
	ld b, FOO
	ld c, BAR
```

But do:

```asm
	ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
```

Or better, use the `lb` macro in [macros/code.asm](../blob/master/macros/code.asm):

```asm
	lb bc, FOO, BAR ; 3 bytes, 3 cycles
```

### Load a constant into `[hl]`

Don't do:

```asm
	; 3 bytes, 4 cycles
	ld a, FOOBAR
	ld [hl], a
```

But do:

```asm
	ld [hl], FOOBAR ; 2 bytes, 3 cycles
```

### Increment or decrement `[hl]`

Don't do:

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	inc a
	ld [hl], a
```

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	dec a
	ld [hl], a
```

But do:

```asm
	inc [hl] ; 1 byte, 3 cycles
```

```asm
	dec [hl] ; 1 byte, 3 cycles
```

### Load a constant into `[hl]` and increment or decrement `hl`

(There is no `ld [hli], n8` instruction, so this only works for a value that is already in `a`.)

Don't do:

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	inc hl
```

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	dec hl
```

But do:

```asm
	ld [hli], a ; 1 byte, 2 cycles
```

```asm
	ld [hld], a ; 1 byte, 2 cycles
```

## Branching (control flow)

### Relative jumps

Don't do:

```asm
	jp Somewhere ; 3 bytes, 4 cycles
```

But do:

```asm
	jr Somewhere ; 2 bytes, 3 cycles
```

This only applies if `Somewhere` is in range. `jr` can reach from 128 bytes before to 127 bytes after the end of the `jr` instruction.

### Compare `a` to 0

Don't do:

```asm
	cp 0 ; 2 bytes, 2 cycles
```

But do:

```asm
	or a ; 1 byte, 1 cycle
```

Or do:

```asm
	and a ; 1 byte, 1 cycle
```

### Compare `a` to 1

Don't do:

```asm
	cp 1 ; 2 bytes, 2 cycles
```

If you don't care about the value in `a`, do:

```asm
	dec a ; 1 byte, 1 cycle, decrements a
```

Unlike `cp 1`, `dec a` leaves the carry flag unchanged, so rely only on the zero flag. If you do need `a`, you can restore it with `inc a` after the jump. (If the jump is taken, you already know `a` was 1.) This is the same speed when the jump isn't taken and one cycle faster when it is. Compare:

```asm
	cp 1
	jr z, .equals1
```

with:

```asm
	dec a
	jr z, .equals1
	inc a
```

### Compare `a` to 255

(255, or `$ff` in hexadecimal, is the same as −1 due to [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement).)

Don't do:

```asm
	cp $ff ; 2 bytes, 2 cycles
```

If you don't care about the value in `a`, do:

```asm
	inc a ; 1 byte, 1 cycle, increments a
```

Again, rely only on the zero flag. If you do need `a`, you can restore it with `dec a` after the jump. (If the jump is taken, you already know `a` was 255.) This is the same speed when the jump isn't taken and one cycle faster when it is.
Compare:

```asm
	cp $ff
	jr z, .equals255
```

with:

```asm
	inc a
	jr z, .equals255
	dec a
```

### Compare `a` to 0 after masking it

Don't do:

```asm
	; 3 bytes, 3 cycles; sets zero flag if a == 0
	and MASK
	and a
```

But do:

```asm
	and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0
```

## Subroutines (functions)

### Tail call optimization

Don't do:

```asm
	; 4 bytes, 10 cycles
	call Function
	ret
```

But do:

```asm
	jp Function ; 3 bytes, 4 cycles
```

`Function`'s own `ret` then returns directly to your caller.

### Call `hl`

Don't do:

```asm
	; 5 bytes, 8 cycles
	(some code)
	ld de, .return
	push de
	jp hl
.return:
	(some more code)
```

But do:

```asm
	; 3 bytes, 6 cycles
	; (4 bytes, 7 cycles, counting the definition of _hl_)
	(some code)
	call _hl_
	(some more code)
```

This also leaves `de` untouched. `_hl_` is a routine already defined in [home/call_regs.asm](../blob/master/home/call_regs.asm):

```asm
_hl_::
	jp hl
```

### Inlining

Don't do:

```asm
	; 4 additional bytes, 10 additional cycles
	call GetOffset
	...

GetOffset:
	(some code)
	ret
```

if `GetOffset` is only called from a handful of places. Instead, do:

```asm
	; GetOffset
	(some code)
```

You can set `(some code)` apart with blank lines and put a comment on top to make its self-contained nature clear without the extra `call` and `ret`. Keep in mind that every inlined copy duplicates `(some code)`, so the more places it's used, the more ROM inlining costs.

### Fallthrough

Don't do:

```asm
	...
	call Function
	ret

Function:
	(some code)
	ret
```

And don't do:

```asm
	...
	jp Function

Function:
	(some code)
	ret
```

But do:

```asm
	...
	; fallthrough
Function:
	(some code)
	ret
```

You can still `call Function` elsewhere, but one tail call can be optimized into a fallthrough.

### Conditional fallthrough

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do:

```asm
	; jumps: 4 bytes, 3 or 5 cycles
	(some code)
	jr z, .foo
	jr .bar
.foo
	(foo code)
.bar
	(bar code)
```

But do:

```asm
	; jumps: 2 bytes, 2 or 3 cycles
	(some code)
	jr nz, .bar
	; fallthrough
.foo
	(foo code)
.bar
	(bar code)
```

### Call `rst $38` depending on a flag

(The example uses `z`, but `nz`, `c`, or `nc` would also work. Cycle counts are given first without the `rst` running, then with it.)

Don't do:

```asm
	; 5 bytes, 3 or 14 cycles
	call z, RstVector38
	...

RstVector38:
	rst $38
	ret
```

And don't do:

```asm
	; 3 bytes, 3 or 6 cycles
	jr nz, .no_rst_38
	rst $38
.no_rst_38
	...
```

But do:

```asm
	; 2 bytes, 2 or 7 cycles
	jr z, @ + 1 ; the byte at @ + 1 is this jr's own offset, $ff, which is the opcode for rst $38
	...
```

The `rst` pushes the address right after the `jr`, so execution resumes there. Compared with the `jr nz` version, this saves a byte but costs one extra cycle when the `rst` runs.

## Jump and lookup tables

### Chain comparisons

Don't do:

```asm
	cp 1
	jr z, .equals1
	cp 2
	jr z, .equals2
	cp 3
	jr z, .equals3
	...
```

But do:

```asm
	dec a
	jr z, .equals1
	dec a
	jr z, .equals2
	dec a
	jr z, .equals3
	...
```

Or do:

```asm
	dec a
	ld hl, .jumptable
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	jp hl

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```

Or better, do:

```asm
	dec a
	ld hl, .jumptable
	rst JumpTable
	...

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```
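In the manual jump-table version, each `dw` entry is two bytes long, which is why it adds `de` to `hl` twice. After `dec a`, an original value of at least 1 selects the entry at:

$$
\texttt{.jumptable} + 2\,(a - 1)
$$

So 1 reads `.equals1`, 2 reads `.equals2`, and so on. An original value of 0 wraps to 255 after `dec a` and reads 510 bytes past `.jumptable`. Values of 0, and values larger than the number of entries, therefore need to be ruled out before the jump.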