Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).

Most of these tricks come from either [Jeff's GB Assembly Code Tips v1.0](http://www.devrs.com/gb/files/asmtips.txt) or [WikiTI's Z80 Optimization page](http://wikiti.brandonw.net/index.php?title=Z80_Optimization). (Note that Z80 assembly is *not* the same as GBZ80; it has more registers and some different instructions.)

## Contents

- [Registers](#registers)
  - [Set `a` to 0](#set-a-to-0)
  - [Invert the bits of `a`](#invert-the-bits-of-a)
  - [Set `a` to some constant minus `a`](#set-a-to-some-constant-minus-a)
  - [Multiply `hl` by 2](#multiply-hl-by-2)
  - [Add `a` to a 16-bit register](#add-a-to-a-16-bit-register)
  - [Increment or decrement a 16-bit register](#increment-or-decrement-a-16-bit-register)
  - [Load from an address to `hl`](#load-from-an-address-to-hl)
  - [Exchange two 16-bit registers](#exchange-two-16-bit-registers)
  - [Load two constants into a register pair](#load-two-constants-into-a-register-pair)
  - [Load a constant into `[hl]`](#load-a-constant-into-hl)
  - [Load a constant into `[hl]` and incrementing or decrementing `hl`](#load-a-constant-into-hl-and-incrementing-or-decrementing-hl)
  - [Increment or decrement `[hl]`](#increment-or-decrement-hl)
- [Branching (control flow)](#branching-control-flow)
  - [Relative jumps](#relative-jumps)
  - [Compare `a` to 0](#compare-a-to-0)
  - [Compare `a` to 1](#compare-a-to-1)
  - [Compare `a` to 255](#compare-a-to-255)
- [Subroutines (functions)](#subroutines-functions)
  - [Tail call optimization](#tail-call-optimization)
  - [Call `hl`](#call-hl)
  - [Inlining](#inlining)
  - [Fallthrough](#fallthrough)
- [Jump and lookup tables](#jump-and-lookup-tables)
  - [Chain comparisons](#chain-comparisons)

## Registers

### Set `a` to 0

Don't do:

```asm
    ld a, 0 ; 2 bytes, 2 cycles, no changes to flags
```

But do:

```asm
    xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

or do:

```asm
    sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

Don't use the optimized versions if you need to preserve flags. As such, `ld a, 0` must be left intact in the code below:

```asm
    ld a, [wIsTrainerBattle]
    and a ; NZ if in trainer battle
    ld a, 0
    jr nz, .trainer
```

### Invert the bits of `a`

Don't do:

```asm
    xor $ff ; 2 bytes, 2 cycles
```

But do:

```asm
    cpl ; 1 byte, 1 cycle
```

### Set `a` to some constant minus `a`

Don't do:

```asm
; 4 bytes, 4 cycles
    ld b, a
    ld a, CONST
    sub b
```

But do:

```asm
; 3 bytes, 3 cycles
    cpl
    add CONST + 1
```

### Multiply `hl` by 2

Don't do:

```asm
; 4 bytes, 4 cycles
    sla l
    rl h
```

But do:

```asm
; 1 byte, 2 cycles
    add hl, hl
```

(The `SpeciesItemBoost` routine in [engine/battle/effect_commands.asm](../blob/master/engine/battle/effect_commands.asm) actually does this!)

### Add `a` to a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)


Don't do:

```asm
; 6 bytes, 6 cycles
    add l
    ld l, a
    ld a, 0
    adc h
    ld h, a
```

and don't do:

```asm
; 6 bytes, 6 cycles
    add l
    ld l, a
    ld a, h
    adc 0
    ld h, a
```

But do:

```asm
; 5 bytes, 5 cycles
    add l
    ld l, a
    jr nc, .no_carry
    inc h
.no_carry:
```

Or better, do:

```asm
; 5 bytes, 5 cycles
    add l
    ld l, a
    adc h
    sub l
    ld h, a
```

Or if you can spare another 16-bit register and want to optimize for size over speed, do:

```asm
; 4 bytes, 5 cycles
    ld d, 0
    ld e, a
    add hl, de
```

### Increment or decrement a 16-bit register

When possible, avoid doing:

```asm
    inc hl ; 1 byte, 2 cycles
```

```asm
    dec hl ; 1 byte, 2 cycles
```

If the low byte won't overflow (or underflow, for `dec`), then do:

```asm
    inc l ; 1 byte, 1 cycle
```

```asm
    dec l ; 1 byte, 1 cycle
```

Unlike `inc hl` and `dec hl`, these do change the Z, N, and H flags.

### Load from an address to `hl`

Don't do:

```asm
; 8 bytes, 10 cycles
    ld a, [Address]
    ld l, a
    ld a, [Address+1]
    ld h, a
```

But do:

```asm
; 6 bytes, 8 cycles
    ld hl, Address
    ld a, [hli]
    ld h, [hl]
    ld l, a
```

### Exchange two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)

If you care about speed:

```asm
; 6 bytes, 6 cycles
    ld a, d
    ld d, h
    ld h, a
    ld a, e
    ld e, l
    ld l, a
```

If you care about size:

```asm
; 4 bytes, 9 cycles
    push de
    ld d, h
    ld e, l
    pop hl
```

### Load two constants into a register pair

(The example uses `bc`, but `hl` or `de` would also work.)


Don't do:

```asm
; 4 bytes, 4 cycles
    ld b, ONE
    ld c, TWO
```

But do:

```asm
; 3 bytes, 3 cycles
    ld bc, ONE << 8 | TWO
```

Or better, use the `lb` macro in [macros/code.asm](../blob/master/macros/code.asm):

```asm
; 3 bytes, 3 cycles
    lb bc, ONE, TWO
```

### Load a constant into `[hl]`

Don't do:

```asm
; 3 bytes, 4 cycles
    ld a, CONST
    ld [hl], a
```

But do:

```asm
; 2 bytes, 3 cycles
    ld [hl], CONST
```

### Load a constant into `[hl]` and incrementing or decrementing `hl`

Don't do:

```asm
; 2 bytes, 4 cycles
    ld [hl], a
    inc hl
```

But do:

```asm
; 1 byte, 2 cycles
    ld [hli], a
```

And don't do:

```asm
; 2 bytes, 4 cycles
    ld [hl], a
    dec hl
```

But do:

```asm
; 1 byte, 2 cycles
    ld [hld], a
```

### Increment or decrement `[hl]`

Don't do:

```asm
; 3 bytes, 5 cycles
    ld a, [hl]
    inc a
    ld [hl], a
```

But do:

```asm
; 1 byte, 3 cycles
    inc [hl]
```

And don't do:

```asm
; 3 bytes, 5 cycles
    ld a, [hl]
    dec a
    ld [hl], a
```

But do:

```asm
; 1 byte, 3 cycles
    dec [hl]
```

## Branching (control flow)

### Relative jumps

Don't do:

```asm
    jp Somewhere ; 3 bytes, 4 cycles
```

But do:

```asm
    jr Somewhere ; 2 bytes, 3 cycles
```

This only works if `Somewhere` is within range of a signed 8-bit offset: −128 to +127 bytes, measured from the end of the `jr` instruction.

### Compare `a` to 0

Don't do:

```asm
    cp 0 ; 2 bytes, 2 cycles
```

But do:

```asm
    or a ; 1 byte, 1 cycle
```

or do:

```asm
    and a ; 1 byte, 1 cycle
```

### Compare `a` to 1

Don't do:

```asm
    cp 1 ; 2 bytes, 2 cycles
```

If you don't care about the value in `a`, do:

```asm
    dec a ; 1 byte, 1 cycle, decrements a
```

Note that you can still restore `a` with `inc a` afterwards. The result is the same size as the `cp 1` version and 1 cycle faster if the jump is taken (in which case `a` stays decremented). Compare:

```asm
    cp 1
    jr z, .equals1
```

with:

```asm
    dec a
    jr z, .equals1
    inc a
```

### Compare `a` to 255

(255, or $FF in hexadecimal, is the same as −1 due to [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement).)


Don't do:

```asm
    cp $ff ; 2 bytes, 2 cycles
```

If you don't care about the value in `a`, do:

```asm
    inc a ; 1 byte, 1 cycle, increments a
```

Note that you can still restore `a` with `dec a` afterwards. The result is the same size as the `cp $ff` version and 1 cycle faster if the jump is taken (in which case `a` stays incremented). Compare:

```asm
    cp $ff
    jr z, .equals255
```

with:

```asm
    inc a
    jr z, .equals255
    dec a
```

## Subroutines (functions)

### Tail call optimization

Don't do:

```asm
; 4 bytes, 10 cycles
    call Function
    ret
```

But do:

```asm
; 3 bytes, 4 cycles
    jp Function
```

### Call `hl`

Don't do:

```asm
; 5 bytes, 8 cycles
    ld de, .return
    push de
    jp hl
.return:
    ...
```

But do:

```asm
; 4 bytes (3 for the call, 1 for the shared _hl_ routine), 7 cycles
    call _hl_
    ; return
    ...
```

`_hl_` is a routine already defined in [home.asm](../blob/master/home.asm):

```asm
_hl_::
    jp hl
```

### Inlining

Don't do:

```asm
; 4 additional bytes, 10 additional cycles
    call GetOffset
    ...

GetOffset:
    (some code)
    ret
```

if `GetOffset` is only called a handful of times. Instead, do:

```asm
    ; GetOffset
    (some code)
```

You can set `(some code)` apart with blank lines and put a comment on top to make its self-contained nature clear without the extra `call` and `ret`.

### Fallthrough

Don't do:

```asm
    ...
    call Function
    ret

Function:
    ...
```

But do:

```asm
    ...
    ; fallthrough
Function:
    ...
```

You can still `call Function` elsewhere, but one tail call can be optimized into a fallthrough.

## Jump and lookup tables

### Chain comparisons

Don't do:

```asm
    cp 1
    jr z, .equals1
    cp 2
    jr z, .equals2
    cp 3
    jr z, .equals3
    ...
```

But do:

```asm
    dec a
    jr z, .equals1
    dec a
    jr z, .equals2
    dec a
    jr z, .equals3
    ...
```

Or do:

```asm
    dec a
    ld hl, .jumptable
    ld e, a
    ld d, 0
    add hl, de
    add hl, de
    ld a, [hli]
    ld h, [hl]
    ld l, a
    jp hl

.jumptable:
    dw .equals1
    dw .equals2
    dw .equals3
    ...
```

Or better, do:

```asm
    dec a
    ld hl, .jumptable
    rst JumpTable
    ...

.jumptable:
    dw .equals1
    dw .equals2
    dw .equals3
    ...
```
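
For context, `rst JumpTable` is a one-byte call to a shared dispatch routine placed at one of the fixed `rst` addresses. The sketch below shows roughly what such a routine could look like, assuming it takes the zero-based index in `a` and the table address in `hl`. It mirrors the inline version above and is only an illustration; the actual routine in the repository may differ.

```asm
; Hypothetical sketch of a shared dispatch routine.
; Assumed interface: a = zero-based index, hl = table of 2-byte pointers.
JumpTable::
    push de        ; preserve de for the caller
    ld e, a
    ld d, 0        ; de = index
    add hl, de
    add hl, de     ; hl = table + index * 2
    ld a, [hli]    ; low byte of the target address
    ld h, [hl]     ; high byte of the target address
    ld l, a        ; hl = target address
    pop de
    jp hl          ; jump to the selected entry
```

Because the lookup code lives in one place, each call site only pays for `ld hl, .jumptable` and the `rst`, which is where the size saving over the inline version comes from.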