Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).

Most of these tricks come from [Jeff's GB Assembly Code Tips v1.0](http://www.devrs.com/gb/files/asmtips.txt), [WikiTI's Z80 Optimization page](http://wikiti.brandonw.net/index.php?title=Z80_Optimization), [z80 Heaven's optimization tutorial](http://z80-heaven.wikidot.com/optimization), and [GBDev Wiki's ASM Snippets](https://gbdev.gg8.se/wiki/articles/ASM_Snippets). (Note that Z80 assembly is *not* the same as GBZ80; it has more registers and some different instructions.)

WikiTI's advice fully applies here:

> Note that the following tricks act much like a [peephole optimizer](https://en.wikipedia.org/wiki/Peephole_optimization) and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
>
> Also note that nearly every trick turns the code less understandable and documenting them is a good idea. You can easily forgot after a while without reading parts of the code.
>
> Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.
>
> There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an [instruction set summary](https://rednex.github.io/rgbds/gbz80.7.html) helps to visualize what you can do during coding.

(There's also a "cheat sheet" [table of instructions](https://gbdev.io/gb-opcodes//optables/classic) summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)
## Contents - [8-bit registers](#8-bit-registers) - [Set `a` to 0](#set-a-to-0) - [Increment or decrement `a`](#increment-or-decrement-a) - [Invert the bits of `a`](#invert-the-bits-of-a) - [Rotate the bits of `a`](#rotate-the-bits-of-a) - [Reverse the bits of `a`](#reverse-the-bits-of-a) - [Set `a` to some constant minus `a`](#set-a-to-some-constant-minus-a) - [Set `a` to one constant or another depending on the carry flag](#set-a-to-one-constant-or-another-depending-on-the-carry-flag) - [Increment or decrement `a` when the carry flag is set](#increment-or-decrement-a-when-the-carry-flag-is-set) - [Toggle `a` between two different constants](#toggle-a-between-two-different-constants) - [Divide `a` by 8 (shift `a` right 3 bits)](#divide-a-by-8-shift-a-right-3-bits) - [Divide `a` by 16 (shift `a` right 4 bits)](#divide-a-by-16-shift-a-right-4-bits) - [Set `a` to some value plus or minus carry](#set-a-to-some-value-plus-or-minus-carry) - [Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a) - [Load from HRAM to `a` or from `a` to HRAM](#load-from-hram-to-a-or-from-a-to-hram) - [16-bit registers](#16-bit-registers) - [Multiply `hl` by 2](#multiply-hl-by-2) - [Add `a` to a 16-bit register](#add-a-to-a-16-bit-register) - [Subtract an 8-bit constant from a 16-bit register](#subtract-an-8-bit-constant-from-a-16-bit-register) - [Set a 16-bit register to `a` plus a constant](#set-a-16-bit-register-to-a-plus-a-constant) - [Set a 16-bit register to `a` multiplied by 16](#set-a-16-bit-register-to-a-multiplied-by-16) - [Increment or decrement a 16-bit register](#increment-or-decrement-a-16-bit-register) - [Add or subtract the carry flag from a 16-bit register](#add-or-subtract-the-carry-flag-from-a-16-bit-register) - [Load from an address to `hl`](#load-from-an-address-to-hl) - [Load from an address to `sp`](#load-from-an-address-to-sp) - [Exchange two 16-bit registers](#exchange-two-16-bit-registers) - [Subtract 
two 16-bit registers](#subtract-two-16-bit-registers) - [Load two constants into a register pair](#load-two-constants-into-a-register-pair) - [Load a constant into `[hl]`](#load-a-constant-into-hl) - [Increment or decrement `[hl]`](#increment-or-decrement-hl) - [Load a constant into `[hl]` and increment or decrement `hl`](#load-a-constant-into-hl-and-increment-or-decrement-hl) - [Branching (control flow)](#branching-control-flow) - [Relative jumps](#relative-jumps) - [Compare `a` to 0](#compare-a-to-0) - [Compare `a` to 1](#compare-a-to-1) - [Compare `a` to 255](#compare-a-to-255) - [Compare `a` to 0 after masking it](#compare-a-to-0-after-masking-it) - [Compare `a` to a mask after masking it](#compare-a-to-a-mask-after-masking-it) - [Test whether `a` is negative (compare `a` to $80)](#test-whether-a-is-negative-compare-a-to-80) - [Subroutines (functions)](#subroutines-functions) - [Tail call optimization](#tail-call-optimization) - [Call `hl`](#call-hl) - [Inlining](#inlining) - [Fallthrough](#fallthrough) - [Conditional fallthrough](#conditional-fallthrough) - [Conditional return](#conditional-return) - [Conditional call](#conditional-call) - [Conditional `rst $38`](#conditional-rst-38) - [Enable interrupts and return](#enable-interrupts-and-return) - [Jump and lookup tables](#jump-and-lookup-tables) - [Chain comparisons](#chain-comparisons) - [Off-by-one `AddNTimes`](#off-by-one-addntimes) ## 8-bit registers ### Set `a` to 0 Don't do this: ```asm ld a, 0 ; 2 bytes, 2 cycles; no changes to flags ``` Instead, do this: ```asm xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1 ``` Or do this: ```asm sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1 ``` Don't use the optimized versions if you need to preserve flags. 
As such, `ld a, 0` must be left intact in the code below: ```asm ld a, [wIsTrainerBattle] and a ; sets zero flag if [wIsTrainerBattle] == 0 ld a, 0 ; sets a to 0 without affecting zero flag jr nz, .is_trainer_battle ; is not trainer battle ``` ### Increment or decrement `a` When possible, avoid doing this: ```asm add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow ``` ```asm sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow ``` If you don't need to set the carry flag, then do this: ```asm inc a ; 1 byte, 1 cycle ``` ```asm dec a ; 1 byte, 1 cycle ``` ### Invert the bits of `a` Don't do this: ```asm xor $ff ; 2 bytes, 2 cycles ``` Instead, do this: ```asm cpl ; 1 byte, 1 cycle ``` ### Rotate the bits of `a` Don't do this: ```asm rl a ; 2 bytes, 2 cycles; updates Z and C flags ``` ```asm rlc a ; 2 bytes, 2 cycles; updates Z and C flags ``` ```asm rr a ; 2 bytes, 2 cycles; updates Z and C flags ``` ```asm rrc a ; 2 bytes, 2 cycles; updates Z and C flags ``` Instead, do this: ```asm rla ; 1 byte, 1 cycle; updates C flag ``` ```asm rlca ; 1 byte, 1 cycle; updates C flag ``` ```asm rra ; 1 byte, 1 cycle; updates C flag ``` ```asm rrca ; 1 byte, 1 cycle; updates C flag ``` The exception is if you need to set the zero flag when the operation results in 0 for `a`; the two-byte operations can set `z`, the one-byte operations cannot. ### Reverse the bits of `a` (This optimization is based on [Retro Programming](http://www.retroprogramming.com/2014/01/fast-z80-bit-reversal.html)). (The example uses `b` and `c`, but any of `d`, `e`, `h`, or `l` would also work.) 
Don't do this:

```asm
	; 25 bytes, 25 cycles
rept 8
	rra ; nor rla
	rl b ; nor rr b
endr
	ld a, b
```

And don't do this:

```asm
	; 17 bytes, 17 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	rlca
	rlca
	rlca
	rrc b
	xor b
	and $66
	xor b
```

Instead, do this:

```asm
	; 15 bytes, 15 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	swap b
	xor b
	and $33
	xor b
	rrca
```

Or if you really want to optimize for size over speed, then don't do this:

```asm
	; 10 bytes, 59 cycles
	ld bc, 8 ; lb bc, 0, 8
.loop
	rra ; nor rla
	rl b ; nor rr b
	dec c
	jr nz, .loop
	ld a, b
```

Instead, do this:

```asm
	; 8 bytes, 50 cycles
	ld b, 1
.loop
	rra
	rl b
	jr nc, .loop
	ld a, b
```

Or if you really want to optimize for speed over size, then do this:

```asm
	; 6 bytes, 12 cycles
	; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
	push hl
	ld h, HIGH(ReversedBitTable)
	ld l, a
	ld a, [hl]
	pop hl
```

```asm
; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
for x, 256
	; http://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
	db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
endr
```

### Set `a` to some constant minus `a`

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, FOOBAR
	sub b
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	cpl
	add FOOBAR + 1
```

("What's [foobar](https://en.wikipedia.org/wiki/Foobar)?")

### Set `a` to one constant or another depending on the carry flag

(The example sets `a` to `CVAL` if the carry flag is set (`c`), or `NCVAL` if the carry flag is not set (`nc`).)
Don't do this:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, CVAL
	jr c, .carry
	ld a, NCVAL
.carry
```

And don't do this:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, NCVAL
	jr nc, .no_carry
	ld a, CVAL
.no_carry
```

And if either is 0, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld a, CVAL ; nor NCVAL
	jr c, .carry ; nor jr nc
	xor a
.carry
```

And if either is 1 more or less than the other, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld a, CVAL ; nor NCVAL
	jr c, .carry ; nor jr nc
	inc a ; nor dec a
.carry
```

Instead use `sbc a`, which copies the carry flag to all bits of `a`. So do this:

```asm
	; 5 bytes, 5 cycles
	sbc a ; if carry, then $ff, else 0
	and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
	add NCVAL ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL
```

Or do this:

```asm
	; 5 bytes, 5 cycles
	sbc a ; if carry, then $ff, else 0
	and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
	xor NCVAL ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL
```

And if certain conditions apply, then do something more efficient:
If `CVAL` == $FF (aka −1) and `NCVAL` == 0, then do this:

```asm
	; 1 byte, 1 cycle
	sbc a ; if carry, then $ff, else 0
```

If `CVAL` == 0 and `NCVAL` == $FF (aka −1), then do this:

```asm
	; 2 bytes, 2 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff
```

If `CVAL` == 0 and `NCVAL` == 1, then do this:

```asm
	; 2 bytes, 2 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	inc a ; -1 becomes 0, 0 becomes 1
```

If `CVAL` == $FF (aka −1), then do this:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff, else 0
	or NCVAL ; $ff stays $ff, $00 becomes NCVAL
```

If `NCVAL` == 0, then do this:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff, else 0
	and CVAL ; $ff becomes CVAL, 0 stays 0
```

If `CVAL` == `NCVAL - 1`, aka `CVAL + 1` == `NCVAL`, then do this:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL
```

If `CVAL` == `NCVAL - 2`, aka `CVAL + 2` == `NCVAL`, then do this:

```asm
	; 3 bytes, 3 cycles
	sbc a ; if carry, then $ff aka -1, else 0; doesn't change the carry flag
	sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL
```

If `CVAL` == 0, then do this:

```asm
	; 4 bytes, 4 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff
	and NCVAL ; 0 stays 0, $ff becomes NCVAL
```

If `NCVAL` == $FF (aka −1), then do this:

```asm
	; 4 bytes, 4 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff
	or CVAL ; $00 becomes CVAL, $ff stays $ff
```

If `CVAL` == `NCVAL + 1`, aka `CVAL - 1` == `NCVAL`, then do this:

```asm
	; 4 bytes, 4 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff aka -1
	add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL
```

If `CVAL` == `NCVAL + 2`, aka `CVAL - 2` == `NCVAL`, then do this:

```asm
	; 4 bytes, 4 cycles
	ccf ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff aka -1; doesn't change the carry flag
	sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL
```
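To see the general `sbc a` form at work, here is a minimal sketch with concrete numbers. The values `CVAL` = 5 and `NCVAL` = 9, the routine name `SelectByCarry`, and the section name are all hypothetical, chosen only so that none of the special cases apply; the `DEF` keyword assumes a recent RGBDS version.

```asm
DEF CVAL EQU 5 ; hypothetical value, for illustration only
DEF NCVAL EQU 9 ; hypothetical value, for illustration only

SECTION "SelectByCarry example", ROM0

; Input: carry flag
; Output: a = CVAL if carry was set, NCVAL if it was not
SelectByCarry:
	sbc a ; carry set: a = $ff; carry clear: a = $00
	and CVAL ^ NCVAL ; carry set: a = $ff & $0c = $0c; carry clear: a = $00
	xor NCVAL ; carry set: a = $0c ^ $09 = $05; carry clear: a = $00 ^ $09 = $09
	ret
```

Tracing both paths by hand like this is a cheap way to check a chosen pair of constants before relying on the trick.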
### Increment or decrement `a` when the carry flag is set Don't do this: ```asm ; 3 bytes, 3 cycles jr nc, .ok inc a .ok ``` ```asm ; 3 bytes, 3 cycles jr nc, .ok dec a .ok ``` Instead, do this: ```asm adc 0 ; 2 bytes, 2 cycles ``` ```asm sbc 0 ; 2 bytes, 2 cycles ``` ### Toggle `a` between two different constants Don't do this: ```asm ; 12 bytes, 9 or 10 cycles cp FOO jr z, .foo_to_bar jr .bar_to_foo .foo_to_bar ld a, BAR jr .done .bar_to_foo ld a, FOO .done ... ``` And don't do this: ```asm ; 10 bytes, 7 or 9 cycles cp FOO jr z, .foo_to_bar ; nor jr nz, .bar_to_foo ld a, FOO ; nor ld a, BAR jr .done .foo_to_bar ; nor .bar_to_foo ld a, BAR ; nor ld a, FOO .done ... ``` (That would be applying the "[Conditional fallthrough](#conditional-fallthrough)" optimization to the first way.) Instead, do this: ```asm xor FOO ^ BAR ; 2 bytes, 2 cycles ``` (This works for the same reason as the [XOR swap algorithm](https://en.wikipedia.org/wiki/XOR_swap_algorithm) for swapping the values of two variables.) ### Divide `a` by 8 (shift `a` right 3 bits) Don't do this: ```asm ; 6 bytes, 9 cycles ; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide) ld c, 8 ; divisor call SimpleDivide ld a, b ; quotient ``` And don't do this: ```asm ; 6 bytes, 6 cycles srl a srl a srl a ``` Instead, do this: ```asm ; 5 bytes, 5 cycles rrca rrca rrca and %00011111 ``` ### Divide `a` by 16 (shift `a` right 4 bits) Don't do this: ```asm ; 6 bytes, 9 cycles ; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide) ld c, 16 ; divisor call SimpleDivide ld a, b ; quotient ``` And don't do this: ```asm ; 8 bytes, 8 cycles srl a srl a srl a srl a ``` Instead, do this: ```asm ; 4 bytes, 4 cycles swap a and $f ``` ### Set `a` to some value plus or minus carry (The example uses `b` and `c`, but any registers besides `a` would also work, including `[hl]`.) 
Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	adc 0
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	sbc 0
```

And don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	adc c
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	sbc c
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	ld b, a
	adc c
	sub b
```

```asm
	; 3 bytes, 3 cycles
	ld b, a
	sbc c
	add b
```

Also, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	adc 0
```

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	sbc 0
```

And don't do this:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	adc N
```

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	sbc N
```

Instead, do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	adc N
	sub b
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	sbc N
	add b
```

(If the original value of `a` was not backed up in `b`, this optimization would not apply.)

### Add or subtract the carry flag from a register besides `a`

(The example uses `b`, but any of `c`, `d`, `e`, `h`, or `l` would also work.)

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld a, b
	adc 0
	ld b, a
```

```asm
	; 4 bytes, 4 cycles
	ld a, b
	sbc 0
	ld b, a
```

And don't do this:

```asm
	; 4 bytes, 4 cycles
	ld a, 0
	adc b
	ld b, a
```

```asm
	; 4 bytes, 4 cycles
	ld a, 0
	sbc b
	ld b, a
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	jr nc, .no_carry
	inc b
.no_carry
```

```asm
	; 3 bytes, 3 cycles
	jr nc, .no_carry
	dec b
.no_carry
```

### Load from HRAM to `a` or from `a` to HRAM

Don't do this:

```asm
	ld a, [hFoobar] ; 3 bytes, 4 cycles
```

```asm
	ld [hFoobar], a ; 3 bytes, 4 cycles
```

Instead, do this:

```asm
	ldh a, [hFoobar] ; 2 bytes, 3 cycles
```

```asm
	ldh [hFoobar], a ; 2 bytes, 3 cycles
```

## 16-bit registers

### Multiply `hl` by 2

Don't do this:

```asm
	; 4 bytes, 4 cycles
	sla l
	rl h
```

Instead, do this:

```asm
	add hl, hl ; 1 byte, 2 cycles
```

### Add `a` to a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)
Don't do this:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, 0
	adc h
	ld h, a
```

And don't do this:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, h
	adc 0
	ld h, a
```

And don't do this:

```asm
	; 5 bytes, 5 cycles
	add l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

Instead, do this:

```asm
	; 5 bytes, 5 cycles; no labels
	add l
	ld l, a
	adc h
	sub l
	ld h, a
```

Or if you can spare another 16-bit register and want to optimize for size over speed, then do this:

```asm
	; 4 bytes, 5 cycles
	ld d, 0
	ld e, a
	add hl, de
```

### Subtract an 8-bit constant from a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 8 bytes, 8 cycles
	ld a, l
	sub FOOBAR
	ld l, a
	ld a, h
	sbc 0
	ld h, a
```

Instead, do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l
	sub FOOBAR
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry
```

(This is a case of "[Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a)", applied to the high part of a 16-bit register.)

Or if you can spare another 16-bit register, do this:

```asm
	; 4 bytes, 5 cycles
	ld de, -FOOBAR
	add hl, de
```

### Set a 16-bit register to `a` plus a constant

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 7 bytes, 8 cycles; uses another 16-bit register
	ld e, a
	ld d, 0
	ld hl, FooBar
	add hl, de
```

And don't do this:

```asm
	; 8 bytes, 8 cycles
	ld hl, FooBar
	add l
	ld l, a
	adc h
	sub l
	ld h, a
```

And don't do this:

```asm
	; 8 bytes, 8 cycles
	ld h, HIGH(FooBar)
	add LOW(FooBar)
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

Instead, do this:

```asm
	; 7 bytes, 7 cycles
	add LOW(FooBar)
	ld l, a
	adc HIGH(FooBar)
	sub l
	ld h, a
```

Or if the constant is 8-bit and nonzero (i.e. 0 < `FooBar` < 256), then do this:

```asm
	; 6 bytes, 6 cycles
	sub LOW(-FooBar)
	ld l, a
	sbc a
	inc a
	ld h, a
```

Or if the constant is zero (i.e.
`FooBar` == 0 and `a` + `FooBar` == `a`), then do this: ```asm ; 3 bytes, 3 cycles ld l, a ld h, 0 ``` ### Set a 16-bit register to `a` multiplied by 16 (The example uses `hl`, but `bc` or `de` would also work.) You can do this: ```asm ; 7 bytes, 11 cycles ld l, a ld h, 0 add hl, hl add hl, hl add hl, hl add hl, hl ``` ```asm ; 7 bytes, 11 cycles ld l, a ld h, 0 rept 4 add hl, hl endr ``` But if `a` is definitely small enough, and its value can be changed, then do one of these: ```asm ; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80 add a ld l, a ld h, 0 add hl, hl add hl, hl add hl, hl ``` ```asm ; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40 add a add a ld l, a ld h, 0 add hl, hl add hl, hl ``` ```asm ; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20 add a add a add a ld l, a ld h, 0 add hl, hl ``` ```asm ; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10 swap a ld l, a ld h, 0 ``` Or if the value of `a` can be changed and you want to optimize for speed over size, then do one of these: ```asm ; 8 bytes, 8 cycles; sets a = l swap a ld l, a and $f ld h, a xor l ld l, a ``` ```asm ; 8 bytes, 8 cycles; sets a = h swap a ld h, a and $f0 ld l, a xor h ld h, a ``` ### Increment or decrement a 16-bit register When possible, avoid doing this: ```asm inc hl ; 1 byte, 2 cycles ``` ```asm dec hl ; 1 byte, 2 cycles ``` If the low byte *definitely* won't overflow, then do this: ```asm inc l ; 1 byte, 1 cycle ``` ```asm dec l ; 1 byte, 1 cycle ``` This is applicable, for instance, if you're reading a data table via `hl` one byte at a time, it has no more than 256 entries, and it's in its own `SECTION` which has been `ALIGN`ed to 8 bits. It's unlikely to apply to pokecrystal's existing systems. ### Add or subtract the carry flag from a 16-bit register (The example uses `hl`, but `bc` or `de` would also work.) 
Don't do this:

```asm
	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	adc 0 ; nor adc l
	ld l, a
	ld a, h ; nor ld a, 0
	adc 0 ; nor adc h
	ld h, a
```

```asm
	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	sbc 0 ; nor sbc l
	ld l, a
	ld a, h ; nor ld a, 0
	sbc 0 ; nor sbc h
	ld h, a
```

And don't do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0 ; nor adc l
	ld l, a
	adc h
	sub l
	ld h, a
```

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0 ; nor sbc l
	ld l, a
	sbc h
	add l
	ld h, a
```

(That would be applying the "[Set `a` to some value plus or minus carry](#set-a-to-some-value-plus-or-minus-carry)" optimization to part of the first way.)

And don't do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0 ; nor adc l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0 ; nor sbc l
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry
```

(That would be applying the "[Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a)" optimization to part of the first way.)
Instead, do this:

```asm
	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	inc hl
.no_carry
```

```asm
	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	dec hl
.no_carry
```

### Load from an address to `hl`

Don't do this:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address] ; LSB first
	ld l, a
	ld a, [Address+1]
	ld h, a
```

Instead, do this:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address
	ld a, [hli]
	ld h, [hl]
	ld l, a
```

And don't do this:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address] ; MSB first
	ld h, a
	ld a, [Address+1]
	ld l, a
```

Instead, do this:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address
	ld a, [hli]
	ld l, [hl]
	ld h, a
```

### Load from an address to `sp`

Don't do this:

```asm
	; 9 bytes, 12 cycles
	ld a, [Address]
	ld l, a
	ld a, [Address+1]
	ld h, a
	ld sp, hl
```

And don't do this:

```asm
	; 7 bytes, 10 cycles
	ldh a, [hAddress]
	ld l, a
	ldh a, [hAddress+1]
	ld h, a
	ld sp, hl
```

And don't do this:

```asm
	; 7 bytes, 10 cycles
	ld hl, Address
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl
```

(That would be applying the "[Load from an address to `hl`](#load-from-an-address-to-hl)" optimization to the first way.)

Instead, do this:

```asm
	; 5 bytes, 8 cycles
	ld sp, Address
	pop hl
	ld sp, hl
```

Or if the address is already in `hl`, then don't do this:

```asm
	; 4 bytes, 7 cycles
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl
```

Instead, do this:

```asm
	; 3 bytes, 7 cycles
	ld sp, hl
	pop hl
	ld sp, hl
```

### Exchange two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)

If you care about speed, then do this:

```asm
	; 6 bytes, 6 cycles
	ld a, d
	ld d, h
	ld h, a
	ld a, e
	ld e, l
	ld l, a
```

If you care about size, then do this:

```asm
	; 4 bytes, 9 cycles
	push de
	ld d, h
	ld e, l
	pop hl
```

### Subtract two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)
Don't do this:

```asm
	; 9 bytes, 10 cycles; modifies subtrahend de
	ld a, $ff
	xor d
	ld d, a
	ld a, $ff
	xor e
	ld e, a
	add hl, de
```

And don't do this:

```asm
	; 7 bytes, 8 cycles; modifies subtrahend de
	ld a, d
	cpl
	ld d, a
	ld a, e
	cpl
	ld e, a
	add hl, de
```

Instead, do this:

```asm
	; 6 bytes, 6 cycles
	ld a, l
	sub e
	ld l, a
	ld a, h
	sbc d
	ld h, a
```

### Load two constants into a register pair

(The example uses `bc`, but `hl` or `de` would also work.)

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, FOO
	ld c, BAR
```

Instead, do this:

```asm
	ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
```

Or better, use the `lb` macro in [macros/code.asm](../blob/master/macros/code.asm):

```asm
	lb bc, FOO, BAR ; 3 bytes, 3 cycles
```

### Load a constant into `[hl]`

Don't do this:

```asm
	; 3 bytes, 4 cycles
	ld a, FOOBAR
	ld [hl], a
```

Instead, do this:

```asm
	ld [hl], FOOBAR ; 2 bytes, 3 cycles
```

### Increment or decrement `[hl]`

Don't do this:

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	inc a
	ld [hl], a
```

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	dec a
	ld [hl], a
```

Instead, do this:

```asm
	inc [hl] ; 1 byte, 3 cycles
```

```asm
	dec [hl] ; 1 byte, 3 cycles
```

### Load a constant into `[hl]` and increment or decrement `hl`

Don't do this:

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	inc hl
```

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	dec hl
```

Instead, do this:

```asm
	ld [hli], a ; 1 byte, 2 cycles
```

```asm
	ld [hld], a ; 1 byte, 2 cycles
```

And if you can use `a`, then don't do this:

```asm
	; 3 bytes, 5 cycles
	ld [hl], FOO
	inc hl
```

```asm
	; 3 bytes, 5 cycles
	ld [hl], FOO
	dec hl
```

Instead, do this:

```asm
	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hli], a
```

```asm
	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hld], a
```

## Branching (control flow)

### Relative jumps

Don't do this:

```asm
	jp Somewhere ; 3 bytes, 4 cycles
```

Instead, do this:

```asm
	jr Somewhere ; 2 bytes, 3 cycles
```

This only applies if `Somewhere` is close enough to the jump: `jr` can reach from 128 bytes before to 127 bytes after the end of the `jr` instruction.
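To make the range limit concrete: `jr` stores a signed 8-bit offset counted from the address just past the instruction, so the farthest forward target sits 127 bytes beyond the end of the `jr`. A minimal sketch of that boundary case follows; the label names, section name, and padding are hypothetical.

```asm
SECTION "jr range example", ROM0

JrRangeExample:
	jr .farthest ; 2 bytes: opcode $18, then the offset byte $7f (+127)
	ds 127 ; 127 bytes of padding between the end of the jr and its target
.farthest
	ret
```

With one more byte of padding the target would no longer fit in the offset byte, and the jump would have to be a `jp`.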
You can define a `jmp` macro to use instead of `jp`, which will warn you when it can be `jr` instead: ``` jmp: MACRO if _NARG == 1 jp \1 else jp \1, \2 shift endc assert warn, (\1) - @ > 127 || (\1) - @ < -129, "jp can be jr" ENDM ``` ### Compare `a` to 0 Don't do this: ```asm cp 0 ; 2 bytes, 2 cycles ``` And don't do this: ```asm or 0 ; 2 bytes, 2 cycles ``` And don't do this: ```asm and $ff ; 2 bytes, 2 cycles ``` Instead, do this: ```asm or a ; 1 byte, 1 cycle ``` Or do this: ```asm and a ; 1 byte, 1 cycle ``` ### Compare `a` to 1 Do this: ```asm cp 1 ; 2 bytes, 2 cycles; updates Z and C flags ``` Or if you don't care about the value in `a`, and don't need to set the carry flag, then do this: ```asm dec a ; 1 byte, 1 cycle; decrements a, updates Z flag ``` Note that you can still do `inc a` afterwards, which is one cycle faster if the jump is taken. Compare this: ```asm ; 4 bytes, 4 or 5 cycles cp 1 jr z, .equals1 ``` with this: ```asm ; 4 bytes, 4 cycles dec a jr z, .equals1 inc a ``` ### Compare `a` to 255 (255, or $FF in hexadecimal, is the same as −1 due to [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement).) Do this: ```asm cp $ff ; 2 bytes, 2 cycles; updates Z and C flags ``` Or if you don't care about the value in `a`, and don't need to set the carry flag, then do this: ```asm inc a ; 1 byte, 1 cycle; increments a, updates Z flag ``` Note that you can still do `dec a` afterwards, which is one cycle faster if the jump is taken. 
Compare this: ```asm ; 4 bytes, 4 or 5 cycles cp $ff jr z, .equals255 ``` with this: ```asm ; 4 bytes, 4 cycles inc a jr z, .equals255 dec a ``` ### Compare `a` to 0 after masking it Don't do this: ```asm ; 3 bytes, 3 cycles; sets zero flag if a == 0 and MASK and a ``` Instead, do this: ```asm and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0 ``` ### Compare `a` to a mask after masking it Don't do this: ```asm ; 4 bytes, 4 cycles; sets zero flag if a == MASK and carry flag if a < MASK and MASK cp MASK ``` If you don't need to set the carry flag, and don't need the masked value of `a`, then do this: ```asm ; 3 bytes, 3 cycles; sets zero flag if a was equal to MASK or ~MASK inc a ``` ### Test whether `a` is negative (compare `a` to $80) If you don't need to preserve the value in `a`, then don't do this: ```asm ; 4 bytes, 4 or 5 cycles cp $80 jr nc, .negative ``` And don't do this: ```asm ; 4 bytes, 4 or 5 cycles bit 7, a jr nz, .negative ``` Instead, do this: ```asm ; 3 bytes, 3 or 4 cycles; modifies a rlca jr c, .negative ``` ## Subroutines (functions) ### Tail call optimization Don't do this: ```asm ; 4 bytes, 10 cycles call Function ret ``` Instead, do this: ```asm jp Function ; 3 bytes, 4 cycles ``` ### Call `hl` Don't do this: ```asm ; 5 bytes, 8 cycles (some code) ld de, .return push de jp hl .return: (some more code) ``` Instead, do this: ```asm ; 3 bytes, 6 cycles ; (4 bytes, 7 cycles, counting the definition of _hl_) (some code) call _hl_ (some more code) ``` `_hl_` is a routine already defined in [home/call_regs.asm](../blob/master/home/call_regs.asm): ```asm _hl_:: jp hl ``` ### Inlining Don't do this: ```asm ; 4 additional bytes, 10 additional cycles (some code) call Function (some more code) Function: (function code) ret ``` if `Function` is only called a handful of times. 
Instead, do:

```asm
	(some code)
; Function
	(function code)
	(some more code)
```

You shouldn't do this if `Function` used any `ret`urns besides the one at the very end, or if inlining its code would make some `jr`s too distant from their targets.

### Fallthrough

Don't do this:

```asm
	(some code)
	call Function
	ret

Function:
	(function code)
	ret
```

And don't do this:

```asm
	(some code)
	jp Function

Function:
	(function code)
	ret
```

Instead, do this:

```asm
	(some code)
	; fallthrough
Function:
	(function code)
	ret
```

Fallthrough is what you get when you combine inlining with tail calls. You can still `call Function` elsewhere, but one tail call can be optimized into a fallthrough.

### Conditional fallthrough

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	(some code)
	jr z, .foo
	jr .bar
.foo
	(foo code)
.bar
	(bar code)
```

Instead, do this:

```asm
	(some code)
	jr nz, .bar
	; fallthrough
.foo
	(foo code)
.bar
	(bar code)
```

### Conditional return

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	jr z, .skip
	ret
.skip
	...
```

And don't do this:

```asm
	; 3 bytes, 7 or 2 cycles
	jr nz, .return
	...
.return
	ret
```

Instead, do this:

```asm
	; 1 byte, 5 or 2 cycles
	ret nz
	...
```

### Conditional call

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 5 bytes, 3 or 8 cycles
	jr nz, .skip
	call Foo
.skip
```

Instead, do this:

```asm
	; 3 bytes, 6 or 3 cycles
	call z, Foo
```

And don't do this:

```asm
	; 5 bytes, 3 or 6 cycles
	jr nz, .skip
	jp Foo
.skip
```

Instead, do this:

```asm
	; 3 bytes, 4 or 3 cycles
	jp z, Foo
```

### Conditional `rst $38`

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 5 bytes, 3 or 14 cycles
	call z, RstVector38
	...

RstVector38:
	rst $38
	ret
```

And don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	jr nz, .no_rst_38
	rst $38
.no_rst_38
	...
```

And don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	call z, $0038
	...
```

Instead, do this:

```asm
	; 2 bytes, 2 or 7 cycles
	jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
	...
```

(The label `@` evaluates to the current `pc` value, which in `jr z, @ + 1` is right before the `jr` instruction. The instruction consists of two bytes, the opcode and the relative offset. `@ + 1` evaluates to in-between those two bytes. The `jr` instruction encodes its offset relative to the *end* of the instruction, i.e. the *next* `pc` value after the instruction has been read, so the relative offset is `-1`, aka `$ff`.)

### Enable interrupts and return

Don't do this:

```asm
	; 2 bytes, 5 cycles
	ei
	ret
```

Instead, do this:

```asm
	; 1 byte, 4 cycles
	reti
```

## Jump and lookup tables

### Chain comparisons

Don't do this:

```asm
	cp 1
	jr z, .equals1
	cp 2
	jr z, .equals2
	cp 3
	jr z, .equals3
	...
```

Instead, do this:

```asm
	dec a
	jr z, .equals1
	dec a
	jr z, .equals2
	dec a
	jr z, .equals3
	...
```

Or do this:

```asm
	dec a
	ld hl, .jumptable
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	jp hl

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```

Or better, do:

```asm
	dec a
	ld hl, .jumptable
	rst JumpTable
	ret

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```

`JumpTable` is an `rst` routine already defined in [home/header.asm](../blob/master/home/header.asm):

```asm
JumpTable::
	push de
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	pop de
	jp hl
```

### Off-by-one `AddNTimes`

Don't do this:

```asm
	ld hl, Foo
	ld bc, BAR
	dec a
	call AddNTimes
```

Instead, as long as you don't need to add 255 times when `a` is 0, then do this:

```asm
	ld hl, Foo - BAR
	ld bc, BAR
	call AddNTimes
```
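As a worked sketch of why the two forms agree, take a hypothetical table `Entries` with 4-byte entries and a 1-based index in `a`. The names `GetEntry`, `Entries`, and `ENTRY_SIZE` are made up for illustration, and the `AddNTimes` below is only a stand-in assumed to behave like the real routine (add `bc` to `hl`, `a` times, doing nothing when `a` is 0); the `DEF` keyword assumes a recent RGBDS version.

```asm
DEF ENTRY_SIZE EQU 4 ; hypothetical entry size

SECTION "AddNTimes example", ROM0

; Input: a = 1-based index (1 to 255)
; Output: hl = Entries + (a - 1) * ENTRY_SIZE
GetEntry:
	ld hl, Entries - ENTRY_SIZE ; start one entry early instead of doing dec a
	ld bc, ENTRY_SIZE
	jp AddNTimes ; for a = 3: hl = Entries - 4 + 3 * 4 = Entries + 8

; Stand-in for illustration only; assumed behavior: hl += bc * a
AddNTimes:
	and a
	ret z
.loop
	add hl, bc
	dec a
	jr nz, .loop
	ret

Entries:
	ds 3 * ENTRY_SIZE ; placeholder space for three hypothetical entries
```

The `dec a` version computes `Entries + (3 - 1) * 4`, which is the same `Entries + 8`. The two only diverge for `a` == 0, where this form leaves `hl` at `Entries - ENTRY_SIZE`.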