Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).

Most of these tricks come from [Jeff's GB Assembly Code Tips v1.0](http://www.devrs.com/gb/files/asmtips.txt), [WikiTI's Z80 Optimization page](http://wikiti.brandonw.net/index.php?title=Z80_Optimization), [z80 Heaven's optimization tutorial](http://z80-heaven.wikidot.com/optimization), and [GBDev Wiki's ASM Snippets](https://gbdev.gg8.se/wiki/articles/ASM_Snippets). (Note that Z80 assembly is *not* the same as GBZ80; it has more registers and some different instructions.)

WikiTI's advice fully applies here:

> Note that the following tricks act much like a [peephole optimizer](https://en.wikipedia.org/wiki/Peephole_optimization) and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
>
> Also note that nearly every trick turns the code less understandable and documenting them is a good idea. You can easily forget after a while without reading parts of the code.
>
> Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.
>
> There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an [instruction set summary](https://rednex.github.io/rgbds/gbz80.7.html) helps to visualize what you can do during coding.

(There's also a "cheat sheet" [table of instructions](https://gbdev.io/gb-opcodes//optables/classic) summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)


## Contents

- [8-bit registers](#8-bit-registers)
  - [Set `a` to 0](#set-a-to-0)
  - [Increment or decrement `a`](#increment-or-decrement-a)
  - [Invert the bits of `a`](#invert-the-bits-of-a)
  - [Rotate the bits of `a`](#rotate-the-bits-of-a)
  - [Reverse the bits of `a`](#reverse-the-bits-of-a)
  - [Set `a` to some constant minus `a`](#set-a-to-some-constant-minus-a)
  - [Set `a` to one constant or another depending on the carry flag](#set-a-to-one-constant-or-another-depending-on-the-carry-flag)
  - [Increment or decrement `a` when the carry flag is set](#increment-or-decrement-a-when-the-carry-flag-is-set)
  - [Toggle `a` between two different constants](#toggle-a-between-two-different-constants)
  - [Divide `a` by 8 (shift `a` right 3 bits)](#divide-a-by-8-shift-a-right-3-bits)
  - [Divide `a` by 16 (shift `a` right 4 bits)](#divide-a-by-16-shift-a-right-4-bits)
  - [Set `a` to some value plus or minus carry](#set-a-to-some-value-plus-or-minus-carry)
  - [Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a)
  - [Load from HRAM to `a` or from `a` to HRAM](#load-from-hram-to-a-or-from-a-to-hram)
- [16-bit registers](#16-bit-registers)
  - [Multiply `hl` by 2](#multiply-hl-by-2)
  - [Add `a` to a 16-bit register](#add-a-to-a-16-bit-register)
  - [Subtract an 8-bit constant from a 16-bit register](#subtract-an-8-bit-constant-from-a-16-bit-register)
  - [Set a 16-bit register to `a` plus a constant](#set-a-16-bit-register-to-a-plus-a-constant)
  - [Set a 16-bit register to `a` multiplied by 16](#set-a-16-bit-register-to-a-multiplied-by-16)
  - [Increment or decrement a 16-bit register](#increment-or-decrement-a-16-bit-register)
  - [Add or subtract the carry flag from a 16-bit register](#add-or-subtract-the-carry-flag-from-a-16-bit-register)
  - [Load from an address to `hl`](#load-from-an-address-to-hl)
  - [Load from an address to `sp`](#load-from-an-address-to-sp)
  - [Exchange two 16-bit registers](#exchange-two-16-bit-registers)
  - [Subtract two 16-bit registers](#subtract-two-16-bit-registers)
  - [Load two constants into a register pair](#load-two-constants-into-a-register-pair)
  - [Load a constant into `[hl]`](#load-a-constant-into-hl)
  - [Increment or decrement `[hl]`](#increment-or-decrement-hl)
  - [Load a constant into `[hl]` and increment or decrement `hl`](#load-a-constant-into-hl-and-increment-or-decrement-hl)
- [Branching (control flow)](#branching-control-flow)
  - [Relative jumps](#relative-jumps)
  - [Compare `a` to 0](#compare-a-to-0)
  - [Compare `a` to 1](#compare-a-to-1)
  - [Compare `a` to 255](#compare-a-to-255)
  - [Compare `a` to 0 after masking it](#compare-a-to-0-after-masking-it)
  - [Compare `a` to a mask after masking it](#compare-a-to-a-mask-after-masking-it)
  - [Test whether `a` is negative (compare `a` to $80)](#test-whether-a-is-negative-compare-a-to-80)
- [Subroutines (functions)](#subroutines-functions)
  - [Tail call optimization](#tail-call-optimization)
  - [Call `hl`](#call-hl)
  - [Inlining](#inlining)
  - [Fallthrough](#fallthrough)
  - [Conditional fallthrough](#conditional-fallthrough)
  - [Conditional return](#conditional-return)
  - [Conditional call](#conditional-call)
  - [Conditional `rst $38`](#conditional-rst-38)
  - [Enable interrupts and return](#enable-interrupts-and-return)
- [Jump and lookup tables](#jump-and-lookup-tables)
  - [Chain comparisons](#chain-comparisons)
  - [Off-by-one `AddNTimes`](#off-by-one-addntimes)


## 8-bit registers


### Set `a` to 0

Don't do this:

```asm
	ld a, 0 ; 2 bytes, 2 cycles; no changes to flags
```

Instead, do this:

```asm
	xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

Or do this:

```asm
	sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
```

Don't use the optimized versions if you need to preserve flags. As such, `ld a, 0` must be left intact in the code below:

```asm
	ld a, [wIsTrainerBattle]
	and a   ; sets zero flag if [wIsTrainerBattle] == 0
	ld a, 0 ; sets a to 0 without affecting zero flag
	jr nz, .is_trainer_battle
	; is not trainer battle
```


### Increment or decrement `a`

When possible, avoid doing this:

```asm
	add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow
```

```asm
	sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow
```

If you don't need to set the carry flag, then do this:

```asm
	inc a ; 1 byte, 1 cycle
```

```asm
	dec a ; 1 byte, 1 cycle
```


### Invert the bits of `a`

Don't do this:

```asm
	xor $ff ; 2 bytes, 2 cycles
```

Instead, do this:

```asm
	cpl ; 1 byte, 1 cycle
```


### Rotate the bits of `a`

Don't do this:

```asm
	rl a ; 2 bytes, 2 cycles; updates Z and C flags
```

```asm
	rlc a ; 2 bytes, 2 cycles; updates Z and C flags
```

```asm
	rr a ; 2 bytes, 2 cycles; updates Z and C flags
```

```asm
	rrc a ; 2 bytes, 2 cycles; updates Z and C flags
```

Instead, do this:

```asm
	rla ; 1 byte, 1 cycle; updates C flag
```

```asm
	rlca ; 1 byte, 1 cycle; updates C flag
```

```asm
	rra ; 1 byte, 1 cycle; updates C flag
```

```asm
	rrca ; 1 byte, 1 cycle; updates C flag
```

The exception is if you need to set the zero flag when the operation results in 0 for `a`; the two-byte operations can set `z`, the one-byte operations cannot.


### Reverse the bits of `a`

(This optimization is based on [Retro Programming](http://www.retroprogramming.com/2014/01/fast-z80-bit-reversal.html)).

(The example uses `b` and `c`, but any of `d`, `e`, `h`, or `l` would also work.)

Don't do this:

```asm
	; 25 bytes, 25 cycles
rept 8
	rra  ; nor rla
	rl b ; nor rr b
endr
	ld a, b
```

And don't do this:

```asm
	; 17 bytes, 17 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	rlca
	rlca
	rlca
	rrc b
	xor b
	and $66
	xor b
```

Instead, do this:

```asm
	; 15 bytes, 15 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	swap b
	xor b
	and $33
	xor b
	rrca
```

Or if you really want to optimize for size over speed, then don't do this:

```asm
	; 10 bytes, 59 cycles
	ld bc, 8  ; lb bc, 0, 8
.loop
	rra  ; nor rla
	rl b ; nor rr b
	dec c
	jr nz, .loop
	ld a, b
```

Instead, do this:

```asm
	; 8 bytes, 50 cycles
	ld b, 1
.loop
	rra
	rl b
	jr nc, .loop
	ld a, b
```
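
The shorter loop needs no counter because the initial `1` in `b` acts as a sentinel: each `rl b` moves it one bit higher, and it reaches the carry flag on the eighth pass, which ends the loop. The same loop, annotated as a sketch of what each step does:

```asm
	ld b, 1      ; bit 0 of b is a sentinel, not data
.loop
	rra          ; carry = lowest remaining bit of a
	rl b         ; shift that bit into b; carry = old bit 7 of b
	jr nc, .loop ; carry stays 0 until the sentinel falls out on the 8th pass
	ld a, b      ; b now holds the original a with its bits reversed
```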

Or if you really want to optimize for speed over size, then do this:

```asm
	; 6 bytes, 12 cycles
	; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
	push hl
	ld h, HIGH(ReversedBitTable)
	ld l, a
	ld a, [hl]
	pop hl
```

```asm
	; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
for x, 256
	; http://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
	db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
endr
```


### Set `a` to some constant minus `a`

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, FOOBAR
	sub b
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	cpl
	add FOOBAR + 1
```

("What's [foobar](https://en.wikipedia.org/wiki/Foobar)?")
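
The trick relies on two's complement: `cpl` turns `a` into `255 - a`, so adding `FOOBAR + 1` gives `FOOBAR - a` modulo 256. Here is a worked trace, assuming the illustrative values `FOOBAR` = 100 and `a` = 30 (neither is taken from pokecrystal):

```asm
	; assumed for illustration: FOOBAR = 100
	ld a, 30
	cpl     ; a = 255 - 30 = 225
	add 101 ; FOOBAR + 1; a = (225 + 101) % 256 = 70 = 100 - 30
```

Note that the carry flag ends up inverted compared to the `sub b` version: `add` sets it when `FOOBAR` >= `a`, while `sub` sets it when `FOOBAR` < `a`.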


### Set `a` to one constant or another depending on the carry flag

(The example sets `a` to `CVAL` if the carry flag is set (`c`), or `NCVAL` if the carry flag is not set (`nc`).)

Don't do this:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, CVAL
	jr c, .carry
	ld a, NCVAL
.carry
```

And don't do this:

```asm
	; 6 bytes, 5 or 6 cycles
	ld a, NCVAL
	jr nc, .no_carry
	ld a, CVAL
.no_carry
```

And if either is 0, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld a, CVAL   ; nor NCVAL
	jr c, .carry ; nor jr nc
	xor a
.carry
```

And if either is 1 more or less than the other, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld a, CVAL   ; nor NCVAL
	jr c, .carry ; nor jr nc
	inc a        ; nor dec a
.carry
```

Instead use `sbc a`, which copies the carry flag to all bits of `a`. So do this:

```asm
	; 5 bytes, 5 cycles
	sbc a            ; if carry, then $ff, else 0
	and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
	add NCVAL        ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL
```

Or do this:

```asm
	; 5 bytes, 5 cycles
	sbc a            ; if carry, then $ff, else 0
	and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
	xor NCVAL        ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL
```
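
A worked trace of the `and`/`add` version may make the masking clearer. It assumes the illustrative values `CVAL` = 5 and `NCVAL` = 9, and uses `scf` only to force the carry flag for the example:

```asm
	; assumed for illustration: CVAL = 5, NCVAL = 9
	scf       ; set the carry flag for this trace
	sbc a     ; a = a - a - 1 = $ff
	and 5 - 9 ; CVAL - NCVAL = -4 = $fc; a = $ff & $fc = $fc
	add 9     ; a = ($fc + 9) % 256 = 5 = CVAL
```

With the carry flag clear instead, `sbc a` gives 0, the `and` leaves 0, and `add 9` gives 9, which is `NCVAL`.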

And if certain conditions apply, then do something more efficient:

<table>

<tr>
<th>If this case...</th>
<th>...then do this:</th>
</tr>

<tr><td>

`CVAL` == $FF (aka −1) <br>and<br> `NCVAL` == 0

</td><td>

```asm
	; 1 byte, 1 cycle
	sbc a ; if carry, then $ff, else 0
```

</td></tr>
<tr><td>

`CVAL` == 0 <br>and<br> `NCVAL` == $FF (aka −1)

</td><td>

```asm
	; 2 bytes, 2 cycles
	ccf   ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff
```

</td></tr>
<tr><td>

`CVAL` == 0 <br>and<br> `NCVAL` == 1

</td><td>

```asm
	; 2 bytes, 2 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	inc a ; -1 becomes 0, 0 becomes 1
```

</td></tr>
<tr><td>

`CVAL` == $FF (aka −1)

</td><td>

```asm
	; 3 bytes, 3 cycles
	sbc a     ; if carry, then $ff, else 0
	or NCVAL  ; $ff stays $ff, $00 becomes NCVAL
```

</td></tr>
<tr><td>

`NCVAL` == 0

</td><td>

```asm
	; 3 bytes, 3 cycles
	sbc a    ; if carry, then $ff, else 0
	and CVAL ; $ff becomes CVAL, 0 stays 0
```

</td></tr>
<tr><td>

`CVAL` == `NCVAL - 1`, <br>aka<br> `CVAL + 1` == `NCVAL`

</td><td>

```asm
	; 3 bytes, 3 cycles
	sbc a     ; if carry, then $ff aka -1, else 0
	add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL
```

</td></tr>
<tr><td>

`CVAL` == `NCVAL - 2`, <br>aka<br> `CVAL + 2` == `NCVAL`

</td><td>

```asm
	; 3 bytes, 3 cycles
	sbc a      ; if carry, then $ff aka -1, else 0; doesn't change the carry flag
	sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL
```

</td></tr>
<tr><td>

`CVAL` == 0

</td><td>

```asm
	; 4 bytes, 4 cycles
	ccf       ; invert carry flag
	sbc a     ; if originally carry, then 0, else $ff
	and NCVAL ; 0 stays 0, $ff becomes NCVAL
```

</td></tr>
<tr><td>

`NCVAL` == $FF (aka −1)

</td><td>

```asm
	; 4 bytes, 4 cycles
	ccf     ; invert carry flag
	sbc a   ; if originally carry, then 0, else $ff
	or CVAL ; $00 becomes CVAL, $ff stays $ff
```

</td></tr>
<tr><td>

`CVAL` == `NCVAL + 1`, <br>aka<br> `CVAL - 1` == `NCVAL`

</td><td>

```asm
	; 4 bytes, 4 cycles
	ccf      ; invert carry flag
	sbc a    ; if originally carry, then 0, else $ff aka -1
	add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL
```

</td></tr>
<tr><td>

`CVAL` == `NCVAL + 2`, <br>aka<br> `CVAL - 2` == `NCVAL`

</td><td>

```asm
	; 4 bytes, 4 cycles
	ccf       ; invert carry flag
	sbc a     ; if originally carry, then 0, else $ff aka -1; doesn't change the carry flag
	sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL
```

</td></tr>

</table>


### Increment or decrement `a` when the carry flag is set

Don't do this:

```asm
	; 3 bytes, 3 cycles
	jr nc, .ok
	inc a
.ok
```

```asm
	; 3 bytes, 3 cycles
	jr nc, .ok
	dec a
.ok
```

Instead, do this:

```asm
	adc 0 ; 2 bytes, 2 cycles
```

```asm
	sbc 0 ; 2 bytes, 2 cycles
```


### Toggle `a` between two different constants

Don't do this:

```asm
	; 12 bytes, 9 or 10 cycles
	cp FOO
	jr z, .foo_to_bar
	jr .bar_to_foo
.foo_to_bar
	ld a, BAR
	jr .done
.bar_to_foo
	ld a, FOO
.done
	...
```

And don't do this:

```asm
	; 10 bytes, 7 or 9 cycles
	cp FOO
	jr z, .foo_to_bar ; nor jr nz, .bar_to_foo
	ld a, FOO         ; nor ld a, BAR
	jr .done
.foo_to_bar               ; nor .bar_to_foo
	ld a, BAR         ; nor ld a, FOO
.done
	...
```

(That would be applying the "[Conditional fallthrough](#conditional-fallthrough)" optimization to the first way.)

Instead, do this:

```asm
	xor FOO ^ BAR ; 2 bytes, 2 cycles
```

(This works for the same reason as the [XOR swap algorithm](https://en.wikipedia.org/wiki/XOR_swap_algorithm) for swapping the values of two variables.)
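
A worked trace, assuming the illustrative values `FOO` = 3 and `BAR` = 5:

```asm
	; assumed for illustration: FOO = 3 (%011), BAR = 5 (%101), so FOO ^ BAR = 6 (%110)
	ld a, 3
	xor 3 ^ 5 ; a = %011 ^ %110 = %101 = 5 (BAR)
	xor 3 ^ 5 ; a = %101 ^ %110 = %011 = 3 (FOO again)
```

Unlike the branching versions, this only gives a meaningful result when `a` already holds `FOO` or `BAR`; any other value is XORed into something else rather than being forced to one of the two constants.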


### Divide `a` by 8 (shift `a` right 3 bits)

Don't do this:

```asm
	; 6 bytes, 9 cycles
	; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
	ld c, 8 ; divisor
	call SimpleDivide
	ld a, b ; quotient
```

And don't do this:

```asm
	; 6 bytes, 6 cycles
	srl a
	srl a
	srl a
```

Instead, do this:

```asm
	; 5 bytes, 5 cycles
	rrca
	rrca
	rrca
	and %00011111
```
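
The rotates wrap the low three bits around to the top of `a`, and the `and` then discards them. A worked trace with an arbitrary example value:

```asm
	ld a, %10110101 ; 181
	rrca            ; a = %11011010
	rrca            ; a = %01101101
	rrca            ; a = %10110110 (the low 3 bits have wrapped to the top)
	and %00011111   ; a = %00010110 = 22 = 181 / 8, rounded down
```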


### Divide `a` by 16 (shift `a` right 4 bits)

Don't do this:

```asm
	; 6 bytes, 9 cycles
	; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
	ld c, 16 ; divisor
	call SimpleDivide
	ld a, b ; quotient
```

And don't do this:

```asm
	; 8 bytes, 8 cycles
	srl a
	srl a
	srl a
	srl a
```

Instead, do this:

```asm
	; 4 bytes, 4 cycles
	swap a
	and $f
```


### Set `a` to some value plus or minus carry

(The example uses `b` and `c`, but any registers besides `a` would also work, including `[hl]`.)

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	adc 0
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	sbc 0
```

And don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	adc c
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	sbc c
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	ld b, a
	adc c
	sub b
```

```asm
	; 3 bytes, 3 cycles
	ld b, a
	sbc a ; if carry, then $ff aka -1, else 0
	add c ; a = c - carry
```

Also, don't do this:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	adc 0
```

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	sbc 0
```

And don't do this:

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	adc N
```

```asm
	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	sbc N
```

Instead, do this:

```asm
	; 4 bytes, 4 cycles
	ld b, a
	adc N
	sub b
```

```asm
	; 4 bytes, 4 cycles
	ld b, a
	sbc a ; if carry, then $ff aka -1, else 0
	add N ; a = N - carry
```

(If the original value of `a` was not backed up in `b`, this optimization would not apply.)


### Add or subtract the carry flag from a register besides `a`

(The example uses `b`, but any of `c`, `d`, `e`, `h`, or `l` would also work.)

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld a, b
	adc 0
	ld b, a
```

```asm
	; 4 bytes, 4 cycles
	ld a, b
	sbc 0
	ld b, a
```

And don't do this:

```asm
	; 4 bytes, 4 cycles
	ld a, 0
	adc b
	ld b, a
```

```asm
	; 4 bytes, 4 cycles
	ld a, 0
	sbc b
	ld b, a
```

Instead, do this:

```asm
	; 3 bytes, 3 cycles
	jr nc, .no_carry
	inc b
.no_carry
```

```asm
	; 3 bytes, 3 cycles
	jr nc, .no_carry
	dec b
.no_carry
```


### Load from HRAM to `a` or from `a` to HRAM

Don't do this:

```asm
	ld a, [hFoobar] ; 3 bytes, 4 cycles
```

```asm
	ld [hFoobar], a ; 3 bytes, 4 cycles
```

Instead, do this:

```asm
	ldh a, [hFoobar] ; 2 bytes, 3 cycles
```

```asm
	ldh [hFoobar], a ; 2 bytes, 3 cycles
```


## 16-bit registers


### Multiply `hl` by 2

Don't do this:

```asm
	; 4 bytes, 4 cycles
	sla l
	rl h
```

Instead, do this:

```asm
	add hl, hl ; 1 byte, 2 cycles
```


### Add `a` to a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, 0
	adc h
	ld h, a
```

And don't do this:

```asm
	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, h
	adc 0
	ld h, a
```

And don't do this:

```asm
	; 5 bytes, 5 cycles
	add l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

Instead, do this:

```asm
	; 5 bytes, 5 cycles; no labels
	add l
	ld l, a
	adc h
	sub l
	ld h, a
```
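
The `adc h` / `sub l` pair works because `a` still equals the new `l` at that point, so subtracting `l` leaves just `h` plus the carry. A worked trace with arbitrary example values:

```asm
	ld hl, $12f0
	ld a, $20
	add l   ; a = ($f0 + $20) % 256 = $10, carry set
	ld l, a ; l = $10 (ld does not affect flags)
	adc h   ; a = $10 + $12 + 1 = $23
	sub l   ; a = $23 - $10 = $13
	ld h, a ; hl = $1310 = $12f0 + $20
```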

Or if you can spare another 16-bit register and want to optimize for size over speed, then do this:

```asm
	; 4 bytes, 5 cycles
	ld d, 0
	ld e, a
	add hl, de
```


### Subtract an 8-bit constant from a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 8 bytes, 8 cycles
	ld a, l
	sub FOOBAR
	ld l, a
	ld a, h
	sbc 0
	ld h, a
```

Instead, do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l
	sub FOOBAR
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry
```

(This is a case of "[Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a)", applied to the high part of a 16-bit register.)

Or if you can spare another 16-bit register, do this:

```asm
	; 4 bytes, 5 cycles
	ld de, -FOOBAR
	add hl, de
```


### Set a 16-bit register to `a` plus a constant

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 7 bytes, 8 cycles; uses another 16-bit register
	ld e, a
	ld d, 0
	ld hl, FooBar
	add hl, de
```

And don't do this:

```asm
	; 8 bytes, 8 cycles
	ld hl, FooBar
	add l
	ld l, a
	adc h
	sub l
	ld h, a
```

And don't do this:

```asm
	; 8 bytes, 8 cycles
	ld h, HIGH(FooBar)
	add LOW(FooBar)
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

Instead, do this:

```asm
	; 7 bytes, 7 cycles
	add LOW(FooBar)
	ld l, a
	adc HIGH(FooBar)
	sub l
	ld h, a
```

Or if the constant is 8-bit and nonzero (i.e. 0 < `FooBar` < 256), then do this:

```asm
	; 6 bytes, 6 cycles
	sub LOW(-FooBar)
	ld l, a
	sbc a
	inc a
	ld h, a
```
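
Subtracting `LOW(-FooBar)`, which is `256 - FooBar`, gives the same low byte as adding `FooBar`, but borrows exactly when `a + FooBar` does *not* overflow. So `sbc a` gives $ff for "no overflow" and 0 for "overflow", and `inc a` turns that into the high byte 0 or 1. A worked trace, assuming the illustrative value `FooBar` = $30:

```asm
	; assumed for illustration: FooBar = $30, so LOW(-FooBar) = $d0
	ld a, $e0
	sub $d0 ; a = $10, no borrow, so carry is clear (the sum overflowed 8 bits)
	ld l, a ; l = $10
	sbc a   ; a = 0 because carry is clear
	inc a   ; a = 1
	ld h, a ; hl = $0110 = $e0 + $30
```

Starting from `a` = $10 instead, `sub $d0` borrows and leaves $40 with carry set, `sbc a` gives $ff, `inc a` gives 0, and `hl` ends up as $0040.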

Or if the constant is zero (i.e. `FooBar` == 0 and `a` + `FooBar` == `a`), then do this:

```asm
	; 3 bytes, 3 cycles
	ld l, a
	ld h, 0
```


### Set a 16-bit register to `a` multiplied by 16

(The example uses `hl`, but `bc` or `de` would also work.)

You can do this:

```asm
	; 7 bytes, 11 cycles
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl
	add hl, hl
	add hl, hl
```

```asm
	; 7 bytes, 11 cycles
	ld l, a
	ld h, 0
rept 4
	add hl, hl
endr
```

But if `a` is definitely small enough, and its value can be changed, then do one of these:

```asm
	; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80
	add a
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl
	add hl, hl
```

```asm
	; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40
	add a
	add a
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl
```

```asm
	; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20
	add a
	add a
	add a
	ld l, a
	ld h, 0
	add hl, hl
```

```asm
	; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10
	swap a
	ld l, a
	ld h, 0
```

Or if the value of `a` can be changed and you want to optimize for speed over size, then do one of these:

```asm
	; 8 bytes, 8 cycles; sets a = l
	swap a
	ld l, a
	and $f
	ld h, a
	xor l
	ld l, a
```

```asm
	; 8 bytes, 8 cycles; sets a = h
	swap a
	ld h, a
	and $f0
	ld l, a
	xor h
	ld h, a
```
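
A worked trace of the first version, with an arbitrary example value, shows how the swapped nibbles are split across `h` and `l`:

```asm
	ld a, $ab
	swap a  ; a = $ba
	ld l, a ; l = $ba
	and $f  ; a = $0a
	ld h, a ; h = $0a
	xor l   ; a = $0a ^ $ba = $b0
	ld l, a ; hl = $0ab0 = $ab * 16
```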


### Increment or decrement a 16-bit register

When possible, avoid doing this:

```asm
	inc hl ; 1 byte, 2 cycles
```

```asm
	dec hl ; 1 byte, 2 cycles
```

If the low byte *definitely* won't overflow, then do this:

```asm
	inc l ; 1 byte, 1 cycle
```

```asm
	dec l ; 1 byte, 1 cycle
```

This is applicable, for instance, if you're reading a data table via `hl` one byte at a time, it has no more than 256 entries, and it's in its own `SECTION` which has been `ALIGN`ed to 8 bits. It's unlikely to apply to pokecrystal's existing systems.


### Add or subtract the carry flag from a 16-bit register

(The example uses `hl`, but `bc` or `de` would also work.)

Don't do this:

```asm
	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	ld a, h ; nor ld a, 0
	adc 0   ; nor adc h
	ld h, a
```

```asm
	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	ld a, h ; nor ld a, 0
	sbc 0   ; nor sbc h
	ld h, a
```

And don't do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	adc h
	sub l
	ld h, a
```

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	sbc a ; if carry, then $ff aka -1, else 0
	add h ; a = h - carry
	ld h, a
```

(That would be applying the "[Set `a` to some value plus or minus carry](#set-a-to-some-value-plus-or-minus-carry)" optimization to part of the first way.)

And don't do this:

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry
```

```asm
	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry
```

(That would be applying the "[Add or subtract the carry flag from a register besides `a`](#add-or-subtract-the-carry-flag-from-a-register-besides-a)" optimization to part of the first way.)

Instead, do this:

```asm
	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	inc hl
.no_carry
```

```asm
	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	dec hl
.no_carry
```


### Load from an address to `hl`

Don't do this:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address]  ; LSB first
	ld l, a
	ld a, [Address+1]
	ld h, a
```

Instead, do this:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address
	ld a, [hli]
	ld h, [hl]
	ld l, a
```

And don't do this:

```asm
	; 8 bytes, 10 cycles
	ld a, [Address]  ; MSB first
	ld h, a
	ld a, [Address+1]
	ld l, a
```

Instead, do this:

```asm
	; 6 bytes, 8 cycles
	ld hl, Address
	ld a, [hli]
	ld l, [hl]
	ld h, a
```


### Load from an address to `sp`

Don't do this:

```asm
	; 9 bytes, 12 cycles
	ld a, [Address]
	ld l, a
	ld a, [Address+1]
	ld h, a
	ld sp, hl
```

And don't do this:

```asm
	; 7 bytes, 10 cycles
	ldh a, [hAddress]
	ld l, a
	ldh a, [hAddress+1]
	ld h, a
	ld sp, hl
```

And don't do this:

```asm
	; 7 bytes, 10 cycles
	ld hl, Address
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl
```

(That would be applying the "[Load from an address to `hl`](#load-from-an-address-to-hl)" optimization to the first way.)

Instead, do this:

```asm
	; 5 bytes, 8 cycles
	ld sp, Address
	pop hl
	ld sp, hl
```

Or if the address is already in `hl`, then don't do this:

```asm
	; 4 bytes, 7 cycles
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl
```

Instead, do this:

```asm
	; 3 bytes, 7 cycles
	ld sp, hl
	pop hl
	ld sp, hl
```


### Exchange two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)

If you care about speed, then do this:

```asm
	; 6 bytes, 6 cycles
	ld a, d
	ld d, h
	ld h, a
	ld a, e
	ld e, l
	ld l, a
```

If you care about size, then do this:

```asm
	; 4 bytes, 9 cycles
	push de
	ld d, h
	ld e, l
	pop hl
```


### Subtract two 16-bit registers

(The example uses `hl` and `de`, but any pair of `bc`, `de`, or `hl` would also work.)

Don't do this:

```asm
	; 9 bytes, 10 cycles; modifies subtrahend de
	ld a, $ff
	xor d
	ld d, a
	ld a, $ff
	xor e
	ld e, a
	add hl, de
```

And don't do this:

```asm
	; 7 bytes, 8 cycles; modifies subtrahend de
	ld a, d
	cpl
	ld d, a
	ld a, e
	cpl
	ld e, a
	add hl, de
```

Instead, do this:

```asm
	; 6 bytes, 6 cycles
	ld a, l
	sub e
	ld l, a
	ld a, h
	sbc d
	ld h, a
```


### Load two constants into a register pair

(The example uses `bc`, but `hl` or `de` would also work.)

Don't do this:

```asm
	; 4 bytes, 4 cycles
	ld b, FOO
	ld c, BAR
```

Instead, do this:

```asm
	ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
```

Or better, use the `lb` macro in [macros/code.asm](../blob/master/macros/code.asm):

```asm
	lb bc, FOO, BAR ; 3 bytes, 3 cycles
```


### Load a constant into `[hl]`

Don't do this:

```asm
	; 3 bytes, 4 cycles
	ld a, FOOBAR
	ld [hl], a
```

Instead, do this:

```asm
	ld [hl], FOOBAR ; 2 bytes, 3 cycles
```


### Increment or decrement `[hl]`

Don't do this:

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	inc a
	ld [hl], a
```

```asm
	; 3 bytes, 5 cycles
	ld a, [hl]
	dec a
	ld [hl], a
```

Instead, do this:

```asm
	inc [hl] ; 1 byte, 3 cycles
```

```asm
	dec [hl] ; 1 byte, 3 cycles
```


### Load a constant into `[hl]` and increment or decrement `hl`

Don't do this:

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	inc hl
```

```asm
	; 2 bytes, 4 cycles
	ld [hl], a
	dec hl
```

Instead, do this:

```asm
	ld [hli], a ; 1 byte, 2 cycles
```

```asm
	ld [hld], a ; 1 byte, 2 cycles
```

And if you can use `a`, then don't do this:

```asm
	; 3 bytes, 5 cycles
	ld [hl], FOO
	inc hl
```

```asm
	; 3 bytes, 5 cycles
	ld [hl], FOO
	dec hl
```

Instead, do this:

```asm
	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hli], a
```

```asm
	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hld], a
```


## Branching (control flow)


### Relative jumps

Don't do this:

```asm
	jp Somewhere ; 3 bytes, 4 cycles
```

Instead, do this:

```asm
	jr Somewhere ; 2 bytes, 3 cycles
```

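This only applies if `Somewhere` is close enough: `jr` encodes a signed 8-bit offset, so the target must lie within −128 to +127 bytes of the end of the `jr` instruction.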

You can define a `jmp` macro to use instead of `jp`, which will warn you when it can be `jr` instead:

```asm
jmp: MACRO
	if _NARG == 1
		jp \1
	else
		jp \1, \2
		shift
	endc
	assert warn, (\1) - @ > 127 || (\1) - @ < -129, "jp can be jr"
ENDM
```


### Compare `a` to 0

Don't do this:

```asm
	cp 0 ; 2 bytes, 2 cycles
```

And don't do this:

```asm
	or 0 ; 2 bytes, 2 cycles
```

And don't do this:

```asm
	and $ff ; 2 bytes, 2 cycles
```

Instead, do this:

```asm
	or a ; 1 byte, 1 cycle
```

Or do this:

```asm
	and a ; 1 byte, 1 cycle
```


### Compare `a` to 1

Do this:

```asm
	cp 1 ; 2 bytes, 2 cycles; updates Z and C flags
```

Or if you don't care about the value in `a`, and don't need to set the carry flag, then do this:


```asm
	dec a ; 1 byte, 1 cycle; decrements a, updates Z flag
```

Note that you can still do `inc a` afterwards, which is one cycle faster if the jump is taken. Compare this:

```asm
	; 4 bytes, 4 or 5 cycles
	cp 1
	jr z, .equals1
```

with this:

```asm
	; 4 bytes, 4 cycles
	dec a
	jr z, .equals1
	inc a
```


### Compare `a` to 255

(255, or $FF in hexadecimal, is the same as −1 due to [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement).)

Do this:

```asm
	cp $ff ; 2 bytes, 2 cycles; updates Z and C flags
```

Or if you don't care about the value in `a`, and don't need to set the carry flag, then do this:

```asm
	inc a ; 1 byte, 1 cycle; increments a, updates Z flag
```

Note that you can still do `dec a` afterwards, which is one cycle faster if the jump is taken. Compare this:

```asm
	; 4 bytes, 4 or 5 cycles
	cp $ff
	jr z, .equals255
```

with this:

```asm
	; 4 bytes, 4 cycles
	inc a
	jr z, .equals255
	dec a
```


### Compare `a` to 0 after masking it

Don't do this:

```asm
	; 3 bytes, 3 cycles; sets zero flag if a == 0
	and MASK
	and a
```

Instead, do this:

```asm
	and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0
```


### Compare `a` to a mask after masking it

Don't do this:

```asm
	; 4 bytes, 4 cycles; sets zero flag if a == MASK and carry flag if a < MASK
	and MASK
	cp MASK
```

If you don't need to set the carry flag, and don't need the masked value of `a`, then do this:

```asm
	; 3 bytes, 3 cycles; sets zero flag if a was equal to MASK
	or ~MASK
	inc a
```
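
The `or` forces every bit outside the mask to 1, so `a` becomes $ff only if every bit inside the mask was already set, and `inc a` then wraps $ff to 0. A worked trace, assuming the illustrative value `MASK` = %00001111:

```asm
	; assumed for illustration: MASK = %00001111, so ~MASK = %11110000 as a byte
	ld a, %10101111 ; all four masked bits are set
	or %11110000    ; a = %11111111
	inc a           ; a = 0, so the zero flag is set

	ld a, %10100111 ; one masked bit is clear
	or %11110000    ; a = %11110111
	inc a           ; a = %11111000, so the zero flag is clear
```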


### Test whether `a` is negative (compare `a` to $80)

If you don't need to preserve the value in `a`, then don't do this:

```asm
	; 4 bytes, 4 or 5 cycles
	cp $80
	jr nc, .negative
```

And don't do this:

```asm
	; 4 bytes, 4 or 5 cycles
	bit 7, a
	jr nz, .negative
```

Instead, do this:

```asm
	; 3 bytes, 3 or 4 cycles; modifies a
	rlca
	jr c, .negative
```


## Subroutines (functions)


### Tail call optimization

Don't do this:

```asm
	; 4 bytes, 10 cycles
	call Function
	ret
```

Instead, do this:

```asm
	jp Function ; 3 bytes, 4 cycles
```


### Call `hl`

Don't do this:

```asm
	; 5 bytes, 8 cycles
	(some code)
	ld de, .return
	push de
	jp hl

.return:
	(some more code)
```

Instead, do this:

```asm
	; 3 bytes, 6 cycles
	; (4 bytes, 7 cycles, counting the definition of _hl_)
	(some code)
	call _hl_
	(some more code)
```

`_hl_` is a routine already defined in [home/call_regs.asm](../blob/master/home/call_regs.asm):

```asm
_hl_::
	jp hl
```


### Inlining

Don't do this:

```asm
	; 4 additional bytes, 10 additional cycles
	(some code)
	call Function
	(some more code)

Function:
	(function code)
	ret
```

if `Function` is only called once. Instead, do:

```asm
	(some code)

	; Function
	(function code)

	(some more code)
```

You shouldn't do this if `Function` used any `ret`urns besides the one at the very end, or if inlining its code would make some `jr`s too distant from their targets.


### Fallthrough

Don't do this:

```asm
	(some code)
	call Function
	ret

Function:
	(function code)
	ret
```

And don't do this:

```asm
	(some code)
	jp Function

Function:
	(function code)
	ret
```

Instead, do this:

```asm
	(some code)
	; fallthrough
Function:
	(function code)
	ret
```

Fallthrough is what you get when you combine inlining with tail calls.  You can still `call Function` elsewhere, but one tail call can be optimized into a fallthrough.


### Conditional fallthrough

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	(some code)
	jr z, .foo
	jr .bar

.foo
	(foo code)

.bar
	(bar code)
```

Instead, do this:

```asm
	(some code)
	jr nz, .bar
	; fallthrough
.foo
	(foo code)

.bar
	(bar code)
```


### Conditional return

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	jr z, .skip
	ret
.skip
	...
```

And don't do this:

```asm
	; 3 bytes, 7 or 2 cycles
	jr nz, .return
	...

.return
	ret
```

Instead, do this:

```asm
	; 1 byte, 5 or 2 cycles
	ret nz
	...
```


### Conditional call

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 5 bytes, 3 or 8 cycles
	jr nz, .skip
	call Foo
.skip
```

Instead, do this:

```asm
	; 3 bytes, 6 or 3 cycles
	call z, Foo
```

And don't do this:

```asm
	; 5 bytes, 3 or 6 cycles
	jr nz, .skip
	jp Foo
.skip
```

Instead, do this:

```asm
	; 3 bytes, 4 or 3 cycles
	jp z, Foo
```


### Conditional `rst $38`

(The example uses `z`, but `nz`, `c`, or `nc` would also work.)

Don't do this:

```asm
	; 5 bytes, 3 or 14 cycles
	call z, RstVector38
	...

RstVector38:
	rst $38
	ret
```

And don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	jr nz, .no_rst_38
	rst $38
.no_rst_38
	...
```

And don't do this:

```asm
	; 3 bytes, 3 or 6 cycles
	call z, $0038
	...
```

Instead, do this:

```asm
	; 2 bytes, 2 or 7 cycles
	jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
	...
```

(The label `@` evaluates to the current `pc` value, which in `jr z, @ + 1` is right before the `jr` instruction. The instruction consists of two bytes, the opcode and the relative offset. `@ + 1` evaluates to in-between those two bytes. The `jr` instruction encodes its offset relative to the *end* of the instruction, i.e. the *next* `pc` value after the instruction has been read, so the relative offset is `-1`, aka `$ff`.)
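
Laid out byte by byte, the instruction looks like this. The address $4000 is hypothetical and only there to make the arithmetic concrete:

```asm
	; bytes emitted for the instruction, at a hypothetical address $4000:
	;   $4000: $28 ; opcode of jr z
	;   $4001: $ff ; offset -1, relative to $4002
	; if z is set, execution jumps to $4002 - 1 = $4001 and runs $ff as rst $38
	; if z is not set, execution continues at $4002
	jr z, @ + 1
```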


### Enable interrupts and return

Don't do this:

```asm
	; 2 bytes, 5 cycles
	ei
	ret
```

Instead, do this:

```asm
	; 1 byte, 4 cycles
	reti
```


## Jump and lookup tables


### Chain comparisons

Don't do this:

```asm
	cp 1
	jr z, .equals1
	cp 2
	jr z, .equals2
	cp 3
	jr z, .equals3
	...
```

Instead, do this:

```asm
	dec a
	jr z, .equals1
	dec a
	jr z, .equals2
	dec a
	jr z, .equals3
	...
```

Or do this:

```asm
	dec a
	ld hl, .jumptable
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	jp hl

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```

Or better, do:

```asm
	dec a
	ld hl, .jumptable
	rst JumpTable
	ret

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...
```

`JumpTable` is an `rst` routine already defined in [home/header.asm](../blob/master/home/header.asm):

```asm
JumpTable::
	push de
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	pop de
	jp hl
```


### Off-by-one `AddNTimes`

Don't do this:

```asm
	ld hl, Foo
	ld bc, BAR
	dec a
	call AddNTimes
```

Instead, as long as you don't need to add 255 times when `a` is 0, then do this:

```asm
	ld hl, Foo - BAR
	ld bc, BAR
	call AddNTimes
```
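
This works because starting one step early cancels the `dec a`: `(Foo - BAR) + a * BAR` equals `Foo + (a - 1) * BAR`. A worked trace, assuming the hypothetical values `Foo` = $c000, `BAR` = $10, and `a` = 3:

```asm
	; assumed for illustration: Foo = $c000, BAR = $10
	ld hl, $c000 - $10 ; hl = $bff0
	ld bc, $10
	ld a, 3
	call AddNTimes     ; hl = $bff0 + 3 * $10 = $c020 = Foo + (3 - 1) * BAR
```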