Not all RISC-V processors support fast unaligned access so
it's better to read only one byte in the main loop. This can
be faster even on x86-64 when compared to reading 32 bits at
a time as half the time the address is only 16-bit aligned.
The downside is larger code size on archs that do support
fast unaligned access.