It's faster with both tiny and large buffers and doesn't require
disabling any sanitizers. With large buffers the extra speed is
from folding four 16-byte chunks in parallel.
The 32-bit x86 with MSVC reportedly still needs a workaround.
Now the simpler "__asm mov ebx, ebx" trick is enough but it
needs to be in lzma_crc64() instead of crc64_arch_optimized().
Thanks to Iouri Kharon for testing and the fix.
Thanks to Ilya Kurdyukov for testing the speed with aligned and
unaligned buffers on a few x86 processors and on E2K v6.
Thanks to Sam James for general feedback.
Fixes: https://github.com/tukaani-project/xz/issues/112
Fixes: https://github.com/tukaani-project/xz/issues/122
Now it refers to crc_clmul_consts_gen.c. vfold8 was renamed to mu_p
and the p no longer has the lowest bit set (it makes no difference
as the output bits it affects are ignored).
It's not enough to silence the address sanitizer. Also memory and
thread sanitizers would need to be silenced. They, at least currently,
aren't smart enough to see that the extra bytes are discarded from
the xmm registers by later instructions.
Valgrind is smarter, possibly because this kind of code isn't weird
to write in assembly. Agner Fog's optimizing_assembly.pdf even mentions
this idea of doing an aligned read and then discarding the extra
bytes. The sanitizers don't instrument assembly code but Valgrind
checks all code.
It's better to change the implementation to avoid the sanitization
attributes which also look scary in the code. (Somehow they can look
more scary than __asm__ which is implictly unsanitized.)
See also:
https://github.com/tukaani-project/xz/issues/112https://github.com/tukaani-project/xz/issues/122
This is *NOT* done for security reasons even though the backdoor
relied on the ifunc code. Instead, the reason is that in this
project ifunc provides little benefits but it's quite a bit of
extra code to support it. The only case where ifunc *might* matter
for performance is if the CRC functions are used directly by an
application. In normal compression use it's completely irrelevant.
Now crc_simd_body() in crc_x86_clmul.h is only called once
in a translation unit, we no longer need to be so cautious
about ensuring the always-inline behavior.
CRC_CLMUL was split to CRC_ARCH_OPTIMIZED and CRC_X86_CLMUL.
CRC_ARCH_OPTIMIZED is defined when an arch-optimized version is used.
Currently the x86 CLMUL implementations are the only arch-optimized
versions, and these also use the CRC_x86_CLMUL macro to tell when
crc_x86_clmul.h needs to be included.
is_clmul_supported() was renamed to is_arch_extension_supported().
crc32_clmul() and crc64_clmul() were renamed to
crc32_arch_optimized() and crc64_arch_optimized().
This way the names make sense with arch-specific non-CLMUL
implementations as well.
A CLMUL-only build will have the crcxx_clmul() inlined into
lzma_crcxx(). Previously a jump to the extern lzma_crcxx_clmul()
was needed. Notes about shared liblzma on ELF platforms:
- On platforms that support ifunc and -fvisibility=hidden, this
was silly because CLMUL-only build would have that single extra
jump instruction of extra overhead.
- On platforms that support neither -fvisibility=hidden nor linker
version script (liblzma*.map), jumping to lzma_crcxx_clmul()
would go via PLT so a few more instructions of overhead (still
not a big issue but silly nevertheless).
There was a downside with static liblzma too: if an application only
needs lzma_crc64(), static linking would make the linker include the
CLMUL code for both CRC32 and CRC64 from crc_x86_clmul.o even though
the CRC32 code wouldn't be needed, thus increasing code size of the
executable (assuming that -ffunction-sections isn't used).
Also, now compilers are likely to inline crc_simd_body()
even if they don't support the always_inline attribute
(or MSVC's __forceinline). Quite possibly all compilers
that build the code do support such an attribute. But now
it likely isn't a problem even if the attribute wasn't supported.
Now all x86-specific stuff is in crc_x86_clmul.h. If other archs
The other archs can then have their own headers with their own
is_clmul_supported() and crcxx_clmul().
Another bonus is that the build system doesn't need to care if
crc_clmul.c is needed.
is_clmul_supported() stays as inline function as it's not needed
when doing a CLMUL-only build (avoids a warning about unused function).