On GitHub runners, VS 2019 16.11 (MSVC 19.29.30158) results in
test failures. VS 2022 17.13 (MSVC 19.43.34808) works.
In xz 5.6.x there was a #pragma-based workaround for MSVC builds for
32-bit x86. Another method was thought to work with the newly rewritten
CLMUL CRC. Apparently it doesn't. Keep it simple and disable CLMUL CRC
with any non-recent MSVC when building for 32-bit x86.
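The idea, roughly, is a preprocessor check like the following (the
macro name and the exact _MSC_VER cutoff are illustrative, not the
exact ones used in liblzma):

    /* Sketch only: skip the CLMUL code path when building 32-bit x86
     * with an MSVC older than VS 2022. */
    #if defined(_MSC_VER) && _MSC_VER < 1930 && defined(_M_IX86)
    #   undef CRC_X86_CLMUL
    #endif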
Fixes: 54eaea5ea49b ("liblzma: x86 CLMUL CRC: Rewrite")
Fixes: https://github.com/tukaani-project/xz/issues/171
Reported-by: Andrew Murray
This reverts commit dc03f6290f5b9bd3d50c7e12e58dee870889d599.
OpenBSD 7.6 will support elf_aux_info(3), and the detection code used
on FreeBSD will work on OpenBSD 7.6 too. Keep things simpler and drop
the OpenBSD-specific sysctl() method.
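For reference, a minimal sketch of the shared elf_aux_info(3) based
check (simplified; the exact headers and HWCAP flag names can vary
between systems):

    #include <sys/auxv.h>
    #include <stdbool.h>

    static bool
    is_arch_extension_supported(void)
    {
        /* On failure hwcap stays zero and the generic code is used. */
        unsigned long hwcap = 0;
        elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap));
        return (hwcap & HWCAP_CRC32) != 0;
    }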
Thanks to Christian Weisgerber.
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32
result for 1/2/4/8 bytes in a single operation. Using these is much
faster compared to the generic method.
Optimized CRC32 is enabled unconditionally on 64-bit LoongArch because
the LoongArch specification says that CRC32 instructions shall be
implemented for 64-bit processors. Optimized CRC32 isn't enabled for
32-bit LoongArch processors because not enough information is available
about them.
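As a rough sketch of the idea (not the actual liblzma code; the helper
name and the use of inline assembly instead of the <larchintrin.h>
intrinsics are only for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t
    crc32_loongarch(const uint8_t *buf, size_t size, uint32_t crc)
    {
        crc = ~crc;

        /* crc.w.d.w processes 8 bytes per instruction. */
        while (size >= 8) {
            uint64_t chunk;
            memcpy(&chunk, buf, 8);
            __asm__("crc.w.d.w %0, %1, %0" : "+r"(crc) : "r"(chunk));
            buf += 8;
            size -= 8;
        }

        /* Finish the remaining bytes with crc.w.b.w. */
        while (size-- > 0)
            __asm__("crc.w.b.w %0, %1, %0"
                    : "+r"(crc) : "r"((uint32_t)*buf++));

        return ~crc;
    }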
Co-authored-by: Lasse Collin <lasse.collin@tukaani.org>
Closes: https://github.com/tukaani-project/xz/pull/86
Prefix ARM64_RUNTIME_DETECTION with CRC_ and reorder it to be with
the other ARM64-specific lines. That macro isn't used outside this
file.
An ARM64 CLMUL implementation doesn't exist yet, so CRC64_ARM64_CLMUL
isn't used anywhere yet.
It's not ideal that the single-letter CRC utility macros are here
as they pollute the namespace of the LZ encoder files. Those could
be moved to their own crc_macros.h like they were in 5.2.x, but in practice
this is fine enough already.
LZ encoder needs lzma_crc32_table[0] but otherwise those tables
are private to the CRC code. In contrast, the other things in
check.h are needed in several places.
Now runtime detection of CLMUL support can pick between the CLMUL and
the generic assembly implementations. Whatever overhead this has for
builds that omit CLMUL completely isn't important because builds for
any non-ancient system are likely to include the CLMUL code too.
Handle the CRC tables in crcXX_fast.c files because now these files
are built even when assembly code is used.
If 32-bit x86 assembly is enabled then it will always be built even
if compiler flags were such that CLMUL would be allowed unconditionally.
That is, runtime detection will be used anyway. This keeps the build
rules simpler.
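Schematically, the runtime dispatch works like this (an illustrative
sketch; the real code caches the detection result and may go through
ifunc instead of a branch):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern uint32_t crc32_generic(const uint8_t *buf, size_t size,
            uint32_t crc);
    extern uint32_t crc32_arch_optimized(const uint8_t *buf, size_t size,
            uint32_t crc);
    extern bool is_arch_extension_supported(void);

    uint32_t
    lzma_crc32(const uint8_t *buf, size_t size, uint32_t crc)
    {
        /* Both implementations are in the build; the CLMUL one is
         * picked when the processor supports it. */
        return is_arch_extension_supported()
                ? crc32_arch_optimized(buf, size, crc)
                : crc32_generic(buf, size, crc);
    }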
In the LZ encoder, build and use lzma_lz_hash_table[256] if the CLMUL CRC
is used without runtime detection. Previously this wasn't needed
because crc32_table.c included the lzma_crc32_table[][] in the build
unless encoder support had been disabled. Including an 8 KiB table
was silly when only 1 KiB is actually used. So now liblzma is 7 KiB
smaller if CLMUL is enabled without runtime detection.
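To illustrate the sizes (declarations shown schematically; only the
array dimensions matter here):

    #include <stdint.h>

    /* The table-based CRC32 needs the full 8 x 256 x 4-byte table
     * (8 KiB), but the LZ encoder's hash reads only 256 entries
     * (1 KiB), so it now gets its own small table. */
    extern const uint32_t lzma_crc32_table[8][256];
    extern const uint32_t lzma_lz_hash_table[256];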
On E2K the function compiles only due to compiler emulation but the
function is never used. It's cleaner to omit the function when it's
not needed even though it's a "static inline" function.
Thanks to Ilya Kurdyukov.
It's faster with both tiny and large buffers and doesn't require
disabling any sanitizers. With large buffers the extra speed is
from folding four 16-byte chunks in parallel.
The 32-bit x86 with MSVC reportedly still needs a workaround.
Now the simpler "__asm mov ebx, ebx" trick is enough but it
needs to be in lzma_crc64() instead of crc64_arch_optimized().
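In other words, roughly the following (heavily simplified; the real
lzma_crc64() has more logic around the call):

    #include <stddef.h>
    #include <stdint.h>

    extern uint64_t crc64_arch_optimized(const uint8_t *buf, size_t size,
            uint64_t crc);

    uint64_t
    lzma_crc64(const uint8_t *buf, size_t size, uint64_t crc)
    {
    #if defined(_MSC_VER) && defined(_M_IX86)
        /* Dummy instruction; what matters is that it is here and not
         * in crc64_arch_optimized(). */
        __asm mov ebx, ebx
    #endif
        return crc64_arch_optimized(buf, size, crc);
    }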
Thanks to Iouri Kharon for testing and the fix.
Thanks to Ilya Kurdyukov for testing the speed with aligned and
unaligned buffers on a few x86 processors and on E2K v6.
Thanks to Sam James for general feedback.
Fixes: https://github.com/tukaani-project/xz/issues/112
Fixes: https://github.com/tukaani-project/xz/issues/122
Now it refers to crc_clmul_consts_gen.c. vfold8 was renamed to mu_p
and the p no longer has the lowest bit set (it makes no difference
as the output bits it affects are ignored).
It's not enough to silence the address sanitizer. Also memory and
thread sanitizers would need to be silenced. They, at least currently,
aren't smart enough to see that the extra bytes are discarded from
the xmm registers by later instructions.
Valgrind is smarter, possibly because this kind of code isn't weird
to write in assembly. Agner Fog's optimizing_assembly.pdf even mentions
this idea of doing an aligned read and then discarding the extra
bytes. The sanitizers don't instrument assembly code but Valgrind
checks all code.
It's better to change the implementation to avoid the sanitization
attributes, which also look scary in the code. (Somehow they can look
scarier than __asm__, which is implicitly unsanitized.)
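For context, the aligned-read idea looks roughly like this
(illustrative only, not the code that was in crc_x86_clmul.h):

    #include <emmintrin.h>
    #include <stdint.h>

    static __m128i
    load_first_chunk(const uint8_t *buf)
    {
        /* Round the address down to a 16-byte boundary and load the
         * whole aligned block. The load may touch bytes before the
         * start of the buffer; later instructions discard them, but
         * sanitizers flag the load itself. */
        const uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)15;
        return _mm_load_si128((const __m128i *)addr);
    }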
See also:
https://github.com/tukaani-project/xz/issues/112
https://github.com/tukaani-project/xz/issues/122
The C code is from Christian Weisgerber; I merely reordered the OSes.
Then I added the build system checks without testing them.
Also thanks to Brad Smith who submitted a similar patch on GitHub
a few hours after Christian had sent his via email.
Co-authored-by: Christian Weisgerber <naddy@mips.inka.de>
Closes: https://github.com/tukaani-project/xz/pull/125
The __builtin_bswapXX builtins from GCC and Clang are preferred when
they are available. This can allow compilers to emit the x86 MOVBE
instruction instead of doing a load + byteswap as two instructions
(which would happen if the byteswapping is done in inline asm).
bswap16, bswap32, and bswap64 exist in system headers on *BSDs
and Darwin. #defining bswap16 on NetBSD results in a warning about
macro redefinition. It's safest to avoid this namespace conflict
completely.
No OS supported by tuklib_integer.h uses byteswapXX names and
a web search doesn't immediately find any obvious danger of
namespace conflicts. So let's try these still-pretty-short names
for the macros.
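A simplified sketch of the selection (the real tuklib_integer.h has
more fallbacks and uses configure-time checks instead of this compiler
check; only byteswap32 is shown):

    #include <stdint.h>

    #if defined(__GNUC__) || defined(__clang__)
        /* Lets the compiler emit e.g. x86 MOVBE for a load + swap. */
    #   define byteswap32(n) __builtin_bswap32(n)
    #else
        /* Portable fallback. */
    #   define byteswap32(n) \
            ((((uint32_t)(n) & 0x000000FFU) << 24) \
            | (((uint32_t)(n) & 0x0000FF00U) << 8) \
            | (((uint32_t)(n) & 0x00FF0000U) >> 8) \
            | (((uint32_t)(n) & 0xFF000000U) >> 24))
    #endif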
Thanks to Sam James for pointing out the compiler warning on
NetBSD 10.0.
This is *NOT* done for security reasons even though the backdoor
relied on the ifunc code. Instead, the reason is that in this
project ifunc provides little benefit but requires quite a bit of
extra code to support it. The only case where ifunc *might* matter
for performance is if the CRC functions are used directly by an
application. In normal compression use it's completely irrelevant.
While the backdoor was inactive (and thus harmless) unless a small
trigger code was inserted into the build system when the source package
was created, it's good to remove it anyway:
- The executable payloads were embedded as binary blobs in
the test files. This was a blatant violation of the
Debian Free Software Guidelines.
- On machines that see lots of bots poking at the SSH port, the backdoor
noticeably increased CPU load, resulting in degraded user experience
and thus overwhelmingly negative user feedback.
- The maintainer who added the backdoor has disappeared.
- Backdoors are bad for security.
This reverts the following without making any other changes:
6e636819 Tests: Update two test files.
a3a29bbd Tests: Test --single-stream can decompress bad-3-corrupt_lzma2.xz.
0b4ccc91 Tests: Update RISC-V test files.
8c9b8b20 liblzma: Fix typos in crc32_fast.c and crc64_fast.c.
82ecc538 liblzma: Fix false Valgrind error report with GCC.
cf44e4b7 Tests: Add a few test files.
3060e107 Tests: Use smaller dictionary size in RISC-V test files.
e2870db5 Tests: Add two RISC-V Filter test files.
The RISC-V test files also have real content that tests the filter
but the real content would fit into much smaller files. A generator
program would need to be available as well.
Thanks to Andres Freund for finding and reporting it and making
it public quickly so others could act without a delay.
See: https://www.openwall.com/lists/oss-security/2024/03/29/4
With GCC and a certain combination of flags, Valgrind will falsely
trigger an invalid write. This appears to be due to the omission of
instructions to properly save, set up, and restore the frame pointer.
The IFUNC resolver is a leaf function since it only calls a function
that is inlined. So sometimes GCC omits the frame pointer instructions
in the resolver unless this optimization is explicitly disabled.
This fixes https://bugzilla.redhat.com/show_bug.cgi?id=2267598.
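One way to force the frame pointer code in the resolver is a function
attribute like the one below (shown with GCC's optimize attribute as
one possibility; the declarations are placeholders and the actual fix
may differ in detail):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t (*crc64_func_type)(const uint8_t *buf, size_t size,
            uint64_t crc);

    extern uint64_t crc64_generic(const uint8_t *buf, size_t size,
            uint64_t crc);
    extern uint64_t crc64_arch_optimized(const uint8_t *buf, size_t size,
            uint64_t crc);
    extern bool is_arch_extension_supported(void);

    /* Equivalent to building this one function with
     * -fno-omit-frame-pointer. */
    __attribute__((__optimize__("no-omit-frame-pointer")))
    static crc64_func_type
    crc64_resolve(void)
    {
        return is_arch_extension_supported()
                ? &crc64_arch_optimized : &crc64_generic;
    }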
Perhaps the generated files aren't even copyrightable but
using the same license for them as for the rest of liblzma
keeps things more consistent for tools that look for license info.
The initial commit 5d018dc03549c1ee4958364712fb0c94e1bf2741
in 2007 had a comment in sha256.c that the code is based on
Crypto++ Library 5.5.1. In 2009 the Authors list in sha256.c
and the AUTHORS file was updated with information that the
code had come from Crypto++ but via 7-Zip. I know I had viewed
7-Zip's SHA-256 code, but back then the C code was identical
enough to Crypto++'s, so I don't know why I thought the author info
would need that extra step via 7-Zip for this single file.
Another error is that I had mixed up the sha.* and shacal2.* files
when checking for author info in Crypto++. The shacal2.* files
aren't related to liblzma's sha256.c and thus Kevin Springle's
code in Crypto++ isn't either.
If liblzma is configured with --disable-clmul-crc
CFLAGS="-msse4.1 -mpclmul", then it will fail to compile because the
generic version must be used but the CRC tables were not included.
The code was using HAVE_FUNC_ATTRIBUTE_IFUNC instead of CRC_USE_IFUNC.
ifunc is incompatible with ARM64 because the runtime detection there
requires non-inline function calls.
Even though the proper name for the architecture is aarch64, this
project uses ARM64 throughout. So the rename is for consistency.
Additionally, crc32_arm64.h was slightly refactored for the following
changes:
* Added MSVC, FreeBSD, and macOS support in
is_arch_extension_supported().
* crc32_arch_optimized() now checks the size when aligning the
buffer.
* crc32_arch_optimized() loop conditions were slightly modified to
avoid both decrementing the size and incrementing the buffer
pointer.
* Use the intrinsic wrappers defined in <arm_acle.h> because GCC and
Clang name them differently.
* Minor spacing and comment changes.
CRC_GENERIC is now split into CRC32_GENERIC and CRC64_GENERIC since
the ARM64 optimizations will differ between CRC32 and CRC64.
For the same reason, CRC_ARCH_OPTIMIZED is split into
CRC32_ARCH_OPTIMIZED and CRC64_ARCH_OPTIMIZED.
ifunc will only be used with x86-64 CLMUL because the runtime detection
methods needed with ARM64 are not compatible with ifunc.
The CRC32 instructions in ARM64 can calculate the CRC32 result
for 8 bytes in a single operation, making them much faster than
the generic CRC32 code.
The optimized CRC32 is enabled on Linux when the ARM64 CRC32
extension is available.
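The core of the optimized loop is roughly the following (simplified;
the function name matches later commits and the real code also handles
buffer alignment):

    /* Build with the CRC32 extension enabled, e.g. -march=armv8-a+crc. */
    #include <arm_acle.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t
    crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc)
    {
        crc = ~crc;

        /* __crc32d processes 8 bytes per instruction. */
        while (size >= 8) {
            uint64_t chunk;
            memcpy(&chunk, buf, 8);
            crc = __crc32d(crc, chunk);
            buf += 8;
            size -= 8;
        }

        while (size-- > 0)
            crc = __crc32b(crc, *buf++);

        return ~crc;
    }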
Signed-off-by: Chenxi Mao <chenxi.mao2013@gmail.com>
Now that crc_simd_body() in crc_x86_clmul.h is only called once
per translation unit, we no longer need to be so cautious about
ensuring the always-inline behavior.