Commit Graph

1424 Commits

Author SHA1 Message Date
Lasse Collin 4eae859ae8
Windows: Disable MinGW-w64's stdio functions in size-optimized builds
This only affects builds with UCRT. With legacy MSVCRT, the replacement
functions are always enabled.

Omitting the MinGW-w64 replacements saves over 20 KiB per executable.
The downside is that --enable-small or XZ_SMALL=ON disables thousand
separator support in xz messages. If someone is OK with the slower
speed of slightly smaller builds, lack of thousand separators won't
matter.

Don't override __USE_MINGW_ANSI_STDIO if it is already defined (via
CPPFLAGS or such method).
2025-01-22 15:39:05 +02:00
Lasse Collin a831bc185b
liblzma: Add raw ARM64, RISC-V, and x86 BCJ filter APIs
Put them behind the LZMA_UNSTABLE macro for now.

These low-level special APIs might become useful in erofs-utils.
2025-01-20 16:44:27 +02:00
Lasse Collin 6f5cdd4534
xz: Unify a few strings with liblzma
Avoid having both "%s: foo" and "foo" as translatable strings
so that translators don't need to handle it twice.
2025-01-20 16:31:49 +02:00
Lasse Collin 713fdaa8b0
xz: Translate error messages from lzma_str_to_filters()
liblzma doesn't use gettext but the messages are included in xz.pot,
so xz can translate the messages.
2025-01-20 16:31:49 +02:00
Lasse Collin f2e2b267ca
liblzma: Mark string conversion messages as translatable 2025-01-20 16:31:49 +02:00
Lasse Collin f49d7413d9
liblzma: Tweak a few error messages in lzma_str_to_filters() 2025-01-20 16:31:35 +02:00
Lasse Collin 51f038f8cb
liblzma: memcmplen.h: Use 8-byte method on 64-bit unaligned archs
Previously it was enabled only on x86-64 and ARM64 when also support
for unaligned access was detected or manually enabled at built time.

In the default build configuration, the 8-byte method is now enabled
also on 64-bit RISC-V and 64-bit PowerPC (both endiannesses). It was
reported that on big endian POWER9, encoding time may reduce 12-13 %.

This change only affects builds with GCC and Clang because the code
uses __builtin_ctzll or __builtin_clzll.

Thanks to Marcus Comstedt for testing on POWER9.
2025-01-13 08:44:58 +02:00
Lasse Collin 150356207c
liblzma: Fix the encoder breakage on big endian ARM64
When the 8-byte method was enabled for ARM64, a check for endianness
wasn't added. This broke the LZMA/LZMA2 encoder. Test suite caught it.

Fixes: cd64dd70d5
Co-authored-by: Marcus Comstedt <marcus@mc.pp.se>
2025-01-12 13:08:55 +02:00
Lasse Collin b01b095802
Windows: Update manifest comments about long UTF-8 filenames 2025-01-12 13:08:55 +02:00
Lasse Collin 75d91d6b39
xz: Workaround broken O_SEARCH in musl
Testing with musl 1.2.5 and Linux 6.12, O_SEARCH doesn't result
in a file descriptor that works with fsync() although it should work.
See the added comment.

The same issue affected gzip --synchronous:

    https://bugs.gnu.org/75405

Thanks to Paul Eggert.
2025-01-08 19:20:28 +02:00
Lasse Collin ea92eae122
Revert "xz: O_SEARCH cannot be used for fsync()"
This reverts commit 4014e2479c.

POSIX-conforming O_SEARCH should allow fsync().
2025-01-08 19:20:21 +02:00
Lasse Collin 4014e2479c
xz: O_SEARCH cannot be used for fsync()
Opening a directory with O_SEARCH results in a file descriptor that can
be used with functions like openat(). Such a file descriptor cannot be
used with fsync(). Use O_RDONLY instead.

In musl, O_SEARCH becomes Linux-specific O_PATH. A file descriptor
from O_PATH doesn't allow fsync().

Seems that it's not possible to fsync() a directory that has write
and search permissions but not read permission.

Fixes: 2a9e91d796
2025-01-05 21:43:11 +02:00
Lasse Collin c405264c03
tuklib_mbstr_nonprint: Preserve the value of errno
A typical use case is like this:

    printf("%s: %s\n", tuklib_mask_nonprint(filename), strerror(errno));

tuklib_mask_nonprint() may call mbrtowc() and malloc() which may modify
errno. If errno isn't preserved, the error message might be wrong if
a compiler decides to call tuklib_mask_nonprint() before strerror().

Fixes: 40e5733055
2025-01-05 20:16:09 +02:00
Lasse Collin 2a9e91d796
xz: Use fsync() before deleting the input file, and add --no-sync
xz's default behavior is to delete the input file after successful
compression or decompression (unless writing to standard output).
If the system crashes soon after the deletion, it is possible that
the newly written file has not yet hit the disk while the previous
delete operation might have. In that case neither the original file
nor the written file is available.

Call fsync() on the file. On POSIX systems, sync also the directory
where the file was created.

Add a new option --no-sync which disables fsync() usage. It can avoid
a (possibly significant) performance penalty when processing many
small files. It's fine to use --no-sync when one knows that the files
are easy to recreate or restore after a system crash.

Using fsync() after every flush initiated by --flush-timeout was
considered. It wasn't implemented at least for now.

  - --flush-timeout is typically used when writing to stdout. If stdout
    is a file, xz cannot (portably) sync the directory of the file.
    One would need to create the output file first, sync the directory,
    and then run xz with fsync() enabled.

  - If xz --flush-timeout output goes to a file, it's possible to use
    a separate script to sync the file, for example, once per minute
    while telling xz to flush more frequently.

  - Not supporting syncing with --flush-timeout was simpler.

Portability notes:

  - On systems that lack O_SEARCH (like Linux), "xz dir/file" will now
    fail if "dir" cannot be opened for reading. If "dir" still has
    write and search permissions (like d-wx------ in "ls -l"),
    previously xz would have been able to compress "dir/file" still.
    Now it only works if using --no-sync (or --keep or --stdout).

  - <libgen.h> and dirname() should be available on all POSIX systems,
    and aren't needed on non-POSIX systems.

  - fsync() is available on all POSIX systems. The directory syncing
    could be changed to fdatasync() although at least on ext4 it
    doesn't seem to make a performance difference in xz's usage.
    fdatasync() would need a build system check to support (old)
    special cases, for example, MINIX 3.3.0 doesn't have fdatasync()
    and Solaris 10 needs -lrt.

  - On native Windows, _commit() is used to replace fsync(). Directory
    syncing isn't done and shouldn't be needed. (In Cygwin, fsync() on
    directories is a no-op.)

  - DJGPP has fsync() for files. ;-)

Using fsync() was considered somewhere around 2009 and again in 2016 but
those times the idea was rejected. For comparison, GNU gzip 1.7 (2016)
added the option --synchronous which enables fsync().

Co-authored-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Fixes: https://bugs.debian.org/814089
Link: https://www.mail-archive.com/xz-devel@tukaani.org/msg00282.html
Closes: https://github.com/tukaani-project/xz/pull/151
2025-01-05 20:16:08 +02:00
Lasse Collin 2e28c71457
xz: Use "goto" for error handling in io_open_dest_real() 2025-01-05 20:16:01 +02:00
Lasse Collin 7510721767
liblzma: Always validate the first digit of a preset string
lzma_str_to_filters() may call parse_lzma12_preset() in two ways. The
call from str_to_filters() detects the string type from the first
character(s) and as a side-effect it validates the first digit of
the preset string. So this change makes no difference there.

However, the call from parse_options() doesn't pre-validate the string.
parse_lzma12_preset() will return an invalid value which is passed to
lzma_lzma_preset() which safely rejects it. The bug still affects the
the error message:

    $ xz --filters=lzma2:preset=X
    xz: Error in --filters=FILTERS option:
    xz: lzma2:preset=X
    xz:               ^
    xz: Unsupported preset

After the fix:

    $ xz --filters=lzma2:preset=X
    xz: Error in --filters=FILTERS option:
    xz: lzma2:preset=X
    xz:              ^
    xz: Unsupported preset

The ^ now correctly points to the X and not past it because the X itself
is the problematic character.

Fixes: cedeeca2ea
2025-01-05 12:58:22 +02:00
Lasse Collin 52ff324337
xz: Fix getopt_long argument type in --filters*
Forgetting the argument (or not using = to separate the option from
the argument) resulted in lzma_str_to_filters() being called with NULL
as input string argument. The function handles it fine but xz passes
the NULL to printf() too:

    $ xz --filters
    xz: Error in --filters=FILTERS option:
    xz: (null)
    xz: ^
    xz: Unexpected NULL pointer argument(s) to lzma_str_to_filters()

Now it's correct:

    $ xz --filters
    xz: option '--filters' requires an argument

The --filters-help option doesn't take any arguments.

Fixes: 9ded880a02
Fixes: d6af7f3470
Fixes: a165d7df19
2025-01-05 11:41:40 +02:00
Lasse Collin 2655c81b5e
xzdec: Don't leave Landlock file descriptor open for no reason
This fix is similar to 48ff3f0652.

Fixes: d74fb5f060
2025-01-04 20:05:03 +02:00
Lasse Collin 35df4c2bc0
xz: Make --single-stream imply --keep
Suggested by xx on #tukaani on 2024-04-12.
2025-01-04 20:02:18 +02:00
Lasse Collin 6f412814a8
Update AUTHORS
The contributions have been rewritten.
2025-01-04 19:57:17 +02:00
Lasse Collin 5651d15303
xz: Avoid printf formats like %2$s
It's a POSIX feature that isn't in standard C. It's not available on
Windows. Even MinGW-w64 with __USE_MINGW_ANSI_STDIO doesn't support
it even though it supports POSIX %'d for thousand separators.

Gettext's <libintl.h> provides overrides for printf and other functions
which do support the %2$s formats. Translations use them. But xz should
work on Windows without <libintl.h> too.

Fixes: 3e9177fd20
2025-01-04 17:37:46 +02:00
Lasse Collin 63b246c90e
tuklib_mbstr_wrap: Add printf format attribute
It's supported by GCC 3.x already.
2025-01-04 17:37:46 +02:00
Lasse Collin a7313c01d9
xz: Translate a Windows-specific string
Originally I thought that native Windows builds wouldn't be translated
but nowadays at least MSYS2 ships such binaries.
2025-01-04 17:37:39 +02:00
Lasse Collin 00eb6073c0
xz: Use my_landlock.h
A slightly silly thing is that xz may now query the ABI version up to
three times. We could call my_landlock_ruleset_attr_forbid_all() only
once and cache the result but it didn't seem worth doing.
2025-01-02 15:43:38 +02:00
Lasse Collin 0fc5a625d7
xzdec: Use my_landlock.h 2025-01-02 15:43:38 +02:00
Lasse Collin 38cb8ec9fd
Add my_landlock.h with helper functions to use Linux Landlock
This supports up to Landlock ABI version 6. The current code in
xz and xzdec only support up to ABI version 4.
2025-01-02 15:43:38 +02:00
Lasse Collin 672da29bb3
liblzma: Silence warnings from "clang -Wimplicit-fallthrough" 2025-01-02 15:43:38 +02:00
Lasse Collin 94adc996e4
Replace "Fall through" comments with FALLTHROUGH 2025-01-02 15:43:37 +02:00
Lasse Collin f31c3a6647
sysdefs.h: Add FALLTHROUGH macro 2025-01-02 15:43:37 +02:00
Lasse Collin e34dbd6a0a
xzdec: Fix language in a comment 2025-01-02 15:43:37 +02:00
Lasse Collin 16821252c5
Windows: Make NLS require UCRT and gettext-runtime >= 0.23.1
Also remove the recently-added workaround from tuklib_gettext.h.
Requiring a new enough gettext-runtime is cleaner. I guess it's
mostly MSYS2 where xz is built with translation support, so once
MSYS2 has Gettext >= 0.23.1, this requirement shouldn't be a problem
in practice.
2025-01-02 15:35:25 +02:00
Lasse Collin 653732bd6f
xz man page: Describe the source file deletion in -z and -d options
The DESCRIPTION section always explained it, and the OPTIONS section
only described the differences to the default behavior. However, new
users in a hurry may skip reading DESCRIPTION. The default behavior
is a bit dangerous, thus it's good to repeat in --compress and
--decompress docs that source file is removed after successful operation.

Fixes: https://github.com/tukaani-project/xz/issues/150
2024-12-30 10:51:26 +02:00
Lasse Collin bb79f79b27
Build: Set libtool -version-info so that it matches with CMake
In the past, they haven't been in sync in development versions
although they (of course) have been in stable releases.
2024-12-29 10:54:45 +02:00
Lasse Collin 260d5d3620
xz: Fix comments 2024-12-27 09:14:56 +02:00
Lasse Collin f8c328eed1
Windows: Workaround a UTF-8 issue in Gettext's libintl_setlocale()
See the comment. In this package, locale is set at program startup and
not changed later, so the point (2) in the comment isn't a problem.

Fixes: 46ee006162
2024-12-20 16:33:34 +02:00
Lasse Collin 0353390609
Revert "Windows: Use UTF-8 locale when active code page is UTF-8"
This reverts commit 0d0b574cc4.
2024-12-20 16:33:34 +02:00
Lasse Collin 4b319e05af
xzdec: Use setlocale() instead of tuklib_gettext_setlocale()
xzdec isn't translated and doesn't need libintl on Windows even
when NLS is enabled, thus libintl_setlocale() cannot interfere
with the locale settings. Thus, standard setlocale() works perfectly.

In the commit 78868b6e, the explanation in the commit message is wrong.

Fixes: 78868b6ed6
2024-12-20 16:33:34 +02:00
Lasse Collin 34b80e282e
Windows: Revert the setlocale(LC_ALL, ".UTF8") documentation
Only leave the FindFileFirstA() notes from 20dfca81, reverting
the incorrect setlocale() notes. On Windows, Gettext's <libintl.h>
overrides setlocale() with libintl_setlocale() wrapper. I hadn't
noticed this, and thus my conclusions were wrong.

Fixes: 20dfca8171
2024-12-20 16:33:28 +02:00
Lasse Collin 5794cda064
tuklib_mbstr_wrap: Silence a warning from Clang
Fixes: ca529c3f41
2024-12-18 17:50:58 +02:00
Lasse Collin 22a35e64ce
lzmainfo: Use tuklib_mbstr_nonprint 2024-12-18 17:09:32 +02:00
Lasse Collin 03111595ee
xzdec: Use tuklib_mbstr_nonprint 2024-12-18 17:09:32 +02:00
Lasse Collin d22f96921f
xz: Use tuklib_mbstr_nonprint
Call tuklib_mask_nonprint() on filenames and also on a few other
strings from the command line too.

The filename printed by "xz --robot --list" (in list.c) is also masked.
It's good to get rid of tabs and newlines which would desync the output
but masking other chars wouldn't be strictly necessary. It might matter
with sensible filenames if LC_CTYPE is "C" (when iswprint() might reject
non-ASCII chars) and a script wants to read a filename from xz's output.
Hopefully it's an unusual enough corner case to not be a real problem.
2024-12-18 17:09:32 +02:00
Lasse Collin 40e5733055
Add tuklib_mbstr_nonprint to mask non-printable characters
Malicious filenames or other untrusted strings may affect the state of
the terminal when such strings are printed as part of (error) messages.
Add functions that mask such characters.

It's not enough to handle only single-byte control characters.
In multibyte locales, some control characters are multibyte too, for
example, terminals interpret C1 control characters (U+0080 to U+009F)
that are two bytes as UTF-8.

Instead of checking for control characters with iswcntrl(), this
uses iswprint() to detect printable characters. This is much stricter.
On Windows it's actually too strict as it rejects some characters that
definitely are printable.

Gnulib's quotearg would do a lot more but I hope this simpler method
is good enough here.

Thanks to Ryan Colyer for the discussion about the problems of
the earlier single-byte-only method.

Thanks to Christian Weisgerber for reporting a bug in an earlier
version of this code.

Thanks to Jeroen Roovers for a typo fix.

Closes: https://github.com/tukaani-project/xz/pull/118
2024-12-18 17:09:32 +02:00
Lasse Collin 4a0c4f92b8
xz: Make one string simpler for translators
Leading spaces in the string can get miscounted by translators.
2024-12-18 17:09:31 +02:00
Lasse Collin 3fcf547e92
lzmainfo: Sync the translatable strings with xz 2024-12-18 17:09:31 +02:00
Lasse Collin 3e9177fd20
xz: Use automatic word wrapping for help texts
--long-help is now one line longer because --lzma1 is now on its
own line.
2024-12-18 17:09:31 +02:00
Lasse Collin ca529c3f41
Add tuklib_mbstr_wrap for automatic word wrapping
Automatic word wrapping makes translators' work easier and reduces
errors like misaligned columns or overlong lines. Right-to-left
languages and languages that don't use spaces between words will
still need extra effort. (xz hasn't been translated to any RTL
language so far.)
2024-12-18 17:09:31 +02:00
Lasse Collin 314b83ceba
Build: Sort filenames to ASCII order in Makefile.am 2024-12-18 17:09:31 +02:00
Lasse Collin df399c5255
tuklib_mbstr_width: Add tuklib_mbstr_width_mem()
It's a new function split from tuklib_mbstr_width().
It's useful with partial strings that aren't terminated with \0.
2024-12-18 17:09:30 +02:00
Lasse Collin 51081efae4
tuklib_mbstr_width: Update a comment about shift states 2024-12-18 17:09:30 +02:00