.xz and .lzma Test Files
------------------------

0. Introduction

This directory contains a bunch of files to test handling of .xz,
.lzma (LZMA_Alone), and .lz (lzip) files in decoder implementations.
Many of the files have been created by hand with a hex editor, thus
there is no better "source code" than the files themselves. All the
test files and this README may be distributed under the terms of the
BSD Zero Clause License (0BSD).


1. File Types

Good files (good-*) must decode successfully without requiring a lot
of CPU time or RAM.

Unsupported files (unsupported-*) are good files, but headers
indicate features not supported by the current file format
specification.

Bad files (bad-*) must cause the decoder to give an error. Like with
the good files, these files must not require a lot of CPU time or RAM
before they get detected to be broken.


2. Descriptions of Individual .xz Files

2.1. Good Files

good-0-empty.xz has one Stream with no Blocks.

good-0pad-empty.xz has one Stream with no Blocks followed by
four-byte Stream Padding.

good-0cat-empty.xz has two zero-Block Streams concatenated without
Stream Padding.

good-0catpad-empty.xz has two zero-Block Streams concatenated with
four-byte Stream Padding between the Streams.

good-1-check-none.xz has one Stream with one Block with two
uncompressed LZMA2 chunks and no integrity check.

good-1-check-crc32.xz has one Stream with one Block with two
uncompressed LZMA2 chunks and CRC32 check.

good-1-check-crc64.xz is like good-1-check-crc32.xz but with CRC64.

good-1-check-sha256.xz is like good-1-check-crc32.xz but with
SHA256.

good-2-lzma2.xz has one Stream with two Blocks with one uncompressed
LZMA2 chunk in each Block.

good-1-block_header-1.xz has both Compressed Size and Uncompressed
Size in the Block Header. It also has four extra bytes of Header
Padding.

good-1-block_header-2.xz has known Compressed Size.

good-1-block_header-3.xz has known Uncompressed Size.
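
As a rough illustration of the four check types used by the
good-1-check-* files above, the following sketch uses Python's
standard lzma module to compress a short payload with each check and
to read the Check ID back from the Stream Header. The payload and the
helper name check_id_of are made up for this example. Files produced
this way contain LZMA-compressed chunks, so they are not byte-identical
to the hand-made test files.

```python
import lzma

# Check IDs as stored in the Stream Flags field of the .xz format.
CHECKS = {
    "none": lzma.CHECK_NONE,      # 0x00
    "crc32": lzma.CHECK_CRC32,    # 0x01
    "crc64": lzma.CHECK_CRC64,    # 0x04
    "sha256": lzma.CHECK_SHA256,  # 0x0A
}


def check_id_of(xz_bytes):
    """Return the Check ID from the second byte of Stream Flags.

    The Stream Header is 6 magic bytes, 2 Stream Flags bytes, and a
    4-byte CRC32, so the Check ID is in the low four bits of offset 7.
    """
    return xz_bytes[7] & 0x0F


payload = b"Hello\nWorld!\n"  # hypothetical payload for this sketch
results = {}
for name, check in CHECKS.items():
    if not lzma.is_check_supported(check):
        continue  # some liblzma builds may omit optional checks
    data = lzma.compress(payload, format=lzma.FORMAT_XZ, check=check)
    # Record the Check ID found in the header and whether the data
    # decodes back to the original payload.
    results[name] = (check_id_of(data), lzma.decompress(data) == payload)
```

Each entry in results pairs the Check ID found in the header with a
flag telling whether the round trip succeeded.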

good-1-delta-lzma2.tiff.xz is an image file that compresses better
with Delta+LZMA2 than with plain LZMA2.

good-1-x86-lzma2.xz uses the x86 filter (BCJ) and LZMA2. The
uncompressed file is compress_prepared_bcj_x86 found in the tests
directory.

good-1-sparc-lzma2.xz uses the SPARC filter and LZMA2. The
uncompressed file is compress_prepared_bcj_sparc found in the tests
directory.

good-1-arm64-lzma2-1.xz uses the ARM64 filter and LZMA2. The
uncompressed data is constructed so that it tests integer wrap around
and sign extension. To recreate the file, compress using XZ Utils
5.4.x (newer may or may not work too):

    ./debug/testfilegen-arm64 \
        | xz -T1 -Ccrc32 --arm64 \
            --lzma2=dict=64KiB,lp=2,lc=2 \
        > good-1-arm64-lzma2-1.xz

good-1-arm64-lzma2-2.xz is like good-1-arm64-lzma2-1.xz but with
non-zero start offset. XZ Embedded doesn't support this file. To
recreate the file, compress using XZ Utils 5.4.x (newer may or may
not work too):

    ./debug/testfilegen-arm64 \
        | xz -T1 -Ccrc32 --arm64=start=4294963200 \
            --lzma2=dict=64KiB,lp=2,lc=2 \
        > good-1-arm64-lzma2-2.xz

good-1-lzma2-1.xz has two LZMA2 chunks, of which the second sets new
properties.

good-1-lzma2-2.xz has two LZMA2 chunks, of which the second resets
the state without specifying new properties.

good-1-lzma2-3.xz has two LZMA2 chunks, of which the first is
uncompressed and the second is LZMA. The first chunk resets
dictionary and the second sets new properties.

good-1-lzma2-4.xz has three LZMA2 chunks: First is LZMA, second is
uncompressed with dictionary reset, and third is LZMA with new
properties but without dictionary reset.

good-1-lzma2-5.xz has an empty LZMA2 stream with only the end of
payload marker. XZ Utils 5.0.1 and older incorrectly see this file as
corrupt.

good-1-3delta-lzma2.xz has three Delta filters and LZMA2.

good-1-empty-bcj-lzma2.xz has an empty Block that uses PowerPC BCJ
and LZMA2. liblzma from XZ Utils 5.0.1 and older may incorrectly
return LZMA_BUF_ERROR in some cases.
See commit message d8db706acb8316f9861abd432cfbe001dd6d0c5c for the
details.


2.2. Unsupported Files

unsupported-check.xz uses Check ID 0x02 which isn't supported by the
current version of the file format. It is implementation-defined how
this file is handled (it may reject it, or decode it possibly with a
warning).

unsupported-block_header.xz has a non-null byte in Header Padding,
which may indicate presence of a new unsupported field.

unsupported-filter_flags-1.xz has unsupported Filter ID 0x7F.

unsupported-filter_flags-2.xz specifies only Delta filter in the
List of Filter Flags, but Delta isn't allowed as the last filter in
the chain. It could be a little more correct to detect this file as
corrupt instead of unsupported, but saying it is unsupported is
simpler in case of liblzma.

unsupported-filter_flags-3.xz specifies two LZMA2 filters in the
List of Filter Flags. LZMA2 is allowed only as the last filter in the
chain. It could be a little more correct to detect this file as
corrupt instead of unsupported, but saying it is unsupported is
simpler in case of liblzma.


2.3. Bad Files

bad-0pad-empty.xz has one Stream with no Blocks followed by
five-byte Stream Padding. Stream Padding must be a multiple of four
bytes, thus this file is corrupt.

bad-0catpad-empty.xz has two zero-Block Streams concatenated with
five-byte Stream Padding between the Streams.

bad-0cat-alone.xz is good-0-empty.xz concatenated with an empty
LZMA_Alone file.

bad-0cat-header_magic.xz is good-0cat-empty.xz but with one byte
wrong in the Header Magic Bytes field of the second Stream. liblzma
gives LZMA_DATA_ERROR for this. (LZMA_FORMAT_ERROR is used only if
the first Stream of a file has invalid Header Magic Bytes.)

bad-0-header_magic.xz is good-0-empty.xz but with one byte wrong in
the Header Magic Bytes field. liblzma gives LZMA_FORMAT_ERROR for
this.

bad-0-footer_magic.xz is good-0-empty.xz but with one byte wrong in
the Footer Magic Bytes field. liblzma gives LZMA_DATA_ERROR for this.
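
The Stream Header cases described above (wrong Header Magic Bytes in
the first or a later Stream, and an unsupported Check ID) can be
sketched in Python as follows. This is only an illustration and not
liblzma code: the function names and the outcome labels are this
sketch's own, and it assumes the Stream Header layout of six magic
bytes, two Stream Flags bytes, and a little-endian CRC32 of the Stream
Flags.

```python
import struct
import zlib

HEADER_MAGIC = b"\xfd7zXZ\x00"              # FD 37 7A 58 5A 00
KNOWN_CHECK_IDS = {0x00, 0x01, 0x04, 0x0A}  # None, CRC32, CRC64, SHA-256


def classify_stream_header(header, first_stream=True):
    """Classify a 12-byte Stream Header (hypothetical helper).

    Returns "truncated", "format-error", "data-error", "unsupported",
    or "ok". The labels are made up for this sketch.
    """
    if len(header) < 12:
        return "truncated"
    magic, flags, stored_crc = header[:6], header[6:8], header[8:12]
    if magic != HEADER_MAGIC:
        # Only a bad magic in the first Stream is a format error;
        # in later Streams it is treated as corrupt data.
        return "format-error" if first_stream else "data-error"
    # The CRC32 covers only the two Stream Flags bytes.
    if struct.pack("<I", zlib.crc32(flags)) != stored_crc:
        return "data-error"
    # The first flags byte and the high nibble of the second one
    # are reserved and must be zero.
    if flags[0] != 0x00 or flags[1] & 0xF0:
        return "unsupported"
    if flags[1] not in KNOWN_CHECK_IDS:
        return "unsupported"
    return "ok"


def make_header(check_id, magic=HEADER_MAGIC):
    """Build a Stream Header with a valid CRC32 for experiments."""
    flags = bytes([0x00, check_id])
    return magic + flags + struct.pack("<I", zlib.crc32(flags))
```

For example, classify_stream_header(make_header(0x02)) reports
"unsupported", which corresponds to the situation in
unsupported-check.xz.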

bad-0-empty-truncated.xz is good-0-empty.xz without the last byte of
the file.

bad-0-nonempty_index.xz has no Blocks but Index claims that there is
one Block.

bad-0-backward_size.xz has wrong Backward Size in Stream Footer.

bad-1-stream_flags-1.xz has different Stream Flags in Stream Header
and Stream Footer.

bad-1-stream_flags-2.xz has wrong CRC32 in Stream Header.

bad-1-stream_flags-3.xz has wrong CRC32 in Stream Footer.

bad-1-vli-1.xz has two-byte variable-length integer in the
Uncompressed Size field in Block Header while one-byte would be
enough for that value. It's important that the file gets rejected due
to too big integer encoding instead of due to Uncompressed Size not
matching the value stored in the Block Header. That is, the decoder
must not try to decode the Compressed Data field.

bad-1-vli-2.xz has ten-byte variable-length integer as Uncompressed
Size in Block Header. It's important that the file gets rejected due
to too big integer encoding instead of due to Uncompressed Size not
matching the value stored in the Block Header. That is, the decoder
must not try to decode the Compressed Data field.

bad-1-block_header-1.xz has Block Header that ends in the middle of
the Filter Flags field.

bad-1-block_header-2.xz has Block Header that has Compressed Size and
Uncompressed Size but no List of Filter Flags field.

bad-1-block_header-3.xz has wrong CRC32 in Block Header.

bad-1-block_header-4.xz has too big Compressed Size in Block Header
(2^63 - 1 bytes while maximum is a little less, because the whole
Block must stay smaller than 2^63). It's important that the file gets
rejected due to invalid Compressed Size value; the decoder must not
try decoding the Compressed Data field.

bad-1-block_header-5.xz has zero as Compressed Size in Block Header.

bad-1-block_header-6.xz has corrupt Block Header which may crash
xz -lvv in XZ Utils 5.0.3 and earlier. It was fixed in the commit
c0297445064951807803457dca1611b3c47e7f0f.
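
The rejection rules behind bad-1-vli-1.xz and bad-1-vli-2.xz can be
sketched with a small variable-length integer decoder. The sketch
assumes the usual .xz encoding of seven value bits per byte, least
significant group first, with the high bit as a continuation flag and
at most nine bytes. The function name and the example byte strings
are made up for illustration and are not taken from the test files.

```python
VLI_BYTES_MAX = 9  # 9 * 7 = 63 bits, so values stay below 2**63


def decode_vli(buf, pos=0):
    """Decode one variable-length integer starting at buf[pos].

    Returns (value, next_pos). Raises ValueError for truncated input,
    encodings longer than nine bytes, and non-minimal encodings.
    """
    value = 0
    for i in range(VLI_BYTES_MAX):
        if pos + i >= len(buf):
            raise ValueError("truncated integer")
        byte = buf[pos + i]
        # A zero byte after the first one adds nothing to the value,
        # so a shorter encoding exists: reject it.
        if i > 0 and byte == 0x00:
            raise ValueError("non-minimal encoding")
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    # The ninth byte still had the continuation bit set.
    raise ValueError("encoding longer than nine bytes")
```

With this decoder, b"\x05" decodes to 5, while the two-byte form
b"\x85\x00" of the same value is rejected before any size comparison
or decoding of Compressed Data would take place.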

bad-2-index-1.xz has wrong Unpadded Sizes in Index.

bad-2-index-2.xz has wrong Uncompressed Sizes in Index.

bad-2-index-3.xz has non-null byte in Index Padding.

bad-2-index-4.xz has wrong CRC32 in Index.

bad-2-index-5.xz has zero as Unpadded Size. It is important that the
file gets rejected specifically due to Unpadded Size having an
invalid value.

bad-3-index-uncomp-overflow.xz has Index whose Uncompressed Size
fields have huge values whose sum exceeds the maximum allowed size of
2^63 - 1 bytes. In this file the sum is exactly 2^64.
lzma_index_append() in liblzma <= 5.2.6 lacks the integer overflow
check for the uncompressed size and thus doesn't catch the error when
decoding the Index field in this file. As a result, "xz -l" doesn't
detect the error and displays 0 as the uncompressed size. Note that
regular decompression isn't affected by this bug because it uses
lzma_index_hash_append() instead.

bad-2-compressed_data_padding.xz has non-null byte in the padding of
the Compressed Data field of the first Block.

bad-1-check-crc32.xz has wrong Check (CRC32).

bad-1-check-crc32-2.xz has Compressed Size and Uncompressed Size in
Block Header but wrong Check (CRC32) in the actual data. This file
differs by one byte from good-1-block_header-1.xz: the last byte of
the Check field is wrong. This file is useful for testing error
detection in the threaded decoder when a worker thread is configured
to pass input one byte at a time to the Block decoder.

bad-1-check-crc64.xz has wrong Check (CRC64).

bad-1-check-sha256.xz has wrong Check (SHA-256).

bad-1-lzma2-1.xz has LZMA2 stream whose first chunk (uncompressed)
doesn't reset the dictionary.

bad-1-lzma2-2.xz has two LZMA2 chunks, of which the second chunk
indicates dictionary reset, but the LZMA compressed data tries to
repeat data from the previous chunk.

bad-1-lzma2-3.xz sets new invalid properties (lc=8, lp=0, pb=0) in
the middle of Block.
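
To show why lc=8, lp=0, pb=0 is invalid in bad-1-lzma2-3.xz, the
sketch below decodes an LZMA properties byte under the LZMA2 rule that
lc + lp must not exceed 4. It assumes the common encoding
(pb * 5 + lp) * 9 + lc for the properties byte. The function names
are hypothetical and do not correspond to any liblzma API.

```python
LCLP_MAX = 4                        # LZMA2 requires lc + lp <= 4
PROPS_MAX = (4 * 5 + 4) * 9 + 8     # pb=4, lp=4, lc=8 gives 224


def encode_lzma_props(lc, lp, pb):
    """Pack lc, lp, and pb into a single properties byte."""
    return (pb * 5 + lp) * 9 + lc


def decode_lzma2_props(byte):
    """Split a properties byte into (lc, lp, pb).

    Raises ValueError when the byte is out of range or when lc + lp
    exceeds the limit that LZMA2 imposes.
    """
    if not 0 <= byte <= PROPS_MAX:
        raise ValueError("properties byte out of range")
    lc = byte % 9
    lp = (byte // 9) % 5
    pb = byte // 45
    if lc + lp > LCLP_MAX:
        raise ValueError(f"lc + lp = {lc + lp} exceeds {LCLP_MAX}")
    return lc, lp, pb
```

Here encode_lzma_props(8, 0, 0) yields the byte value 8, which is in
range for plain LZMA but fails the lc + lp check, so
decode_lzma2_props(8) raises ValueError.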

bad-1-lzma2-4.xz has two LZMA2 chunks, of which the first is
uncompressed and the second is LZMA. The first chunk resets
dictionary as it should, but the second chunk tries to reset state
without specifying properties for LZMA.

bad-1-lzma2-5.xz is like bad-1-lzma2-4.xz but doesn't try to reset
anything in the header of the second chunk.

bad-1-lzma2-6.xz has reserved LZMA2 control byte value (0x03).

bad-1-lzma2-7.xz has EOPM at LZMA level.

bad-1-lzma2-8.xz is like good-1-lzma2-4.xz but doesn't set new
properties in the third LZMA2 chunk.

bad-1-lzma2-9.xz has LZMA2 stream that is truncated at the end of an
LZMA2 chunk (no end marker). The uncompressed size of the partial
LZMA2 stream exceeds the value stored in the Block Header.

bad-1-lzma2-10.xz has LZMA2 stream that, from the point of view of an
LZMA2 decoder, extends past the end of Block (and even the end of the
file). Uncompressed Size in Block Header is bigger than the invalid
LZMA2 stream may produce (even if a decoder reads until the end of
the file). The Check type is None to nullify certain simple
size-based sanity checks in a Block decoder.

bad-1-lzma2-11.xz has LZMA2 stream that lacks the end of payload
marker. When Compressed Size bytes have been decoded, Uncompressed
Size bytes of output will have been produced but the LZMA2 decoder
doesn't indicate end of stream.


3. Descriptions of Individual .lzma Files

3.1. Good Files

good-unknown_size-with_eopm.lzma has unknown size in the header and
end of payload marker at the end.

good-known_size-without_eopm.lzma has a known size in the header and
no end of payload marker at the end.

good-known_size-with_eopm.lzma has a known size in the header and end
of payload marker at the end. XZ Utils 5.2.5 and older will give an
error at the end of the file after producing the correct uncompressed
output.


3.2. Bad Files

bad-unknown_size-without_eopm.lzma has unknown size in the header but
no end of payload marker at the end.
This file might be seen by a decoder as if it were truncated.

bad-too_big_size-with_eopm.lzma has too big uncompressed size in the
header and the end of payload marker will be detected before the
specified number of bytes have been decoded.

bad-too_small_size-without_eopm-1.lzma has too small uncompressed
size in the header. The decoder will look for end of payload marker
but instead find a literal that would produce more output.

bad-too_small_size-without_eopm-2.lzma is like -1 above but instead
of a literal the problem occurs with a short repeated match.

bad-too_small_size-without_eopm-3.lzma is like -1 above but instead
of a literal the problem occurs in the middle of a match.


4. Descriptions of Individual .lz (lzip) Files

4.1. Good Files

good-1-v0.lz contains a single version 0 member. lzip 1.17 and
*older* can decompress this; support for version 0 was removed in
lzip 1.18.

good-1-v0-trailing-1.lz is like good-1-v0.lz but contains trailing
data that the decompressor must ignore.

good-1-v1.lz contains a single version 1 member. lzip 1.3 and newer
can decompress this.

good-1-v1-trailing-1.lz is like good-1-v1.lz but contains trailing
data that the decompressor must ignore.

good-1-v1-trailing-2.lz is like good-1-v1.lz but contains trailing
data whose first three bytes match the .lz magic bytes. With
lzip >= 1.20 this file results in an error unless one uses the
command line option --loose-trailing. lzip 1.3 to 1.19 decode this
file successfully by default. XZ Utils uses the old behavior because
it allows lzma_code() to stop at the first byte of the trailing data
as long as the first byte isn't 0x4C (L in US-ASCII); otherwise the
first 1-3 bytes that are equal to the magic bytes are consumed and
lost in lzma_code(), and this is visible in xz too:

    $ ( xz -dc ; cat ) < good-1-v1-trailing-2.lz
    Hello World!
    Trailing garbage

    $ ( xz -dc --single-stream ; cat ) < good-1-v1-trailing-2.lz
    Hello World!
    LZITrailing garbage

good-2-v0-v1.lz contains two members of which the first is version 0
and the second version 1. lzip versions 1.3 to 1.17 (inclusive) can
decompress this.

good-2-v1-v0.lz contains two members of which the first is version 1
and the second version 0. lzip versions 1.3 to 1.17 (inclusive) can
decompress this.

good-2-v1-v1.lz contains two version 1 members. lzip versions 1.3
and newer can decompress this.


4.2. Unsupported Files

unsupported-1-v234.lz is like good-1-v1.lz except the version field
has been set to 234 (0xEA) which, as of writing, isn't defined or
supported by any .lz implementation.


4.3. Bad Files

bad-1-v1-magic-1.lz is like good-1-v1.lz but the first magic byte is
wrong.

bad-1-v1-magic-2.lz is like good-1-v1.lz but the last (fourth) magic
byte is wrong.

bad-1-v1-dict-1.lz has too low value in the dictionary size field.

bad-1-v1-dict-2.lz has too high value in the dictionary size field.

bad-1-v1-crc32.lz has wrong CRC32 value.

bad-1-v0-uncomp-size.lz is version 0 format with incorrect value in
the uncompressed size field.

bad-1-v1-uncomp-size.lz is version 1 format with incorrect value in
the uncompressed size field.

bad-1-v1-member-size.lz has incorrect value in the member size
field.

bad-1-v1-trailing-magic.lz has the four .lz magic bytes as trailing
data. This should be detected as a truncated file and thus result in
an error. That is, the last four bytes of the file should not be
ignored as trailing garbage. lzip >= 1.18 matches this behavior while
older versions ignore the last four bytes and don't indicate an
error.
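
The header-level problems in the .lz files above (wrong magic bytes,
an undefined version, and an out-of-range dictionary size) can be
sketched as a small header check. The sketch assumes the six-byte
lzip member header of four magic bytes, one version byte, and one
coded dictionary size byte whose low five bits give the base-2
logarithm of a base size (12 to 29) and whose high three bits give the
number of sixteenths to subtract from it. The function names and
outcome labels are hypothetical.

```python
LZIP_MAGIC = b"LZIP"  # 4C 5A 49 50


def decode_dict_size(ds):
    """Decode the coded dictionary size byte; None if out of range."""
    base_log2 = ds & 0x1F   # bits 4-0: log2 of the base size
    fraction = ds >> 5      # bits 7-5: sixteenths to subtract
    if base_log2 < 12 or base_log2 > 29:
        return None
    base = 1 << base_log2
    size = base - fraction * (base // 16)
    if size < 4096:         # below the 4 KiB minimum
        return None
    return size


def classify_lzip_header(data):
    """Return "truncated", "bad-magic", "unsupported",
    "bad-dict-size", or "ok" for the start of a .lz member."""
    if len(data) < 6:
        return "truncated"
    if data[:4] != LZIP_MAGIC:
        return "bad-magic"
    if data[4] not in (0, 1):   # only versions 0 and 1 are defined
        return "unsupported"
    if decode_dict_size(data[5]) is None:
        return "bad-dict-size"
    return "ok"
```

For instance, a header whose version byte is 0xEA is reported as
"unsupported", which matches the intent of unsupported-1-v234.lz. The
trailing-data rules described in section 4.1 are not covered by this
sketch.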