xz/doc/liblzma-advanced.txt


Advanced features of liblzma
----------------------------

0. Introduction

    Most developers need only the basic features of liblzma. These
    features allow single-threaded encoding and decoding of .lzma files
    in streamed mode.

    In some cases developers want more. The .lzma file format is
    designed to allow multi-threaded encoding and decoding and limited
    random-access reading. These features are possible in non-streamed
    mode and, to a limited extent, also in streamed mode.

    To take advantage of these features, the application needs a custom
    .lzma file format handler. liblzma provides a set of tools to ease
    this task, but writing a good custom .lzma handler still takes
    quite a bit of work.


1. Where to begin

    Start by reading the .lzma file format specification. Understanding
    the basics of the .lzma file structure is required to implement a
    custom .lzma file handler and to understand the rest of this document.


2. The basic components

2.1. Stream Header and tail

    Stream Header begins the .lzma Stream and Stream tail ends it. Stream
    Header is defined in the file format specification, but Stream tail
    isn't (thus I write "tail" with a lower-case letter). Stream tail is
    simply the Stream Flags and the Footer Magic Bytes fields together.
    It was done this way in liblzma because the Block coders take care
    of the rest of the Stream Footer.

    For now, the size of Stream Header is fixed to 11 bytes. The header
    <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
    should use instead of a hardcoded number. Similarly, Stream tail
    is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.

    It is possible that a future version of the .lzma format will have
    a variable-sized Stream Header and tail. As of writing, this seems
    so unlikely that it was considered simplest to just use constants
    instead of providing functions to get and store the sizes of the
    Stream Header and tail.


2.x. Stream tail

    For now, the size of Stream tail is fixed to 3 bytes. The header
    <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
    should use instead of a hardcoded number.


3. Keeping track of size information

    The lzma_info_* functions found in <lzma/info.h> should ease the
    task of keeping track of sizes of the Blocks and also the Stream
    as a whole. Using these functions is strongly recommended, because
    there are surprisingly many situations where an error can occur,
    and these functions check for possible errors every time some new
    information becomes available.

    If you find lzma_info_* functions lacking something that you would
    find useful, please contact the author.


3.1. Start offset of the Stream

    If you are storing the .lzma Stream inside another file format, or
    for some other reason are placing the .lzma Stream somewhere other
    than the beginning of the file, you should set the starting offset
    of the Stream using lzma_info_start_offset_set().

    The start offset of the Stream is used for two distinct purposes.
    First, knowing the start offset of the Stream allows
    lzma_info_alignment_get() to correctly calculate the alignment of
    every Block. This information is given to the Block encoder, which
    will calculate the size of Header Padding so that Compressed Data
    is aligned at an optimal offset.
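
    The padding arithmetic itself is simple. The following is only a
    minimal sketch of the idea, not the calculation liblzma performs
    internally; header_padding_for() is a hypothetical helper name, and
    the sketch ignores any upper limit the file format may place on the
    size of Header Padding.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma. Returns the number of
 * padding bytes needed so that (offset + padding) is a multiple of
 * alignment. offset is the absolute offset at which Compressed Data
 * would start without padding, with the start offset of the Stream
 * already added in.
 */
static uint32_t
header_padding_for(uint64_t offset, uint32_t alignment)
{
    assert(alignment != 0);

    uint64_t misalignment = offset % alignment;
    return misalignment == 0 ? 0 : (uint32_t)(alignment - misalignment);
}
```

    For example, if Compressed Data would start at offset 13 and the
    preferred alignment is four bytes, three bytes of padding move it
    to offset 16.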

    Another use for the start offset of the Stream is in random-access
    reading. If you set the start offset of the Stream, lzma_info_locate()
    will be able to calculate the offset relative to the beginning of the
    file containing the Stream (instead of offset relative to the
    beginning of the Stream).


3.2. Size of Stream Header

    While the size of Stream Header is constant (11 bytes) in the current
    version of the .lzma file format, this may change in the future.


3.3. Size of Header Metadata Block

    This information is needed when doing random-access reading, and
    to verify the value of this field stored in Footer Metadata Block.


3.4. Total Size of the Data Blocks


3.5. Uncompressed Size of Data Blocks


3.6. Index


x. Alignment

    There are a few slightly different types of alignment issues when
    working with .lzma files.

    The .lzma format doesn't strictly require any kind of alignment.
    However, if the encoder carefully optimizes the alignment in all
    situations, it can improve compression ratio, speed of the encoder
    and decoder, and slightly help if the files get damaged and need
    recovery.

    Alignment has the most significant effect on compression ratio. FIXME


x.1. Compression ratio

    Some filters take advantage of the alignment of the input data.
    To get the best compression ratio, make sure that you feed these
    filters correctly aligned data.

    Some filters (e.g. LZMA) don't necessarily mind too much if the
    input doesn't match the preferred alignment. With these filters
    the penalty in compression ratio depends on the specific type of
    data being compressed.

    Other filters (e.g. PowerPC executable filter) won't work at all
    with data that is improperly aligned. While the data can still
    be de-filtered back to its original form, the benefit of the
    filtering (better compression ratio) is completely lost, because
    these filters expect certain patterns at properly aligned offsets.
    The compression ratio may even be worse with incorrectly aligned
    input than without the filter.


x.1.1. Inter-filter alignment

    When there are multiple filters chained, checking the alignment can
    be useful not only with the input of the first filter and output of
    the last filter, but also between the filters.

    Inter-filter alignment is especially important with the Subblock
    filter.


x.1.2. Further compression with external tools

    This is a relatively rare situation in practice, but still worth
    understanding.

    Let's say that there are several SPARC executables, which are each
    filtered to separate .lzma files using only the SPARC filter. If
    Uncompressed Size is written to the Block Header, the size of Block
    Header may vary between the .lzma files. If no Padding is used in
    the Block Header to correct the alignment, the starting offset of
    the Compressed Data field will be differently aligned in different
    .lzma files.

    All these .lzma files are archived into a single .tar archive. Due
    to the nature of the .tar format, every file is aligned inside the
    archive to an offset that is a multiple of 512 bytes.

    The .tar archive is compressed into a new .lzma file using the LZMA
    filter with options that prefer input alignment of four bytes. Now
    if the independent .lzma files don't have the same alignment of
    the Compressed Data fields, the LZMA filter will be unable to take
    advantage of the input alignment between the files in the .tar
    archive, which reduces compression ratio.

    Thus, even if you have only a single Block per file, it can be good
    for compression ratio to align the Compressed Data to an optimal
    offset.
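
    The .tar scenario can be checked with a few lines of arithmetic.
    The sketch below is illustrative only: the function name is
    hypothetical, and the sizes passed to it stand for whatever number
    of bytes precedes the Compressed Data field in each .lzma file.
    Because every .tar member starts at a multiple of 512 bytes, and
    512 is a multiple of four, the alignment of Compressed Data inside
    the archive depends only on that per-file prefix size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check, not part of liblzma. prefix_sizes[i] is the
 * number of bytes preceding the Compressed Data field in the i'th
 * .lzma file (Stream Header plus Block Header, padding included).
 * Returns true if Compressed Data has the same alignment in every
 * file once the files are stored in a .tar archive.
 */
static bool
compressed_data_equally_aligned(const uint64_t *prefix_sizes,
        size_t count, uint32_t alignment)
{
    /* The shortcut is valid only if alignment divides 512. */
    assert(alignment != 0 && 512 % alignment == 0);

    for (size_t i = 1; i < count; ++i)
        if (prefix_sizes[i] % alignment
                != prefix_sizes[0] % alignment)
            return false;

    return true;
}
```

    With made-up prefix sizes of 13, 14, and 13 bytes the check fails
    for four-byte alignment; padding every Block Header so that the
    prefix is, say, 16 bytes makes it pass.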


x.2. Speed

    Most modern computers are faster when multi-byte data is located
    at aligned offsets in RAM. Proper alignment of the Compressed Data
    fields can slightly increase the speed of some filters.


x.3. Recovery

    Aligning every Block Header to start at an offset with big enough
    alignment may ease or at least speed up recovery of broken files.


y. Typical usage cases

y.x. Parsing the Stream backwards

    You may need to parse the Stream backwards if you need to get
    information such as the sizes of the Stream, Index, or Extra.
    The basic procedure to do this follows.

    Locate the end of the Stream. If the Stream is stored as is in a
    standalone .lzma file, simply seek to the end of the file and start
    reading backwards using an appropriate buffer size. The file format
    specification allows an arbitrary amount of Footer Padding (zero or
    more NUL bytes), which you skip before trying to decode the Stream
    tail.
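
    Skipping the Footer Padding in one buffer can be sketched as
    follows. skip_footer_padding() is a hypothetical helper, not a
    liblzma function, and it handles only a single in-memory buffer;
    reading the file backwards in chunks is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of liblzma. buf holds the last size
 * bytes read from the file. Returns the number of bytes that remain
 * once the trailing NUL bytes (Footer Padding) are skipped. A return
 * value of zero means the buffer held only padding, so an earlier
 * part of the file has to be read next.
 */
static size_t
skip_footer_padding(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    return size;
}
```

    The returned count, added to the file offset of the first byte of
    the buffer, gives the offset just past the last byte of the Stream.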

    Once you have located the end of the Stream (a non-NUL byte), make
    sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
    Stream in a buffer. If there aren't enough bytes left in the file,
    the file is too small to contain a valid Stream. Decode the Stream
    tail using lzma_stream_tail_decoder(). Store the offset of the first
    byte of the Stream tail; you will need it later.

    You may now want to do some internal verifications, e.g. check that
    the Check type is supported by the liblzma build you are using.

    Decode the Backward Size field with lzma_vli_reverse_decode(). The
    field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
    Backward Size is not zero. Store the offset of the first byte of
    the Backward Size field; you will need it later.

    Now you know the Total Size of the last Block of the Stream. It's the
    value of Backward Size plus the size of the Backward Size field. Note
    that you cannot use lzma_vli_size() to calculate the size since there
    might be padding; you need to use the real observed size of the
    Backward Size field.

    At this point, the operation continues differently for Single-Block
    and Multi-Block Streams.


y.x.1. Single-Block Stream

    There might be an Uncompressed Size field present in the Stream
    Footer. You cannot know it for sure unless you have already parsed
    the Block Header earlier. For security reasons, you probably want to
    try to decode the Uncompressed Size field, but you must not indicate
    any error if decoding fails. Later you can give the decoded
    Uncompressed Size to the Block decoder if Uncompressed Size isn't
    otherwise known; this prevents it from producing too much output in
    case of a (possibly intentionally) corrupt file.

    Calculate the start offset of the Stream:

        backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE

    backward_offset is the offset of the first byte of the Backward Size
    field. Remember to check for integer overflows, which can occur with
    invalid input files.

    Seek to the beginning of the Stream. Decode the Stream Header using
    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
    match the values found in the Stream tail. You can use the
    lzma_stream_flags_is_equal() macro for this.

    Decode the Block Header. Verify that it isn't a Metadata Block, since
    Single-Block Streams cannot have Metadata. If Uncompressed Size is
    present in the Block Header, the value you tried to decode from the
    Stream Footer must be ignored, since Uncompressed Size wasn't actually
    present there. If Block Header doesn't have Uncompressed Size, and
    decoding the Uncompressed Size field from the Stream Footer failed,
    the file is corrupt.

    If you were only looking for the Uncompressed Size of the Stream,
    you now have that information, and you can stop processing the
    Stream.

    To decode the Block, the same instructions apply as described in
    FIXME. However, because you have some extra known information decoded
    from the Stream Footer, you should give this information to the Block
    decoder so that it can verify it while decoding:
      - If Uncompressed Size is not present in the Block Header, set
        lzma_options_block.uncompressed_size to the value you decoded
        from the Stream Footer.
      - Always set lzma_options_block.total_size to backward_size +
        size_of_backward_size (you calculated this sum earlier already).


y.x.2. Multi-Block Stream

    Calculate the start offset of the Footer Metadata Block:

        backward_offset - backward_size

    backward_offset is the offset of the first byte of the Backward Size
    field. Remember to check for integer overflows, which can occur with
    broken input files.

    Decode the Block Header. Verify that it is a Metadata Block. Set
    lzma_options_block.total_size to backward_size + size_of_backward_size
    (you calculated this sum earlier already). Then decode the Footer
    Metadata Block.

    Store the decoded Footer Metadata in the lzma_info structure using
    lzma_info_set_metadata(). Also set the offset of the Backward Size
    field using lzma_info_size_set(). Then you can get the start offset
    of the Stream using lzma_info_size_get(). Note that any of these
    steps may fail, so don't omit error checking.

    Seek to the beginning of the Stream. Decode the Stream Header using
    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
    match the values found in the Stream tail. You can use the
    lzma_stream_flags_is_equal() macro for this.

    If you were only looking for the Uncompressed Size of the Stream,
    it's possible that you already have it now. If Uncompressed Size (or
    whatever information you were looking for) isn't available yet,
    continue by also decoding the Header Metadata Block. (If some
    information is missing, the Header Metadata Block has to be present.)

    Decoding the Data Blocks goes the same way as described in FIXME.


y.x.3. Variations

    If you know the offset of the beginning of the Stream, you may want
    to parse the Stream Header before parsing the Stream tail.