xz/doc/liblzma-intro.txt

2007-12-08 22:42:33 +00:00
Introduction to liblzma
-----------------------
Writing applications to work with liblzma
The liblzma API is split into several subheaders to improve readability
and maintenance. The subheaders must not be #included directly. lzma.h
requires that certain integer types and macros are available when
the header is #included. On systems that have a C99-conforming
inttypes.h, the following will work:

    #include <sys/types.h>
    #include <inttypes.h>
    #include <lzma.h>

Those who have used zlib should find liblzma's API easy to use.
To developers who haven't used zlib before, I recommend learning
zlib first, because zlib has excellent documentation.
While the API is similar to that of zlib, there are some major
differences, which are summarized below.
For basic stream encoding, zlib has three functions (deflateInit(),
deflate(), and deflateEnd()). Similarly, there are three functions
for stream decoding (inflateInit(), inflate(), and inflateEnd()).
liblzma has only a single coding function and a single ending
function. Thus, to encode, one may use, for example,
lzma_stream_encoder_single(), lzma_code(), and lzma_end(). Similarly,
for decoding, one may use lzma_auto_decoder(), lzma_code(), and
lzma_end().
zlib has deflateReset() and inflateReset() to reset the stream
structure without reallocating all the memory. In liblzma, all
coder initialization functions are like zlib's reset functions:
the first-time initializations are done with the same functions
as the reinitializations (resetting).
To make all this work, liblzma needs to know when an lzma_stream
doesn't already point to an allocated and initialized coder.
This is achieved by initializing the lzma_stream structure with
LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR
(for example, when a new lzma_stream has been allocated with malloc()).
This initialization should be done exactly once per lzma_stream
structure to avoid leaking memory. Calling lzma_end() will leave
the lzma_stream in a state comparable to the state achieved with
LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR.
An example probably clarifies a lot. With zlib, compression goes
roughly like this:

    z_stream strm;
    deflateInit(&strm, level);
    deflate(&strm, Z_NO_FLUSH);
    deflate(&strm, Z_NO_FLUSH);
    ...
    deflate(&strm, Z_FINISH);
    deflateEnd(&strm);  /* or deflateReset(&strm) */

With liblzma, it's slightly different:

    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_stream_encoder_single(&strm, &options);
    lzma_code(&strm, LZMA_RUN);
    lzma_code(&strm, LZMA_RUN);
    ...
    lzma_code(&strm, LZMA_FINISH);
    lzma_end(&strm);  /* or reinitialize for new coding work */

Reinitialization in the last step can be done with any function that
can initialize an lzma_stream; it doesn't need to be the same function
that was used for the previous initialization. If it is the same
function, liblzma will usually be able to re-use most of the
existing memory allocations (this depends on how much the
initialization options change). If you reinitialize with a different
function, liblzma will automatically free the memory of the previous
coder.
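
The reuse rule can be pictured with a toy model. The sketch below is
not liblzma code: toy_stream, toy_init(), toy_end(), and
TOY_STREAM_INIT are hypothetical names made up for this illustration,
and the real library tracks far more state. It only shows the idea
that one init function serves as both first-time initialization and
reset, keeping the allocation when the coder type is unchanged and
freeing it when the type changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy types; none of these names exist in liblzma. */
typedef enum { TOY_NONE, TOY_ENCODER, TOY_DECODER } toy_kind;

typedef struct {
    toy_kind kind;    /* which coder currently owns `state` */
    void *state;      /* coder-private memory, NULL if there is none */
    unsigned allocs;  /* allocations made so far (for demonstration) */
} toy_stream;

/* Plays the role of a static initializer: "no coder allocated yet". */
#define TOY_STREAM_INIT { TOY_NONE, NULL, 0 }

/* First-time initialization and reset are the same function. */
int toy_init(toy_stream *s, toy_kind kind)
{
    assert(s != NULL && kind != TOY_NONE);

    /* A different coder type cannot reuse the old state: free it. */
    if (s->state != NULL && s->kind != kind) {
        free(s->state);
        s->state = NULL;
    }

    /* Allocate only if there is nothing to reuse. */
    if (s->state == NULL) {
        s->state = malloc(64);
        if (s->state == NULL) {
            s->kind = TOY_NONE;
            return -1;
        }
        s->allocs++;
    }

    s->kind = kind;
    return 0;
}

/* Frees everything and returns to the TOY_STREAM_INIT-like state. */
void toy_end(toy_stream *s)
{
    assert(s != NULL);
    free(s->state);
    s->state = NULL;
    s->kind = TOY_NONE;
}
```

Calling toy_init() twice with TOY_ENCODER leaves allocs at 1, while
switching to TOY_DECODER frees the old state and raises allocs to 2.
Without TOY_STREAM_INIT the state pointer would hold garbage, and
toy_init() could not tell whether there is something to free.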
File formats
liblzma supports multiple container formats for the compressed data.
Different initialization functions initialize the lzma_stream to
process different container formats. See the public header files
for details.
The following functions are the most commonly used:

  - lzma_stream_encoder_single(): Encodes a Single-Block Stream; this
    is the recommended format for most purposes.

  - lzma_alone_encoder(): Useful if you need to encode into the
    legacy LZMA_Alone format.

  - lzma_auto_decoder(): Decoder that automatically detects the
    file format; recommended when you decode compressed files on
    disk, because this way compatibility with the legacy LZMA_Alone
    format is transparent.

  - lzma_stream_decoder(): Decoder for Single- and Multi-Block
    Streams; this is good if you want to accept only .lzma Streams.

Filters
liblzma supports multiple filters (algorithm implementations). The new
.lzma format supports a filter chain of up to seven filters. In the
filter chain, the output of one filter is the input of the next filter
in the chain. The legacy LZMA_Alone format supports only one filter,
and that must always be LZMA.
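
The chaining rule can be sketched in a few lines of C. This is a toy
under stated assumptions, not how liblzma implements filters:
toy_filter, toy_add_one(), toy_invert(), toy_run_chain(), and
TOY_FILTERS_MAX are hypothetical names, and the toy filters work in
place on a single buffer, whereas real filters may change the size of
the data. It shows only that each filter sees the output of the
previous one, so the order of the chain matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the seven-filter limit mentioned in the text. */
#define TOY_FILTERS_MAX 7

/* A toy filter transforms the buffer in place. */
typedef void (*toy_filter)(uint8_t *buf, size_t size);

/* Toy filter: add one to every byte (wraps around at 0xFF). */
void toy_add_one(uint8_t *buf, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        buf[i] = (uint8_t)(buf[i] + 1);
}

/* Toy filter: flip every bit of every byte. */
void toy_invert(uint8_t *buf, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        buf[i] = (uint8_t)(buf[i] ^ 0xFF);
}

/*
 * Run the filters in order; the output of chain[i] is the input of
 * chain[i + 1]. Returns -1 if the chain is empty or too long.
 */
int toy_run_chain(const toy_filter *chain, size_t count,
        uint8_t *buf, size_t size)
{
    assert(chain != NULL && buf != NULL);

    if (count == 0 || count > TOY_FILTERS_MAX)
        return -1;

    for (size_t i = 0; i < count; ++i)
        chain[i](buf, size);

    return 0;
}
```

Running { toy_add_one, toy_invert } on the bytes 00 01 gives FE FD,
while the reversed chain gives 00 FF.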

General-purpose compression:

    LZMA        The main algorithm of liblzma (surprise!)

Branch/Call/Jump filters for executables:

    x86         This filter is known as BCJ in 7-Zip
    IA64        IA-64 (Itanium)
    PowerPC     Big endian PowerPC
    ARM
    ARM-Thumb
    SPARC

Other filters:

    Copy        Dummy filter that simply copies all the data
                from input to output.

    Subblock    Multi-purpose filter that can
                  - embed End of Payload Marker if the previous
                    filter in the chain doesn't support it; and
                  - apply Subfilters, which filter only part
                    of the same compressed Block in the Stream.

Branch/Call/Jump filters never change the size of the data. They
should usually be used as a pre-filter for some compression filter
like LZMA.
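
To see why such a filter keeps the size unchanged yet helps the
compressor that follows it, here is a much simplified sketch of the
idea behind the x86 filter. It is a toy, not the filter liblzma ships:
toy_bcj_x86() is a hypothetical name, it handles only the CALL opcode
(0xE8), and it omits the heuristics a real filter needs to avoid
converting bytes that are not really calls. It rewrites each relative
call target into an absolute one, so that repeated calls to the same
function become identical byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert the 32-bit little endian operand of every 0xE8 (CALL)
 * from relative to absolute (is_encoder != 0) or back
 * (is_encoder == 0). The buffer size never changes.
 */
void toy_bcj_x86(uint8_t *buf, size_t size, int is_encoder)
{
    size_t i = 0;

    assert(buf != NULL || size == 0);

    while (size >= 5 && i <= size - 5) {
        if (buf[i] != 0xE8) {
            ++i;
            continue;
        }

        /* Read the operand that follows the opcode. */
        uint32_t addr = (uint32_t)buf[i + 1]
                | ((uint32_t)buf[i + 2] << 8)
                | ((uint32_t)buf[i + 3] << 16)
                | ((uint32_t)buf[i + 4] << 24);

        /* Relative targets count from the next instruction. */
        uint32_t next = (uint32_t)(i + 5);
        addr = is_encoder ? addr + next : addr - next;

        /* Write the converted operand back in place. */
        buf[i + 1] = (uint8_t)addr;
        buf[i + 2] = (uint8_t)(addr >> 8);
        buf[i + 3] = (uint8_t)(addr >> 16);
        buf[i + 4] = (uint8_t)(addr >> 24);

        /* Skip the operand so it is never mistaken for an opcode. */
        i += 5;
    }
}
```

Two calls to the same target at offsets 0 and 10 have different
relative operands, but after encoding both operands hold the same
absolute address, which gives LZMA a longer match to find. Running
the function again with is_encoder set to 0 restores the original
bytes.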
Integrity checks
The .lzma Stream format uses CRC32 as the integrity check for
different file format headers. It is possible to omit CRC32 from
the Block Headers, but not from the Stream Header. This is the reason
why the CRC32 code cannot be disabled when building liblzma (in
addition, the LZMA encoder uses CRC32 for hashing, so that's another
reason).
The integrity check of the actual data is calculated from the
uncompressed data. This check can be CRC32, CRC64, or SHA256.
It can also be omitted completely, although that usually is not
a good thing to do. There are free IDs left, so support for new
check algorithms can be added later.
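
As an illustration of a check calculated over the uncompressed data,
here is a minimal bitwise CRC32 sketch. It assumes the IEEE 802.3
polynomial that zlib's crc32() also uses; check the file format
specification before relying on that. toy_crc32() is a hypothetical
name, and the bit-at-a-time loop is chosen for brevity, not speed, so
this is not how liblzma computes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Update a CRC32 with `size` bytes from `buf`. Pass 0 as `crc` for
 * the first chunk and the previous return value for later chunks,
 * so the check can follow the uncompressed data as it streams by.
 */
uint32_t toy_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
    assert(buf != NULL || size == 0);

    crc = ~crc;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];

        /* Process one bit at a time, least significant bit first. */
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }

    return ~crc;
}
```

With this polynomial, the nine ASCII bytes "123456789" give
0xCBF43926, whether they are passed in one call or split across
several.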
API and ABI stability
The API and ABI of liblzma aren't stable yet, although no huge
changes should happen. One potential place for change is the
lzma_options_subblock structure.
In the 4.42.0alpha phase, the shared library version number won't
be updated even if ABI breaks. I don't want to track the ABI changes
yet. Just rebuild everything when you upgrade liblzma until we get
to the beta stage.
Size of the library
While liblzma isn't huge, it is quite far from the smallest possible
LZMA implementation: the full liblzma binary (with support for all
filters and other features) is well over 100 KiB, but the plain raw
LZMA decoder is only 5-10 KiB.
To decrease the size of the library, you can omit parts of the library
by passing certain options to the `configure' script. Disabling
everything but the decoders of the required filters will usually give
you a small enough library, but if you need a decoder that is, for
example, embedded in an operating system kernel, the code from liblzma
probably isn't suitable as is.
If you need a minimal implementation supporting .lzma Streams, you
may need to do a partial rewrite. liblzma uses a stateful API like
zlib. That increases the size of the library. Using a callback API or
an even simpler buffer-to-buffer API would allow a smaller
implementation.
The LZMA SDK contains an LZMA decoder written in ANSI C that is
smaller than the one in liblzma, so you may want to take a look at
that code. However, it doesn't (at least not yet) support the new
.lzma Stream format.
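
The cost of a stateful API can be seen even with a trivial format.
The sketch below decodes a toy run-length format, a sequence of
(count, value) byte pairs, in both styles. All names here
(toy_rle_buffer_decode(), toy_rle_stream, toy_rle_code()) are
hypothetical and the format has nothing to do with LZMA. The point is
only that the stateful version must remember where it stopped whenever
input or output runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Buffer-to-buffer style: all input and output space is available
 * at once, so nothing needs to survive the call. Returns 0 on
 * success, -1 on truncated input or too small an output buffer.
 */
int toy_rle_buffer_decode(const uint8_t *in, size_t in_size,
        uint8_t *out, size_t out_size, size_t *out_used)
{
    size_t o = 0;

    assert(in != NULL && out != NULL && out_used != NULL);

    if (in_size % 2 != 0)
        return -1;

    for (size_t i = 0; i < in_size; i += 2) {
        if (in[i] > out_size - o)
            return -1;

        for (uint8_t n = 0; n < in[i]; ++n)
            out[o++] = in[i + 1];
    }

    *out_used = o;
    return 0;
}

/* Stateful style: the caller feeds and drains buffers piecewise. */
typedef enum { TOY_NEED_COUNT, TOY_NEED_VALUE, TOY_EMIT } toy_rle_state;

typedef struct {
    const uint8_t *next_in;   /* next input byte */
    size_t avail_in;          /* input bytes left */
    uint8_t *next_out;        /* next free output byte */
    size_t avail_out;         /* output space left */
    toy_rle_state state;      /* where decoding stopped last time */
    uint8_t remaining;        /* repeats still to be written */
    uint8_t value;            /* byte being repeated */
} toy_rle_stream;

/* Decode as much as possible; returns when a buffer runs out. */
void toy_rle_code(toy_rle_stream *s)
{
    assert(s != NULL);

    for (;;) {
        if (s->state == TOY_NEED_COUNT) {
            if (s->avail_in == 0)
                return;
            s->remaining = *s->next_in++;
            s->avail_in--;
            s->state = TOY_NEED_VALUE;
        } else if (s->state == TOY_NEED_VALUE) {
            if (s->avail_in == 0)
                return;
            s->value = *s->next_in++;
            s->avail_in--;
            s->state = TOY_EMIT;
        } else {
            while (s->remaining > 0) {
                if (s->avail_out == 0)
                    return;
                *s->next_out++ = s->value;
                s->avail_out--;
                s->remaining--;
            }
            s->state = TOY_NEED_COUNT;
        }
    }
}
```

A zero-initialized toy_rle_stream starts in TOY_NEED_COUNT. Feeding
the input 3 'a' 2 'b' one byte at a time through toy_rle_code()
produces the same five bytes as a single toy_rle_buffer_decode()
call, at the price of the extra state fields and the state machine.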
Documentation
There's no documentation yet other than the public headers and this
text. Real docs will be written some day, I hope.