Introduction to liblzma
-----------------------

Writing applications to work with liblzma

The liblzma API is split into several subheaders to improve readability
and maintainability. The subheaders must not be #included directly.
lzma.h requires that certain integer types and macros are available
when the header is #included. On systems that have an inttypes.h that
conforms to C99, the following will work:

    #include <sys/types.h>
    #include <inttypes.h>
    #include <lzma.h>

Those who have used zlib should find liblzma's API easy to use.
To developers who haven't used zlib before, I recommend learning
zlib first, because zlib has excellent documentation.

While the API is similar to that of zlib, there are some major
differences, which are summarized below.

For basic stream encoding, zlib has three functions (deflateInit(),
deflate(), and deflateEnd()). Similarly, there are three functions
for stream decoding (inflateInit(), inflate(), and inflateEnd()).
liblzma has only a single coding function and a single ending
function, which are shared by encoders and decoders. Thus, to
encode one may use, for example, lzma_stream_encoder_single(),
lzma_code(), and lzma_end(). Similarly, for decoding one may
use lzma_auto_decoder(), lzma_code(), and lzma_end().

zlib has deflateReset() and inflateReset() to reset the stream
structure without reallocating all the memory. In liblzma, all
coder initialization functions are like zlib's reset functions:
the first-time initializations are done with the same functions
as the reinitializations (resetting).

To make all this work, liblzma needs to know when an lzma_stream
doesn't already point to an allocated and initialized coder.
This is achieved by initializing the lzma_stream structure with
LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR
(for example, when a new lzma_stream has been allocated with malloc()).
This initialization should be done exactly once per lzma_stream
structure to avoid leaking memory. Calling lzma_end() leaves the
lzma_stream in a state comparable to the state achieved with
LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR.

An example probably clarifies a lot. With zlib, compression goes
roughly like this:

    z_stream strm;
    deflateInit(&strm, level);
    deflate(&strm, Z_NO_FLUSH);
    deflate(&strm, Z_NO_FLUSH);
    ...
    deflate(&strm, Z_FINISH);
    deflateEnd(&strm) or deflateReset(&strm)

With liblzma, it's slightly different:

    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_stream_encoder_single(&strm, &options);
    lzma_code(&strm, LZMA_RUN);
    lzma_code(&strm, LZMA_RUN);
    ...
    lzma_code(&strm, LZMA_FINISH);
    lzma_end(&strm) or reinitialize for new coding work

The reinitialization in the last step can use any function that can
initialize an lzma_stream; it doesn't need to be the same function
that was used for the previous initialization. If it is the same
function, liblzma will usually be able to reuse most of the
existing memory allocations (how much depends on how much the
initialization options change). If you reinitialize with a different
function, liblzma will automatically free the memory of the previous
coder.
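
The reuse-or-free behavior described above can be pictured with a toy
model. The sketch below is not liblzma code: toy_stream,
TOY_STREAM_INIT, toy_init(), and toy_end() are hypothetical names made
up for illustration, and the library's real internals may well differ.
It only tries to show why the structure has to be marked as "no coder
yet" exactly once, and how one function can serve as both the
first-time initializer and the reset function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical coder kinds; these are not liblzma identifiers. */
enum toy_kind { TOY_NONE, TOY_ENCODER, TOY_DECODER };

struct toy_coder {
    enum toy_kind kind;
    unsigned char *buf;      /* working memory owned by the coder */
    size_t buf_size;
};

struct toy_stream {
    struct toy_coder *coder; /* NULL means "no coder allocated yet" */
};

/* Plays the role of LZMA_STREAM_INIT: marks the stream as empty. */
#define TOY_STREAM_INIT { NULL }

/* Frees the coder and returns the stream to the "empty" state. */
static void toy_end(struct toy_stream *strm)
{
    assert(strm != NULL);

    if (strm->coder != NULL) {
        free(strm->coder->buf);
        free(strm->coder);
        strm->coder = NULL;
    }
}

/*
 * First-time initialization and reinitialization in one function.
 * Returns 0 on success and -1 if memory allocation fails.
 */
static int toy_init(struct toy_stream *strm, enum toy_kind kind,
        size_t buf_size)
{
    assert(strm != NULL);
    assert(buf_size > 0);

    /* Same kind and same options: keep the existing allocations. */
    if (strm->coder != NULL && strm->coder->kind == kind
            && strm->coder->buf_size == buf_size)
        return 0;

    /* Different coder or options: free the previous coder first. */
    toy_end(strm);

    struct toy_coder *coder = malloc(sizeof(*coder));
    if (coder == NULL)
        return -1;

    coder->buf = malloc(buf_size);
    if (coder->buf == NULL) {
        free(coder);
        return -1;
    }

    coder->kind = kind;
    coder->buf_size = buf_size;
    strm->coder = coder;
    return 0;
}
```

In this model a caller declares the stream with TOY_STREAM_INIT, calls
toy_init() as many times as needed, and calls toy_end() when done. If
the initial marker were skipped, the coder pointer would hold garbage
and toy_init() could not tell whether there is anything to reuse or
free. Applying the marker a second time to a live stream would lose
the pointer and leak the coder.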


File formats

liblzma supports multiple container formats for the compressed data.
Different initialization functions initialize the lzma_stream to
process different container formats. See the public header files
for details.

The following functions are the most commonly used:

- lzma_stream_encoder_single(): Encodes a Single-Block Stream; this
  is the recommended format for most purposes.

- lzma_alone_encoder(): Useful if you need to encode into the
  legacy LZMA_Alone format.

- lzma_auto_decoder(): Decoder that automatically detects the
  file format; recommended when you decode compressed files on
  disk, because this way compatibility with the legacy LZMA_Alone
  format is transparent.

- lzma_stream_decoder(): Decoder for Single- and Multi-Block
  Streams; this is good if you want to accept only .lzma Streams.


Filters

liblzma supports multiple filters (algorithm implementations). The new
.lzma format supports filter chains of up to seven filters. In a
filter chain, the output of one filter is the input of the next filter
in the chain. The legacy LZMA_Alone format supports only one filter,
and that must always be LZMA.
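
Chaining can be illustrated with a small sketch in which each filter
is a function that transforms a buffer in place, and a driver applies
the filters in order. Everything here (toy_filter, toy_delta,
toy_invert, toy_run_chain, TOY_CHAIN_MAX) is hypothetical and
unrelated to liblzma's real filter interface; only the limit of seven
filters is taken from the text above. Real filters may also change the
size of the data, which this simplified model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Limit taken from the text; the name is made up for this sketch. */
#define TOY_CHAIN_MAX 7

/* A toy filter transforms the buffer in place. */
typedef void (*toy_filter)(unsigned char *buf, size_t size);

/* Toy filter: replace each byte with its difference from the previous
 * input byte. */
static void toy_delta(unsigned char *buf, size_t size)
{
    unsigned char prev = 0;

    for (size_t i = 0; i < size; ++i) {
        unsigned char cur = buf[i];
        buf[i] = (unsigned char)(cur - prev);
        prev = cur;
    }
}

/* Toy filter: flip every bit. */
static void toy_invert(unsigned char *buf, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        buf[i] = (unsigned char)~buf[i];
}

/*
 * Runs the filters in order, so the output of chain[i] is the input
 * of chain[i + 1]. Returns 0 on success, -1 if the chain is too long.
 */
static int toy_run_chain(const toy_filter *chain, size_t count,
        unsigned char *buf, size_t size)
{
    assert(chain != NULL || count == 0);

    if (count > TOY_CHAIN_MAX)
        return -1;

    for (size_t i = 0; i < count; ++i)
        chain[i](buf, size);

    return 0;
}
```

Passing the array { toy_delta, toy_invert } to toy_run_chain() first
takes byte differences and then inverts the result; swapping the order
of the two entries generally produces different output, which is why
the order of filters in a chain matters.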

General-purpose compression:

    LZMA        The main algorithm of liblzma (surprise!)

Branch/Call/Jump filters for executables:

    x86         This filter is known as BCJ in 7-Zip
    IA64        IA-64 (Itanium)
    PowerPC     Big endian PowerPC
    ARM
    ARM-Thumb
    SPARC

Other filters:

    Copy        Dummy filter that simply copies all the data
                from input to output.

    Subblock    Multi-purpose filter that can
                  - embed End of Payload Marker if the previous
                    filter in the chain doesn't support it; and
                  - apply Subfilters, which filter only part
                    of the same compressed Block in the Stream.

Branch/Call/Jump filters never change the size of the data. They
should usually be used as a pre-filter for some compression filter
like LZMA.
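
To give a feel for why a Branch/Call/Jump filter keeps the size
unchanged and why it helps as a pre-filter, here is a deliberately
simplified sketch of the idea for x86. It rewrites the 32-bit operand
of every CALL opcode (0xE8) from a relative displacement to an
absolute address, so that calls to the same target from different
places end up with identical operand bytes, which gives a compressor
such as LZMA more repetition to find. The function toy_bcj_x86() is a
hypothetical illustration, not the filter used by liblzma or 7-Zip;
real BCJ filters are more selective about which byte sequences they
convert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy x86 branch converter. For every CALL opcode (0xE8) it rewrites
 * the 32-bit little-endian operand that follows: relative to absolute
 * when encoding is nonzero, absolute to relative otherwise. Only
 * operand bytes are overwritten, so the buffer size never changes.
 */
static void toy_bcj_x86(unsigned char *buf, size_t size, int encoding)
{
    assert(buf != NULL || size == 0);

    size_t i = 0;

    while (size >= 5 && i <= size - 5) {
        if (buf[i] != 0xE8) {
            ++i;
            continue;
        }

        uint32_t value = (uint32_t)buf[i + 1]
                | ((uint32_t)buf[i + 2] << 8)
                | ((uint32_t)buf[i + 3] << 16)
                | ((uint32_t)buf[i + 4] << 24);

        /* Offset of the instruction that follows the CALL. */
        uint32_t next = (uint32_t)(i + 5);

        /* Unsigned arithmetic wraps modulo 2^32 in both directions. */
        value = encoding ? value + next : value - next;

        buf[i + 1] = (unsigned char)(value & 0xFF);
        buf[i + 2] = (unsigned char)((value >> 8) & 0xFF);
        buf[i + 3] = (unsigned char)((value >> 16) & 0xFF);
        buf[i + 4] = (unsigned char)((value >> 24) & 0xFF);

        /* Skip the operand so it is never mistaken for an opcode. */
        i += 5;
    }
}
```

Because the operand bytes are skipped after each conversion, the
encoder and the decoder visit the same opcode positions, and decoding
restores the original bytes exactly.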


Integrity checks

The .lzma Stream format uses CRC32 as the integrity check for
different file format headers. It is possible to omit CRC32 from
the Block Headers, but not from the Stream Header. This is the reason
why the CRC32 code cannot be disabled when building liblzma (in
addition, the LZMA encoder uses CRC32 for hashing, so that's another
reason).

The integrity check of the actual data is calculated from the
uncompressed data. This check can be CRC32, CRC64, or SHA256.
It can also be omitted completely, although that usually is not
a good thing to do. There are free IDs left, so support for new
check algorithms can be added later.
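
Since the data check is calculated from the uncompressed data, a
decoder can recompute it over its own output and compare the result
with the stored value. The sketch below shows that comparison with a
plain bitwise CRC32, assuming the common IEEE 802.3 variant (reflected
polynomial 0xEDB88320). The names toy_crc32() and toy_check_ok() are
hypothetical; this is not liblzma's implementation, which one would
expect to be considerably faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 (IEEE 802.3), processing one bit at a time. */
static uint32_t toy_crc32(const unsigned char *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];

        /* XOR in the polynomial whenever the lowest bit is set. */
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }

    return ~crc;
}

/* Returns 1 if the uncompressed data matches the stored check value. */
static int toy_check_ok(const unsigned char *uncompressed, size_t size,
        uint32_t stored)
{
    return toy_crc32(uncompressed, size) == stored;
}
```

For the nine ASCII bytes "123456789" this CRC32 variant yields the
well-known check value 0xCBF43926, so toy_check_ok() accepts that pair
and rejects the same value once any byte of the data has been altered.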


API and ABI stability

The API and ABI of liblzma aren't stable yet, although no huge
changes should happen. One potential place for change is the
lzma_options_subblock structure.

In the 4.42.0alpha phase, the shared library version number won't
be updated even if the ABI breaks. I don't want to track the ABI
changes yet. Just rebuild everything when you upgrade liblzma until
we get to the beta stage.


Size of the library

While liblzma isn't huge, it is quite far from the smallest possible
LZMA implementation: the full liblzma binary (with support for all
filters and other features) is well over 100 KiB, but the plain raw
LZMA decoder is only 5-10 KiB.

To decrease the size of the library, you can omit parts of the library
by passing certain options to the `configure' script. Disabling
everything but the decoders of the required filters will usually give
you a small enough library, but if you need a decoder that is, for
example, embedded in the operating system kernel, the code from
liblzma probably isn't suitable as is.

If you need a minimal implementation supporting .lzma Streams, you
may need to do a partial rewrite. liblzma uses a stateful API like
zlib. That increases the size of the library. Using a callback API or
an even simpler buffer-to-buffer API would allow a smaller
implementation.

The LZMA SDK contains an LZMA decoder written in ANSI C that is
smaller than the one in liblzma, so you may want to take a look at
that code. However, it doesn't (at least not yet) support the new
.lzma Stream format.


Documentation

There's no documentation yet other than the public headers and this
text. Real docs will be written some day, I hope.