Remove docs that are too outdated to be updated

(rewrite will be better).
Lasse Collin 2009-05-01 11:28:52 +03:00
parent 0255401e57
commit be06858d5c
5 changed files with 0 additions and 956 deletions


@@ -1,324 +0,0 @@
Advanced features of liblzma
----------------------------
0. Introduction
Most developers need only the basic features of liblzma. These
features allow single-threaded encoding and decoding of .lzma files
in streamed mode.
In some cases developers want more. The .lzma file format is
designed to allow multi-threaded encoding and decoding and limited
random-access reading. These features are possible in non-streamed
mode and, to a limited extent, also in streamed mode.
To take advantage of these features, the application needs a custom
.lzma file format handler. liblzma provides a set of tools to ease
this task, but it's still quite a bit of work to get a good custom
.lzma handler done.
1. Where to begin
Start by reading the .lzma file format specification. Understanding
the basics of the .lzma file structure is required to implement a
custom .lzma file handler and to understand the rest of this document.
2. The basic components
2.1. Stream Header and tail
Stream Header begins the .lzma Stream and Stream tail ends it. Stream
Header is defined in the file format specification, but Stream tail
isn't (thus I write "tail" with a lower-case letter). Stream tail is
simply the Stream Flags and the Footer Magic Bytes fields together.
It was done this way in liblzma because the Block coders take care
of the rest of the Stream Footer.
For now, the size of Stream Header is fixed to 11 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
should use instead of a hardcoded number. Similarly, Stream tail
is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
It is possible that a future version of the .lzma format will have
a variable-sized Stream Header and tail. As of writing, this seems so
unlikely that it was considered simplest to just use constants instead
of providing functions to get and store the sizes of the Stream Header
and tail.
2.x. Stream tail
For now, the size of Stream tail is fixed to 3 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
should use instead of a hardcoded number.
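As a minimal illustration (the variable names here are made up), the
constants can be used to size buffers instead of hardcoding 11 and 3:

    #include <sys/types.h>
    #include <inttypes.h>
    #include <lzma.h>

    /* Buffers big enough for Stream Header and Stream tail. */
    static uint8_t stream_header_buf[LZMA_STREAM_HEADER_SIZE];
    static uint8_t stream_tail_buf[LZMA_STREAM_TAIL_SIZE];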
3. Keeping track of size information
The lzma_info_* functions found in <lzma/info.h> should ease the
task of keeping track of the sizes of the Blocks and also of the Stream
as a whole. Using these functions is strongly recommended, because
there are surprisingly many situations where an error can occur,
and these functions check for possible errors every time some new
information becomes available.
If you find lzma_info_* functions lacking something that you would
find useful, please contact the author.
3.1. Start offset of the Stream
If you are storing the .lzma Stream inside another file format, or
for some other reason are placing the .lzma Stream somewhere other
than the beginning of the file, you should tell the starting
offset of the Stream using lzma_info_start_offset_set().
The start offset of the Stream is used for two distinct purposes.
First, knowing the start offset of the Stream allows
lzma_info_alignment_get() to correctly calculate the alignment of
every Block. This information is given to the Block encoder, which
will calculate the size of Header Padding so that Compressed Data
is aligned at an optimal offset.
The other use for the start offset of the Stream is random-access
reading. If you set the start offset of the Stream, lzma_info_locate()
will be able to calculate the offset relative to the beginning of the
file containing the Stream (instead of offset relative to the
beginning of the Stream).
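The following sketch shows the plain arithmetic behind these two uses;
it is not the lzma_info_* API itself, and the function names are made
up for illustration:

    #include <inttypes.h>

    /* Absolute position of a Block in the file containing the Stream. */
    static uint64_t
    block_file_offset(uint64_t stream_start, uint64_t offset_in_stream)
    {
        return stream_start + offset_in_stream;
    }

    /* Alignment of the Block at that position, e.g. modulo four bytes. */
    static uint32_t
    block_alignment(uint64_t stream_start, uint64_t offset_in_stream,
                    uint32_t alignment_unit)
    {
        return (uint32_t)((stream_start + offset_in_stream)
                % alignment_unit);
    }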
3.2. Size of Stream Header
While the size of Stream Header is constant (11 bytes) in the current
version of the .lzma file format, this may change in the future.
3.3. Size of Header Metadata Block
This information is needed when doing random-access reading, and
to verify the value of this field stored in Footer Metadata Block.
3.4. Total Size of the Data Blocks
3.5. Uncompressed Size of Data Blocks
3.6. Index
x. Alignment
There are a few slightly different types of alignment issues when
working with .lzma files.
The .lzma format doesn't strictly require any kind of alignment.
However, if the encoder carefully optimizes the alignment in all
situations, it can improve compression ratio, speed of the encoder
and decoder, and slightly help if the files get damaged and need
recovery.
Alignment has the most significant effect on compression ratio. FIXME
x.1. Compression ratio
Some filters take advantage of the alignment of the input data.
To get the best compression ratio, make sure that you feed these
filters correctly aligned data.
Some filters (e.g. LZMA) don't necessarily mind too much if the
input doesn't match the preferred alignment. With these filters
the penalty in compression ratio depends on the specific type of
data being compressed.
Other filters (e.g. PowerPC executable filter) won't work at all
with data that is improperly aligned. While the data can still
be de-filtered back to its original form, the benefit of the
filtering (better compression ratio) is completely lost, because
these filters expect certain patterns at properly aligned offsets.
The compression ratio may be even worse with incorrectly aligned input
than without the filter.
x.1.1. Inter-filter alignment
When there are multiple filters chained, checking the alignment can
be useful not only with the input of the first filter and output of
the last filter, but also between the filters.
Inter-filter alignment is especially important with the Subblock filter.
x.1.2. Further compression with external tools
This is a relatively rare situation in practice, but it is still worth
understanding.
Let's say that there are several SPARC executables, which are each
filtered to separate .lzma files using only the SPARC filter. If
Uncompressed Size is written to the Block Header, the size of Block
Header may vary between the .lzma files. If no Padding is used in
the Block Header to correct the alignment, the starting offset of
the Compressed Data field will be differently aligned in different
.lzma files.
All these .lzma files are archived into a single .tar archive. Due
to the nature of the .tar format, every file is aligned inside the
archive to an offset that is a multiple of 512 bytes.
The .tar archive is compressed into a new .lzma file using the LZMA
filter with options that prefer an input alignment of four bytes. Now
if the independent .lzma files don't have the same alignment of
the Compressed Data fields, the LZMA filter will be unable to take
advantage of the input alignment between the files in the .tar
archive, which reduces compression ratio.
Thus, even if you have only a single Block per file, it can be good for
the compression ratio to align the Compressed Data to an optimal offset.
x.2. Speed
Most modern computers are faster when multi-byte data is located
at aligned offsets in RAM. Proper alignment of the Compressed Data
fields can slightly increase the speed of some filters.
x.3. Recovery
Aligning every Block Header to start at an offset with big enough
alignment may ease or at least speed up recovery of broken files.
y. Typical usage cases
y.x. Parsing the Stream backwards
You may need to parse the Stream backwards if you need to get
information such as the sizes of the Stream, Index, or Extra.
The basic procedure is as follows.
Locate the end of the Stream. If the Stream is stored as is in a
standalone .lzma file, simply seek to the end of the file and start
reading backwards using an appropriate buffer size. The file format
specification allows an arbitrary amount of Footer Padding (zero or more
NUL bytes), which you must skip before trying to decode the Stream tail.
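A minimal sketch of that backwards scan over the Footer Padding (the
function name is made up for illustration):

    #include <stddef.h>
    #include <inttypes.h>

    /* Returns the position just past the last non-NUL byte in buf,
       or zero if the whole buffer is Footer Padding. */
    static size_t
    skip_footer_padding(const uint8_t *buf, size_t size)
    {
        while (size > 0 && buf[size - 1] == 0x00)
            --size;
        return size;
    }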
Once you have located the end of the Stream (a non-NUL byte), make
sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
Stream in a buffer. If there aren't enough bytes left in the file,
the file is too small to contain a valid Stream. Decode the Stream
tail using lzma_stream_tail_decoder(). Store the offset of the first
byte of the Stream tail; you will need it later.
You may now want to do some internal verification, e.g. check that the
Check type is supported by the liblzma build you are using.
Decode the Backward Size field with lzma_vli_reverse_decode(). The
field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
Backward Size is not zero. Store the offset of the first byte of
the Backward Size; you will need it later.
Now you know the Total Size of the last Block of the Stream. It's the
value of Backward Size plus the size of the Backward Size field. Note
that you cannot use lzma_vli_size() to calculate the size since there
might be padding; you need to use the real observed size of the
Backward Size field.
At this point, the operation continues differently for Single-Block
and Multi-Block Streams.
y.x.1. Single-Block Stream
There might be an Uncompressed Size field present in the Stream Footer.
You cannot know this for sure unless you have already parsed the Block
Header earlier. For security reasons, you probably want to try to
decode the Uncompressed Size field, but you must not indicate any
error if decoding fails. Later you can give the decoded Uncompressed
Size to the Block decoder if Uncompressed Size isn't otherwise known;
this prevents it from producing too much output in case of a (possibly
intentionally) corrupt file.
Calculate the start offset of the Stream:
    backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
invalid input files.
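A minimal sketch of that calculation with the overflow checks made
explicit; the function name is made up for illustration, and
LZMA_STREAM_HEADER_SIZE is passed in as a plain argument:

    #include <stdbool.h>
    #include <inttypes.h>

    static bool
    calc_stream_start(uint64_t backward_offset, uint64_t backward_size,
                      uint64_t header_size, uint64_t *stream_start)
    {
        /* Reject values that would make the Stream start before the
           beginning of the file; with unsigned arithmetic this also
           prevents wraparound. */
        if (header_size > backward_offset)
            return false;
        if (backward_size > backward_offset - header_size)
            return false;

        *stream_start = backward_offset - backward_size - header_size;
        return true;
    }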
Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found from Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.
Decode the Block Header. Verify that it isn't a Metadata Block, since
Single-Block Streams cannot have Metadata. If Uncompressed Size is
present in the Block Header, the value you tried to decode from the
Stream Footer must be ignored, since Uncompressed Size wasn't actually
present there. If Block Header doesn't have Uncompressed Size, and
decoding the Uncompressed Size field from the Stream Footer failed,
the file is corrupt.
If you were only looking for the Uncompressed Size of the Stream,
you now got that information, and you can stop processing the Stream.
To decode the Block, the same instructions apply as described in
FIXME. However, because you have some extra known information decoded
from the Stream Footer, you should give this information to the Block
decoder so that it can verify it while decoding:
- If Uncompressed Size is not present in the Block Header, set
lzma_options_block.uncompressed_size to the value you decoded
from the Stream Footer.
- Always set lzma_options_block.total_size to backward_size +
size_of_backward_size (you calculated this sum earlier already).
y.x.2. Multi-Block Stream
Calculate the start offset of the Footer Metadata Block:
    backward_offset - backward_size
backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
broken input files.
Decode the Block Header. Verify that it is a Metadata Block. Set
lzma_options_block.total_size to backward_size + size_of_backward_size
(you calculated this sum earlier already). Then decode the Footer
Metadata Block.
Store the decoded Footer Metadata in the lzma_info structure using
lzma_info_set_metadata(). Also set the offset of the Backward Size
field using lzma_info_size_set(). Then you can get the start offset
of the Stream using lzma_info_size_get(). Note that any of these steps
may fail so don't omit error checking.
Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found from Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.
If you were only looking for the Uncompressed Size of the Stream,
it's possible that you already have it now. If Uncompressed Size (or
whatever information you were looking for) isn't available yet,
continue by decoding also the Header Metadata Block. (If some
information is missing, the Header Metadata Block has to be present.)
Decoding the Data Blocks goes the same way as described in FIXME.
y.x.3. Variations
If you know the offset of the beginning of the Stream, you may want
to parse the Stream Header before parsing the Stream tail.


@@ -1,112 +0,0 @@
Hacking liblzma
---------------
0. Preface
This document gives some overall information about the internals of
liblzma, which should make it easier to start reading and modifying
the code.
1. Programming language
liblzma was written in C99. If you use GCC, this means that you need
at least GCC 3.x.x. GCC 2 isn't and won't be supported.
Some GCC-specific extensions are used *conditionally*. They aren't
required to build a full-featured library. Don't make the code rely
on any non-standard compiler extensions or even C99 features that
aren't portable between almost-C99 compatible compilers (for example
non-static inlines).
The public API headers are in C89. This is to avoid frustrating those
who maintain programs that are strictly C89 or C++.
An assumption about sizeof(size_t) is made. If this assumption is
wrong, some porting is probably needed:
    sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t)
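A compile-time check along these lines can verify the assumption; the
macro below is only an illustration, not something liblzma provides:

    #include <stddef.h>
    #include <inttypes.h>

    /* Fails to compile (negative array size) if the condition is false. */
    #define MY_STATIC_ASSERT(name, cond) \
            typedef char my_static_assert_ ## name[(cond) ? 1 : -1]

    MY_STATIC_ASSERT(size_t_not_too_small,
            sizeof(uint32_t) <= sizeof(size_t));
    MY_STATIC_ASSERT(size_t_not_too_big,
            sizeof(size_t) <= sizeof(uint64_t));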
2. Internal vs. external API
    Input                               Output

      v           Application             ^
      |       liblzma public API          |
      |          Stream coder             |
      |          Block coder              |
      |          Filter coder             |
      |              ...                  |
      v          Filter coder             ^

    Application
     `-- liblzma public API
          `-- Stream coder
               |-- Stream info handler
               |-- Stream Header coder
               |-- Block Header coder
               |    `-- Filter Flags coder
               |-- Metadata coder
               |    `-- Block coder
               |         `-- Filter 0
               |              `-- Filter 1
               |                   ...
               |-- Data Block coder
               |    `-- Filter 0
               |         `-- Filter 1
               |              ...
               `-- Stream tail coder
x. Designing new filters
All filters must be designed so that the decoder cannot consume an
arbitrary amount of input without producing any decoded output. Failing
to follow this rule makes liblzma vulnerable to DoS attacks if
untrusted files are decoded (usually they are untrusted).
An example should clarify the reason behind this requirement: There
are two filters in the chain. The decoder of the first filter produces
a huge amount of output (many gigabytes or more) from a few bytes of
input, which gets passed to the decoder of the second filter. If the
data passed to the second filter is interpreted as something that
produces no output (e.g. padding), the filter chain as a whole
produces no output and consumes no input for a long period of time.
The above problem was present in the first versions of the Subblock
filter. A tiny .lzma file could have taken several years to decode
while it wouldn't produce any output at all. The problem was fixed
by adding limits on the number of consecutive Padding bytes, and by requiring
that some decoded output must be produced between Set Subfilter and
Unset Subfilter.
x. Implementing new filters
If the filter supports embedding End of Payload Marker, make sure that
when your filter detects End of Payload Marker,
- the usage of End of Payload Marker is actually allowed (i.e. End
of Input isn't used); and
- it also checks that there is no more input coming from the next
filter in the chain.
The second requirement is slightly tricky. It's possible that the next
filter hasn't returned LZMA_STREAM_END yet. It may even need a few
bytes more input before it will do so. You need to give it as much
input as it needs, and verify that it doesn't produce any output.
Don't call the next filter in the chain after it has returned
LZMA_STREAM_END (except in an encoder when action == LZMA_SYNC_FLUSH).
Doing so results in undefined behavior.
Be pedantic. If the input data isn't exactly valid, reject it.
At the moment, liblzma isn't modular. You will need to edit several
files in src/liblzma/common to include support for a new filter. grep
for LZMA_FILTER_LZMA to locate the files needing changes.


@@ -1,194 +0,0 @@
Introduction to liblzma
-----------------------
Writing applications to work with liblzma
The liblzma API is split into several subheaders to improve readability
and maintainability. The subheaders must not be #included directly. lzma.h
requires that certain integer types and macros are available when
the header is #included. On systems that have inttypes.h that conforms
to C99, the following will work:
    #include <sys/types.h>
    #include <inttypes.h>
    #include <lzma.h>
Those who have used zlib should find liblzma's API easy to use.
To developers who haven't used zlib before, I recommend learning
zlib first, because zlib has excellent documentation.
While the API is similar to that of zlib, there are some major
differences, which are summarized below.
For basic stream encoding, zlib has three functions (deflateInit(),
deflate(), and deflateEnd()). Similarly, there are three functions
for stream decoding (inflateInit(), inflate(), and inflateEnd()).
liblzma has only a single coding function and a single ending function.
Thus, to encode one may use, for example, lzma_stream_encoder_single(),
lzma_code(), and lzma_end(). Similarly, for decoding one may
use lzma_auto_decoder(), lzma_code(), and lzma_end().
zlib has deflateReset() and inflateReset() to reset the stream
structure without reallocating all the memory. In liblzma, all
coder initialization functions are like zlib's reset functions:
the first-time initializations are done with the same functions
as the reinitializations (resetting).
To make all this work, liblzma needs to know when lzma_stream
doesn't already point to an allocated and initialized coder.
This is achieved by initializing the lzma_stream structure with
LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR
(for example, when a new lzma_stream has been allocated with malloc()).
This initialization should be done exactly once per lzma_stream
structure to avoid leaking memory. Calling lzma_end() will leave
the lzma_stream in a state comparable to the state achieved with
LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR.
An example probably clarifies things a lot. With zlib, compression goes
roughly like this:

    z_stream strm;
    deflateInit(&strm, level);
    deflate(&strm, Z_NO_FLUSH);
    deflate(&strm, Z_NO_FLUSH);
    ...
    deflate(&strm, Z_FINISH);
    deflateEnd(&strm) or deflateReset(&strm)

With liblzma, it's slightly different:

    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_stream_encoder_single(&strm, &options);
    lzma_code(&strm, LZMA_RUN);
    lzma_code(&strm, LZMA_RUN);
    ...
    lzma_code(&strm, LZMA_FINISH);
    lzma_end(&strm) or reinitialize for new coding work
Reinitialization in the last step can be any function that can
initialize lzma_stream; it doesn't need to be the same function
that was used for the previous initialization. If it is the same
function, liblzma will usually be able to re-use most of the
existing memory allocations (depends on how much the initialization
options change). If you reinitialize with a different function,
liblzma will automatically free the memory of the previous coder.
File formats
liblzma supports multiple container formats for the compressed data.
Different initialization functions initialize the lzma_stream to
process different container formats. See the public header files
for details.
The following functions are the most commonly used:
- lzma_stream_encoder_single(): Encodes a Single-Block Stream; this
is the recommended format for most purposes.
- lzma_alone_encoder(): Useful if you need to encode into the
legacy LZMA_Alone format.
- lzma_auto_decoder(): Decoder that automatically detects the
file format; recommended when you decode compressed files on
disk, because this way compatibility with the legacy LZMA_Alone
format is transparent.
- lzma_stream_decoder(): Decoder for Single- and Multi-Block
Streams; this is good if you want to accept only .lzma Streams.
Filters
liblzma supports multiple filters (algorithm implementations). The new
.lzma format supports a filter chain of up to seven filters. In the
filter chain, the output of one filter is the input of the next filter in
the chain. The legacy LZMA_Alone format supports only one filter, and
that must always be LZMA.
General-purpose compression:
  LZMA        The main algorithm of liblzma (surprise!)
Branch/Call/Jump filters for executables:
  x86         This filter is known as BCJ in 7-Zip
  IA64        IA-64 (Itanium)
  PowerPC     Big endian PowerPC
  ARM
  ARM-Thumb
  SPARC
Other filters:
  Copy        Dummy filter that simply copies all the data
              from input to output.
  Subblock    Multi-purpose filter that can
                - embed End of Payload Marker if the previous
                  filter in the chain doesn't support it; and
                - apply Subfilters, which filter only part
                  of the same compressed Block in the Stream.
Branch/Call/Jump filters never change the size of the data. They
should usually be used as a pre-filter for some compression filter
like LZMA.
Integrity checks
The .lzma Stream format uses CRC32 as the integrity check for
different file format headers. It is possible to omit CRC32 from
the Block Headers, but not from Stream Header. This is the reason
why CRC32 code cannot be disabled when building liblzma (in addition,
the LZMA encoder uses CRC32 for hashing, so that's another reason).
The integrity check of the actual data is calculated from the
uncompressed data. This check can be CRC32, CRC64, or SHA256.
It can also be omitted completely, although that usually is not
a good thing to do. There are free IDs left, so support for new
check algorithms can be added later.
API and ABI stability
The API and ABI of liblzma aren't stable yet, although no huge
changes should happen. One potential place for change is the
lzma_options_subblock structure.
In the 4.42.0alpha phase, the shared library version number won't
be updated even if ABI breaks. I don't want to track the ABI changes
yet. Just rebuild everything when you upgrade liblzma until we get
to the beta stage.
Size of the library
While liblzma isn't huge, it is quite far from the smallest possible
LZMA implementation: full liblzma binary (with support for all
filters and other features) is way over 100 KiB, but the plain raw
LZMA decoder is only 5-10 KiB.
To decrease the size of the library, you can omit parts of the library
by passing certain options to the `configure' script. Disabling
everything but the decoders of the required filters will usually give
you a small enough library, but if you need a decoder for example
embedded in the operating system kernel, the code from liblzma probably
isn't suitable as is.
If you need a minimal implementation supporting .lzma Streams, you
may need to do a partial rewrite. liblzma uses a stateful API like zlib,
which increases the size of the library. Using a callback API or an even
simpler buffer-to-buffer API would allow a smaller implementation.
LZMA SDK contains an LZMA decoder written in ANSI C that is smaller
than the one in liblzma, so you may want to take a look at that code.
However, it doesn't (at least not yet) support the new .lzma Stream format.
Documentation
There's no other documentation than the public headers and this
text yet. Real docs will be written some day, I hope.


@@ -1,219 +0,0 @@
Using liblzma securely
----------------------
0. Introduction
This document discusses how to use liblzma securely. There are issues
that don't apply to zlib or libbzip2, so reading this document is
strongly recommended even for those who are very familiar with zlib
or libbzip2.
While making liblzma itself as secure as possible is essential, it's
out of scope of this document.
1. Memory usage
The memory usage of liblzma varies a lot.
1.1. Problem sources
1.1.1. Block coder
The memory requirements of the Block encoder depend on the filters used
and their settings. The memory requirements of the Block decoder depend
on which filters, and with which settings, the Block was encoded.
Usually the memory requirements of a decoder are equal to or less than
the requirements of the encoder with the same settings.
While the typical memory requirements to decode a Block are from a few
hundred kilobytes to tens of megabytes, a maliciously constructed
file can require a lot more RAM to decode. With the current filters,
the maximum amount is about 7 GiB. If you use multi-threaded decoding,
every Block can require this amount of RAM, thus a four-threaded
decoder could suddenly try to allocate 28 GiB of RAM.
If you don't limit the maximum memory usage in any way, and there are
no resource limits set on the operating system side, one malicious
input file can run the system out of memory, or at least make it swap
badly for a long time. This is exceptionally bad on servers, e.g. an
email server doing virus scanning on incoming messages.
1.1.2. Metadata decoder
Multi-Block .lzma files contain at least one Metadata Block.
Externally the Metadata Blocks are similar to Data Blocks, so all
the issues mentioned about memory usage of Data Blocks apply to
Metadata Blocks too.
The uncompressed content of a Metadata Block contains information about
the Stream as a whole, and optionally some Extra Records. The
information about the Stream is kept in liblzma's internal data
structures in RAM. Extra Records can contain arbitrary data. They are
not interpreted by liblzma, but liblzma will provide them to the
application in uninterpreted form if the application wishes so.
Usually the Uncompressed Size of a Metadata Block is small. Even in
extreme cases, it shouldn't be much bigger than a few megabytes. Once
the Metadata has been parsed into native data structures in liblzma,
it usually takes a little more memory than in the encoded form. For
all normal files, this is no problem, since the resulting memory usage
won't be too much.
The problem is that a maliciously constructed Metadata Block can
contain a huge amount of "information", which liblzma will try to store
in its internal data structures. This may cause liblzma to allocate
all the available RAM unless some kind of resource usage limits are
applied.
Note that the Extra Records in Metadata are always parsed, but
memory is allocated for them only if the application has requested
liblzma to provide the Extra Records to the application.
1.2. Solutions
If you need to decode files from untrusted sources (most people do),
you must limit the memory usage to avoid denial of service (DoS)
conditions caused by malicious input files.
The first step is to find out the maximum amount of memory you are
allowed to consume. This may be a hardcoded constant or derived from the
available RAM; whatever is appropriate in the application.
The simplest solution is to use setrlimit() if the kernel supports
RLIMIT_AS, which limits the memory usage of the whole process.
For more portable and fine-grained limiting, you can use
the memory limiter functions found in <lzma/memlimit.h>.
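A minimal sketch of the setrlimit() approach; the limit value is an
arbitrary example and error handling is left to the application:

    #include <sys/resource.h>

    /* Limit the address space of the whole process to max_bytes.
       Returns 0 on success, -1 on error (see errno). */
    static int
    limit_process_memory(rlim_t max_bytes)
    {
        struct rlimit limit;
        limit.rlim_cur = max_bytes;  /* soft limit */
        limit.rlim_max = max_bytes;  /* hard limit */
        return setrlimit(RLIMIT_AS, &limit);
    }

    /* Example: limit_process_memory(100UL * 1024 * 1024);  100 MiB */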
1.2.1. Encoder
lzma_memory_usage() will give you a rough estimate about the memory
usage of the given filter chain. To dramatically simplify the internal
implementation, this function doesn't take into account all the small
helper data structures needed in various places; only the structures
with significant memory usage are taken into account. Still, the
accuracy of this function should be well within a mebibyte.
The Subblock filter is a special case. If a Subfilter has been
specified, it isn't taken into account when lzma_memory_usage()
calculates the memory usage. You need to calculate the memory usage
of the Subfilter separately.
Keeping track of Blocks in a Multi-Block Stream takes a few dozen
bytes of RAM per Block (size of the lzma_index structure plus overhead
of malloc()). It isn't a good idea to put tens of thousands of Blocks
into a Stream unless you have a very good reason to do so (a compressed
dictionary could be an example of such a situation).
Also keep the number and sizes of Extra Records sane. If you produce
the list of Extra Records automatically from some untrusted source,
you should not only validate the content of these Records, but also
their memory usage.
1.2.2. Decoder
A single-threaded decoder should simply use a memory limiter and
indicate an error if it runs out of memory.
Memory-limiting with multi-threaded decoding is tricky. The simple
solution is to divide the maximum allowed memory usage by the
maximum number of threads, and give each Block decoder its own
independent lzma_memory_limiter. The drawback is that if one Block
needs notably more RAM than any other Block, the decoder will run out
of memory when in reality there would be plenty of free RAM.
An attractive alternative would be using shared lzma_memory_limiter.
Depending on the application and the expected type of input, this may
either be the best solution or a source of hard-to-repeat problems.
Consider the following requirements:
- You use a maximum of n threads.
- x(i) is the decoder memory requirements of the Block number i
in an expected input Stream.
- The memory limiter is set to a higher value than the sum of the n
highest values of x(i).
(If you are better at explaining the above conditions, please
contribute your improved version.)
If the above conditions aren't met, it is possible that the decoding
will fail unpredictably. That is, on the same machine using the same
settings, the decoding may sometimes succeed and sometimes fail. This
is because the threads may sometimes run so that the Blocks with the
highest memory usage happen to be decoded at the same time.
Most .lzma files have all the Blocks encoded with identical settings,
or at least the memory usage won't vary dramatically. That's why most
multi-threaded decoders probably want to use the simple "separate
lzma_memory_limiter for each thread" solution, possibly falling back
to single-threaded mode in case the per-thread memory limits aren't
enough in multi-threaded mode.
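A sketch of the simple per-thread split described above, with a fallback
toward fewer threads when the per-thread share would be too small; the
minimum per-thread value is an assumed application-specific number:

    #include <inttypes.h>

    /* Returns how many threads can be used so that every thread still
       gets at least min_per_thread bytes; 1 means single-threaded. */
    static uint32_t
    usable_threads(uint64_t total_limit, uint32_t wanted_threads,
                   uint64_t min_per_thread)
    {
        uint32_t threads = wanted_threads;
        while (threads > 1 && total_limit / threads < min_per_thread)
            --threads;
        return threads;
    }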
FIXME: Memory usage of Stream info.
2. Huge uncompressed output
2.1. Data Blocks
Decoding a tiny .lzma file can produce a huge amount of uncompressed
output. There is an example file of 45 bytes, which decodes to 64 PiB
(that's 2^56 bytes). Uncompressing such a file to disk is likely to
fill even a large disk array. If the data is written to a pipe, it
may not fill the disk, but it would still take a very long time to finish.
To avoid denial of service conditions caused by a huge amount of
uncompressed output, applications using liblzma should use some method
to limit the amount of output produced. The exact method depends on
the application.
All valid .lzma Streams make it possible to find out the uncompressed
size of the Stream without actually uncompressing the data. This
information is available in at least one of the Metadata Blocks.
Once the uncompressed size is parsed, the decoder can verify that
it doesn't exceed certain limits (e.g. available disk space).
When the uncompressed size is known, the decoder can actively keep
track of the amount of output produced so far and verify that it doesn't
exceed the known uncompressed size. If it does, the file is known to be
corrupt and an error should be indicated without continuing to decode
the rest of the file.
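A minimal sketch of such output accounting in the application's own
write loop; the decoder interface is deliberately left out, only the
size check is shown:

    #include <stdbool.h>
    #include <inttypes.h>

    /* Returns true if writing chunk_size more bytes still stays within
       the uncompressed size parsed from the Metadata. */
    static bool
    output_within_limit(uint64_t produced_so_far, uint64_t chunk_size,
                        uint64_t known_uncompressed_size)
    {
        if (chunk_size > UINT64_MAX - produced_so_far)
            return false;  /* overflow: certainly corrupt */
        return produced_so_far + chunk_size <= known_uncompressed_size;
    }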
Unfortunately, finding the uncompressed size beforehand is often
possible only in non-streamed mode, because the needed information
could be in the Footer Metadata Block, which (obviously) is at the
end of the Stream. In purely streamed mode decoding, one may need to
use some rough arbitrary limits to prevent the problems described in
the beginning of this section.
2.2. Metadata
Metadata is stored in Metadata Blocks, which are very similar to
Data Blocks. Thus, the uncompressed size can be huge just like with
Data Blocks. The difference is that the contents of Metadata Blocks
aren't given to the application as is, but are parsed by liblzma. Still,
reading through huge Metadata can take a very long time, effectively
creating a denial of service, just like piping a huge decoded Data Block
to another process would.
At first it would seem that using a memory limiter would prevent
this issue as a side effect. But it does so only if the application
requests liblzma to allocate the Extra Records and provide them to
the application. If Extra Records aren't requested, they aren't
allocated either. Still, the Extra Records are read through
to validate that the Metadata is in the proper format.
The solution is to limit the Uncompressed Size of a Metadata Block
to some relatively large value. This will make liblzma give an
error when the given limit is reached.


@@ -1,107 +0,0 @@
Introduction to the lzma command line tool
------------------------------------------
Overview
The lzma command line tool is similar to gzip and bzip2, but for
compressing and uncompressing .lzma files.
Supported file formats
By default, the tool creates files in the new .lzma format. This can
be overridden with the --format=FMT command line option. Use --format=alone
to create files in the old LZMA_Alone format.
By default, the tool uncompresses both the new .lzma format and
LZMA_Alone format. This is to make it transparent to switch from
the old LZMA_Alone format to the new .lzma format. Since both
formats use the same filename suffix, the average user should never
notice which format was used.
Differences to gzip and bzip2
Standard input and output
Both gzip and bzip2 refuse to write compressed data to a terminal and
read compressed data from a terminal. With gzip (but not with bzip2),
this can be overridden with the `--force' option. lzma follows the
behavior of gzip here.
Usage of LZMA_OPT environment variable
gzip and bzip2 read GZIP and BZIP2 environment variables at startup.
These variables may contain extra command line options.
gzip and bzip2 allow passing not only options but also the end-of-options
indicator (`--') and filenames via the environment variable. Quoting of
the filenames is not supported.
Here are examples with gzip. bzip2 behaves identically.
    bash$ echo asdf > 'foo bar'
    bash$ GZIP='"foo bar"' gzip
    gzip: "foo: No such file or directory
    gzip: bar": No such file or directory

    bash$ GZIP=-- gzip --help
    gzip: --help: No such file or directory
lzma silently ignores all non-option arguments given via the
environment variable LZMA_OPT. Like on the command line, everything
after `--' is taken as non-options, and thus ignored in LZMA_OPT.
    bash$ LZMA_OPT='--help' lzma --version      # Displays help
    bash$ LZMA_OPT='-- --help' lzma --version   # Displays version
Filter chain presets
Like in gzip and bzip2, lzma supports numbered presets from 1 to 9
where 1 is the fastest and 9 the best compression. 1 and 2 are for
fast compressing with small memory usage, 3 to 6 for good compression
ratio with medium memory usage, and 7 to 9 for excellent compression
ratio with higher memory requirements. The default is 7 if the
memory usage limit allows it.
In the future, there will probably be an option like --preset=NAME,
which will provide more specialized presets for specific file types.
It's also possible that there will be some heuristics to select good
filters. For example, the tool could detect when a .tar archive is
being compressed, and enable x86 filter only for those files in the
.tar archive that are ELF or PE executables for x86.
Specifying custom filter chains
Custom filter chains are specified by using long options with the name
of the filters in correct order. For example, to pass the input data to
the x86 filter and the output of that to the LZMA filter, the following
command will do:
    lzma --x86 --lzma filename
Some filters accept options, which are specified as a comma-separated
list of key=value pairs:
    lzma --delta=distance=4 --lzma=dict=4Mi,lc=8,lp=2 filename
Memory usage control
By default, the command line tool limits memory usage to 1/3 of the
available physical RAM. If no preset or custom filter chain has been
given, the default preset will be used. If the memory limit is too
low for the default preset, the tool will silently switch to a lower
preset.
When a preset or a custom filter chain has been specified and the
memory limit is too low, an error message is displayed and no files
are processed.
If the decoder hits the memory usage limit, an error is displayed and
no more files are processed.