56 lines
1.9 KiB
Markdown
56 lines
1.9 KiB
Markdown
# Issues in the XZ file format
|
|
|
|
During the development of the xz package for Go a number of issues with
|
|
the xz file format were observed. They are documented here to help a
|
|
later development of an improved format.
|
|
|
|
# xz file format
|
|
|
|
## Consistency
|
|
|
|
## General
|
|
|
|
Packets should either be constant size or should have encoded the size
|
|
in the header. File header and footer are constant size and the block
|
|
header has the size encoded. The index doesn't fulfill the criteria even
|
|
when its size is included in the footer.
|
|
|
|
## Index
|
|
|
|
The index doesn't have the size in the header. So in a stream you are
|
|
forced to read the whole index to identify its length.
|
|
|
|
The index should have made optional. This would require to remove the
|
|
index size from the footer and include its own footer in the index.
|
|
|
|
## Padding
|
|
|
|
The padding should allow direct mapping of the CRC values into memory, but it
|
|
wastes bytes bearing no information. This is certainly not optimal for a
|
|
compression format. It is argued alignment makes it faster to read and
|
|
write the checksum values, but the time spent there is much less than on
|
|
encoding and decoding itself.
|
|
|
|
## Filters for each block
|
|
|
|
Filters should have been defined in front of blocks. This way they
|
|
would not need to be repeated.
|
|
|
|
# LZMA2
|
|
|
|
## Consistent header byte.
|
|
|
|
LZMA2 consists of a series of chunks with a header byte. The header byte
|
|
has a different format depending on whether it is an uncompressed or
|
|
compressed chunk. This has the consequence a complete reset of state,
|
|
properties and dictionary is not possible with an uncompressed chunk.
|
|
The encoder has to keep a state variable tracking a dictionary reset in
|
|
an uncompressed chunk to ensure that the flags are added in the first
|
|
compressed chunk to follow. This complicates the implementation of the
|
|
encoder and decoder.
|
|
|
|
## Dictionary capacity is not encoded
|
|
|
|
LZMA2 doesn't encode the dictionary capacity, so LZMA2 doesn't work
|
|
standalone.
|