Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
hashcat
GitHub Repository: hashcat/hashcat
Path: blob/master/deps/LZMA-SDK/DOC/lzma-specification.txt
9416 views
1
LZMA specification (DRAFT version)
2
----------------------------------
3
4
Author: Igor Pavlov
5
Date: 2015-06-14
6
7
This specification defines the format of LZMA compressed data and lzma file format.
8
9
Notation
10
--------
11
12
We use the syntax of C++ programming language.
13
We use the following types in C++ code:
14
unsigned - unsigned integer, at least 16 bits in size
15
int - signed integer, at least 16 bits in size
16
UInt64 - 64-bit unsigned integer
17
UInt32 - 32-bit unsigned integer
18
UInt16 - 16-bit unsigned integer
19
Byte - 8-bit unsigned integer
20
bool - boolean type with two possible values: false, true
21
22
23
lzma file format
24
================
25
26
The lzma file contains the raw LZMA stream and the header with related properties.
27
28
The files in that format use ".lzma" extension.
29
30
The lzma file format layout:
31
32
Offset Size Description
33
34
0 1 LZMA model properties (lc, lp, pb) in encoded form
35
1 4 Dictionary size (32-bit unsigned integer, little-endian)
36
5 8 Uncompressed size (64-bit unsigned integer, little-endian)
37
13 Compressed data (LZMA stream)
38
39
LZMA properties:
40
41
name Range Description
42
43
lc [0, 8] the number of "literal context" bits
44
lp [0, 4] the number of "literal pos" bits
45
pb [0, 4] the number of "pos" bits
46
dictSize [0, 2^32 - 1] the dictionary size
47
48
The following code encodes LZMA properties:
49
50
void EncodeProperties(Byte *properties)
51
{
52
properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);
53
Set_UInt32_LittleEndian(properties + 1, dictSize);
54
}
55
56
If the value of dictionary size in properties is smaller than (1 << 12),
57
the LZMA decoder must set the dictionary size variable to (1 << 12).
58
59
#define LZMA_DIC_MIN (1 << 12)
60
61
unsigned lc, pb, lp;
62
UInt32 dictSize;
63
UInt32 dictSizeInProperties;
64
65
void DecodeProperties(const Byte *properties)
66
{
67
unsigned d = properties[0];
68
if (d >= (9 * 5 * 5))
69
throw "Incorrect LZMA properties";
70
lc = d % 9;
71
d /= 9;
72
pb = d / 5;
73
lp = d % 5;
74
dictSizeInProperties = 0;
75
for (int i = 0; i < 4; i++)
76
dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);
77
dictSize = dictSizeInProperties;
78
if (dictSize < LZMA_DIC_MIN)
79
dictSize = LZMA_DIC_MIN;
80
}
81
82
If "Uncompressed size" field contains ones in all 64 bits, it means that
83
uncompressed size is unknown and there is the "end marker" in stream,
84
that indicates the end of decoding point.
85
In opposite case, if the value from "Uncompressed size" field is not
86
equal to ((2^64) - 1), the LZMA stream decoding must be finished after
87
specified number of bytes (Uncompressed size) is decoded. And if there
88
is the "end marker", the LZMA decoder must read that marker also.
89
90
91
The new scheme to encode LZMA properties
92
----------------------------------------
93
94
If LZMA compression is used for some another format, it's recommended to
95
use a new improved scheme to encode LZMA properties. That new scheme was
96
used in xz format that uses the LZMA2 compression algorithm.
97
The LZMA2 is a new compression algorithm that is based on the LZMA algorithm.
98
99
The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports
100
only reduced set of dictionary sizes:
101
(2 << 11), (3 << 11),
102
(2 << 12), (3 << 12),
103
...
104
(2 << 30), (3 << 30),
105
(2 << 31) - 1
106
107
The dictionary size can be extracted from encoded value with the following code:
108
109
dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
110
111
Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of
112
"lc" and "lp" properties:
113
114
if (lc + lp > 4)
115
throw "Unsupported properties: (lc + lp) > 4";
116
117
There are some advantages for LZMA decoder with such (lc + lp) value
118
limitation. It reduces the maximum size of tables allocated by decoder.
119
And it reduces the complexity of initialization procedure, that can be
120
important to keep high speed of decoding of big number of small LZMA streams.
121
122
It's recommended to use that limitation (lc + lp <= 4) for any new format
123
that uses LZMA compression. Note that the combinations of "lc" and "lp"
124
parameters, where (lc + lp > 4), can provide significant improvement in
125
compression ratio only in some rare cases.
126
127
The LZMA properties can be encoded into two bytes in new scheme:
128
129
Offset Size Description
130
131
0 1 The dictionary size encoded with LZMA2 scheme
132
1 1 LZMA model properties (lc, lp, pb) in encoded form
133
134
135
The RAM usage
136
=============
137
138
The RAM usage for LZMA decoder is determined by the following parts:
139
140
1) The Sliding Window (from 4 KiB to 4 GiB).
141
2) The probability model counter arrays (arrays of 16-bit variables).
142
3) Some additional state variables (about 10 variables of 32-bit integers).
143
144
145
The RAM usage for Sliding Window
146
--------------------------------
147
148
There are two main scenarios of decoding:
149
150
1) The decoding of full stream to one RAM buffer.
151
152
If we decode full LZMA stream to one output buffer in RAM, the decoder
153
can use that output buffer as sliding window. So the decoder doesn't
154
need additional buffer allocated for sliding window.
155
156
2) The decoding to some external storage.
157
158
If we decode LZMA stream to external storage, the decoder must allocate
159
the buffer for sliding window. The size of that buffer must be equal
160
or larger than the value of dictionary size from properties of LZMA stream.
161
162
In this specification we describe the code for decoding to some external
163
storage. The optimized version of code for decoding of full stream to one
164
output RAM buffer can require some minor changes in code.
165
166
167
The RAM usage for the probability model counters
168
------------------------------------------------
169
170
The size of the probability model counter arrays is calculated with the
171
following formula:
172
173
size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))
174
175
Each probability model counter is 11-bit unsigned integer.
176
If we use 16-bit integer variables (2-byte integers) for these probability
177
model counters, the RAM usage required by probability model counter arrays
178
can be estimated with the following formula:
179
180
RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))
181
182
For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is
183
184
RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB
185
186
The maximum RAM state usage is required for decoding the stream with lp = 4
187
and lc = 8:
188
189
RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB
190
191
If the decoder uses LZMA2's limited property condition
192
(lc + lp <= 4), the RAM usage will be not larger than
193
194
RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB
195
196
197
The RAM usage for encoder
198
-------------------------
199
200
There are many variants for LZMA encoding code.
201
These variants have different values for memory consumption.
202
Note that memory consumption for LZMA Encoder can not be
203
smaller than memory consumption of LZMA Decoder for same stream.
204
205
The RAM usage required by modern effective implementation of
206
LZMA Encoder can be estimated with the following formula:
207
208
Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.
209
210
But there are some modes of the encoder that require less memory.
211
212
213
LZMA Decoding
214
=============
215
216
The LZMA compression algorithm uses LZ-based compression with Sliding Window
217
and Range Encoding as entropy coding method.
218
219
220
Sliding Window
221
--------------
222
223
LZMA uses Sliding Window compression similar to LZ77 algorithm.
224
225
LZMA stream must be decoded to the sequence that consists
226
of MATCHES and LITERALS:
227
228
- a LITERAL is a 8-bit character (one byte).
229
The decoder just puts that LITERAL to the uncompressed stream.
230
231
- a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).
232
The decoder takes one byte exactly "DISTANCE" characters behind
233
current position in the uncompressed stream and puts it to
234
uncompressed stream. The decoder must repeat it "LENGTH" times.
235
236
The "DISTANCE" can not be larger than dictionary size.
237
And the "DISTANCE" can not be larger than the number of bytes in
238
the uncompressed stream that were decoded before that match.
239
240
In this specification we use cyclic buffer to implement Sliding Window
241
for LZMA decoder:
242
243
class COutWindow
244
{
245
Byte *Buf;
246
UInt32 Pos;
247
UInt32 Size;
248
bool IsFull;
249
250
public:
251
unsigned TotalPos;
252
COutStream OutStream;
253
254
COutWindow(): Buf(NULL) {}
255
~COutWindow() { delete []Buf; }
256
257
void Create(UInt32 dictSize)
258
{
259
Buf = new Byte[dictSize];
260
Pos = 0;
261
Size = dictSize;
262
IsFull = false;
263
TotalPos = 0;
264
}
265
266
void PutByte(Byte b)
267
{
268
TotalPos++;
269
Buf[Pos++] = b;
270
if (Pos == Size)
271
{
272
Pos = 0;
273
IsFull = true;
274
}
275
OutStream.WriteByte(b);
276
}
277
278
Byte GetByte(UInt32 dist) const
279
{
280
return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];
281
}
282
283
void CopyMatch(UInt32 dist, unsigned len)
284
{
285
for (; len > 0; len--)
286
PutByte(GetByte(dist));
287
}
288
289
bool CheckDistance(UInt32 dist) const
290
{
291
return dist <= Pos || IsFull;
292
}
293
294
bool IsEmpty() const
295
{
296
return Pos == 0 && !IsFull;
297
}
298
};
299
300
301
In another implementation it's possible to use one buffer that contains
302
Sliding Window and the whole data stream after uncompressing.
303
304
305
Range Decoder
306
-------------
307
308
LZMA algorithm uses Range Encoding (1) as entropy coding method.
309
310
LZMA stream contains just one very big number in big-endian encoding.
311
LZMA decoder uses the Range Decoder to extract a sequence of binary
312
symbols from that big number.
313
314
The state of the Range Decoder:
315
316
struct CRangeDecoder
317
{
318
UInt32 Range;
319
UInt32 Code;
320
InputStream *InStream;
321
322
bool Corrupted;
323
}
324
325
The notes about UInt32 type for the "Range" and "Code" variables:
326
327
It's possible to use 64-bit (unsigned or signed) integer type
328
for the "Range" and the "Code" variables instead of 32-bit unsigned,
329
but some additional code must be used to truncate the values to
330
low 32-bits after some operations.
331
332
If the programming language does not support 32-bit unsigned integer type
333
(like in case of JAVA language), it's possible to use 32-bit signed integer,
334
but some code must be changed. For example, it's required to change the code
335
that uses comparison operations for UInt32 variables in this specification.
336
337
The Range Decoder can be in some states that can be treated as
338
"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted":
339
340
(Corrupted == false), if the Range Decoder has not detected any corruption.
341
(Corrupted == true), if the Range Decoder has detected some corruption.
342
343
The reference LZMA Decoder ignores the value of the "Corrupted" variable.
344
So it continues to decode the stream, even if the corruption can be detected
345
in the Range Decoder. To provide the full compatibility with output of the
346
reference LZMA Decoder, another LZMA Decoder implementations must also
347
ignore the value of the "Corrupted" variable.
348
349
The LZMA Encoder is required to create only such LZMA streams, that will not
350
lead the Range Decoder to states, where the "Corrupted" variable is set to true.
351
352
The Range Decoder reads first 5 bytes from input stream to initialize
353
the state:
354
355
bool CRangeDecoder::Init()
356
{
357
Corrupted = false;
358
Range = 0xFFFFFFFF;
359
Code = 0;
360
361
Byte b = InStream->ReadByte();
362
363
for (int i = 0; i < 4; i++)
364
Code = (Code << 8) | InStream->ReadByte();
365
366
if (b != 0 || Code == Range)
367
Corrupted = true;
368
return b == 0;
369
}
370
371
The LZMA Encoder always writes ZERO in initial byte of compressed stream.
372
That scheme allows to simplify the code of the Range Encoder in the
373
LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must
374
stop decoding and report error.
375
376
After the last bit of data was decoded by Range Decoder, the value of the
377
"Code" variable must be equal to 0. The LZMA Decoder must check it by
378
calling the IsFinishedOK() function:
379
380
bool IsFinishedOK() const { return Code == 0; }
381
382
If there is corruption in data stream, there is big probability that
383
the "Code" value will be not equal to 0 in the Finish() function. So that
384
check in the IsFinishedOK() function provides very good feature for
385
corruption detection.
386
387
The value of the "Range" variable before each bit decoding can not be smaller
388
than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in
389
described range.
390
391
#define kTopValue ((UInt32)1 << 24)
392
393
void CRangeDecoder::Normalize()
394
{
395
if (Range < kTopValue)
396
{
397
Range <<= 8;
398
Code = (Code << 8) | InStream->ReadByte();
399
}
400
}
401
402
Notes: if the size of the "Code" variable is larger than 32 bits, it's
403
required to keep only low 32 bits of the "Code" variable after the change
404
in Normalize() function.
405
406
If the LZMA Stream is not corrupted, the value of the "Code" variable is
407
always smaller than value of the "Range" variable.
408
But the Range Decoder ignores some types of corruptions, so the value of
409
the "Code" variable can be equal or larger than value of the "Range" variable
410
for some "Corrupted" archives.
411
412
413
LZMA uses Range Encoding only with binary symbols of two types:
414
1) binary symbols with fixed and equal probabilities (direct bits)
415
2) binary symbols with predicted probabilities
416
417
The DecodeDirectBits() function decodes the sequence of direct bits:
418
419
UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)
420
{
421
UInt32 res = 0;
422
do
423
{
424
Range >>= 1;
425
Code -= Range;
426
UInt32 t = 0 - ((UInt32)Code >> 31);
427
Code += Range & t;
428
429
if (Code == Range)
430
Corrupted = true;
431
432
Normalize();
433
res <<= 1;
434
res += t + 1;
435
}
436
while (--numBits);
437
return res;
438
}
439
440
441
The Bit Decoding with Probability Model
442
---------------------------------------
443
444
The task of Bit Probability Model is to estimate probabilities of binary
445
symbols. And then it provides the Range Decoder with that information.
446
The better prediction provides better compression ratio.
447
The Bit Probability Model uses statistical data of previous decoded
448
symbols.
449
450
That estimated probability is presented as 11-bit unsigned integer value
451
that represents the probability of symbol "0".
452
453
#define kNumBitModelTotalBits 11
454
455
Mathematical probabilities can be presented with the following formulas:
456
probability(symbol_0) = prob / 2048.
457
probability(symbol_1) = 1 - Probability(symbol_0) =
458
= 1 - prob / 2048 =
459
= (2048 - prob) / 2048
460
where the "prob" variable contains 11-bit integer probability counter.
461
462
It's recommended to use 16-bit unsigned integer type, to store these 11-bit
463
probability values:
464
465
typedef UInt16 CProb;
466
467
Each probability value must be initialized with value ((1 << 11) / 2),
468
that represents the state, where probabilities of symbols 0 and 1
469
are equal to 0.5:
470
471
#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)
472
473
The INIT_PROBS macro is used to initialize the array of CProb variables:
474
475
#define INIT_PROBS(p) \
476
{ for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }
477
478
479
The DecodeBit() function decodes one bit.
480
The LZMA decoder provides the pointer to CProb variable that contains
481
information about estimated probability for symbol 0 and the Range Decoder
482
updates that CProb variable after decoding. The Range Decoder increases
483
estimated probability of the symbol that was decoded:
484
485
#define kNumMoveBits 5
486
487
unsigned CRangeDecoder::DecodeBit(CProb *prob)
488
{
489
unsigned v = *prob;
490
UInt32 bound = (Range >> kNumBitModelTotalBits) * v;
491
unsigned symbol;
492
if (Code < bound)
493
{
494
v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;
495
Range = bound;
496
symbol = 0;
497
}
498
else
499
{
500
v -= v >> kNumMoveBits;
501
Code -= bound;
502
Range -= bound;
503
symbol = 1;
504
}
505
*prob = (CProb)v;
506
Normalize();
507
return symbol;
508
}
509
510
511
The Binary Tree of bit model counters
512
-------------------------------------
513
514
LZMA uses a tree of Bit model variables to decode symbol that needs
515
several bits for storing. There are two versions of such trees in LZMA:
516
1) the tree that decodes bits from high bit to low bit (the normal scheme).
517
2) the tree that decodes bits from low bit to high bit (the reverse scheme).
518
519
Each binary tree structure supports different size of decoded symbol
520
(the size of binary sequence that contains value of symbol).
521
If that size of decoded symbol is "NumBits" bits, the tree structure
522
uses the array of (2 << NumBits) counters of CProb type.
523
But only ((2 << NumBits) - 1) items are used by encoder and decoder.
524
The first item (the item with index equal to 0) in array is unused.
525
That scheme with unused array's item allows to simplify the code.
526
527
unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)
528
{
529
unsigned m = 1;
530
unsigned symbol = 0;
531
for (unsigned i = 0; i < numBits; i++)
532
{
533
unsigned bit = rc->DecodeBit(&probs[m]);
534
m <<= 1;
535
m += bit;
536
symbol |= (bit << i);
537
}
538
return symbol;
539
}
540
541
template <unsigned NumBits>
542
class CBitTreeDecoder
543
{
544
CProb Probs[(unsigned)1 << NumBits];
545
546
public:
547
548
void Init()
549
{
550
INIT_PROBS(Probs);
551
}
552
553
unsigned Decode(CRangeDecoder *rc)
554
{
555
unsigned m = 1;
556
for (unsigned i = 0; i < NumBits; i++)
557
m = (m << 1) + rc->DecodeBit(&Probs[m]);
558
return m - ((unsigned)1 << NumBits);
559
}
560
561
unsigned ReverseDecode(CRangeDecoder *rc)
562
{
563
return BitTreeReverseDecode(Probs, NumBits, rc);
564
}
565
};
566
567
568
LZ part of LZMA
569
---------------
570
571
LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.
572
573
574
The Literal Decoding
575
--------------------
576
577
The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where
578
each table contains 0x300 CProb values:
579
580
CProb *LitProbs;
581
582
void CreateLiterals()
583
{
584
LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
585
}
586
587
void InitLiterals()
588
{
589
UInt32 num = (UInt32)0x300 << (lc + lp);
590
for (UInt32 i = 0; i < num; i++)
591
LitProbs[i] = PROB_INIT_VAL;
592
}
593
594
To select the table for decoding it uses the context that consists of
595
(lc) high bits from previous literal and (lp) low bits from value that
596
represents current position in outputStream.
597
598
If (State > 7), the Literal Decoder also uses "matchByte" that represents
599
the byte in OutputStream at position the is the DISTANCE bytes before
600
current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair
601
of latest decoded match.
602
603
The following code decodes one literal and puts it to Sliding Window buffer:
604
605
void DecodeLiteral(unsigned state, UInt32 rep0)
606
{
607
unsigned prevByte = 0;
608
if (!OutWindow.IsEmpty())
609
prevByte = OutWindow.GetByte(1);
610
611
unsigned symbol = 1;
612
unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));
613
CProb *probs = &LitProbs[(UInt32)0x300 * litState];
614
615
if (state >= 7)
616
{
617
unsigned matchByte = OutWindow.GetByte(rep0 + 1);
618
do
619
{
620
unsigned matchBit = (matchByte >> 7) & 1;
621
matchByte <<= 1;
622
unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
623
symbol = (symbol << 1) | bit;
624
if (matchBit != bit)
625
break;
626
}
627
while (symbol < 0x100);
628
}
629
while (symbol < 0x100)
630
symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
631
OutWindow.PutByte((Byte)(symbol - 0x100));
632
}
633
634
635
The match length decoding
636
-------------------------
637
638
The match length decoder returns normalized (zero-based value)
639
length of match. That value can be converted to real length of the match
640
with the following code:
641
642
#define kMatchMinLen 2
643
644
matchLen = len + kMatchMinLen;
645
646
The match length decoder can return the values from 0 to 271.
647
And the corresponded real match length values can be in the range
648
from 2 to 273.
649
650
The following scheme is used for the match length encoding:
651
652
Binary encoding Binary Tree structure Zero-based match length
653
sequence (binary + decimal):
654
655
0 xxx LowCoder[posState] xxx
656
1 0 yyy MidCoder[posState] yyy + 8
657
1 1 zzzzzzzz HighCoder zzzzzzzz + 16
658
659
LZMA uses bit model variable "Choice" to decode the first selection bit.
660
661
If the first selection bit is equal to 0, the decoder uses binary tree
662
LowCoder[posState] to decode 3-bit zero-based match length (xxx).
663
664
If the first selection bit is equal to 1, the decoder uses bit model
665
variable "Choice2" to decode the second selection bit.
666
667
If the second selection bit is equal to 0, the decoder uses binary tree
668
MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match
669
length is equal to (yyy + 8).
670
671
If the second selection bit is equal to 1, the decoder uses binary tree
672
HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based
673
match length is equal to (zzzzzzzz + 16).
674
675
LZMA uses "posState" value as context to select the binary tree
676
from LowCoder and MidCoder binary tree arrays:
677
678
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
679
680
The full code of the length decoder:
681
682
class CLenDecoder
683
{
684
CProb Choice;
685
CProb Choice2;
686
CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
687
CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
688
CBitTreeDecoder<8> HighCoder;
689
690
public:
691
692
void Init()
693
{
694
Choice = PROB_INIT_VAL;
695
Choice2 = PROB_INIT_VAL;
696
HighCoder.Init();
697
for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
698
{
699
LowCoder[i].Init();
700
MidCoder[i].Init();
701
}
702
}
703
704
unsigned Decode(CRangeDecoder *rc, unsigned posState)
705
{
706
if (rc->DecodeBit(&Choice) == 0)
707
return LowCoder[posState].Decode(rc);
708
if (rc->DecodeBit(&Choice2) == 0)
709
return 8 + MidCoder[posState].Decode(rc);
710
return 16 + HighCoder.Decode(rc);
711
}
712
};
713
714
The LZMA decoder uses two instances of CLenDecoder class.
715
The first instance is for the matches of "Simple Match" type,
716
and the second instance is for the matches of "Rep Match" type:
717
718
CLenDecoder LenDecoder;
719
CLenDecoder RepLenDecoder;
720
721
722
The match distance decoding
723
---------------------------
724
725
LZMA supports dictionary sizes up to 4 GiB minus 1.
726
The value of match distance (decoded by distance decoder) can be
727
from 1 to 2^32. But the distance value that is equal to 2^32 is used to
728
indicate the "End of stream" marker. So real largest match distance
729
that is used for LZ-window match is (2^32 - 1).
730
731
LZMA uses normalized match length (zero-based length)
732
to calculate the context state "lenState" do decode the distance value:
733
734
#define kNumLenToPosStates 4
735
736
unsigned lenState = len;
737
if (lenState > kNumLenToPosStates - 1)
738
lenState = kNumLenToPosStates - 1;
739
740
The distance decoder returns the "dist" value that is zero-based value
741
of match distance. The real match distance can be calculated with the
742
following code:
743
744
matchDistance = dist + 1;
745
746
The state of the distance decoder and the initialization code:
747
748
#define kEndPosModelIndex 14
749
#define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
750
#define kNumAlignBits 4
751
752
CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
753
CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
754
CBitTreeDecoder<kNumAlignBits> AlignDecoder;
755
756
void InitDist()
757
{
758
for (unsigned i = 0; i < kNumLenToPosStates; i++)
759
PosSlotDecoder[i].Init();
760
AlignDecoder.Init();
761
INIT_PROBS(PosDecoders);
762
}
763
764
At first stage the distance decoder decodes 6-bit "posSlot" value with bit
765
tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different
766
"posSlot" values.
767
768
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
769
770
The encoding scheme for distance value is shown in the following table:
771
772
posSlot (decimal) /
773
zero-based distance (binary)
774
0 0
775
1 1
776
2 10
777
3 11
778
779
4 10 x
780
5 11 x
781
6 10 xx
782
7 11 xx
783
8 10 xxx
784
9 11 xxx
785
10 10 xxxx
786
11 11 xxxx
787
12 10 xxxxx
788
13 11 xxxxx
789
790
14 10 yy zzzz
791
15 11 yy zzzz
792
16 10 yyy zzzz
793
17 11 yyy zzzz
794
...
795
62 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
796
63 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
797
798
where
799
"x ... x" means the sequence of binary symbols encoded with binary tree and
800
"Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13.
801
"y" means direct bit encoded with range coder.
802
"zzzz" means the sequence of four binary symbols encoded with binary
803
tree with "Reverse" scheme, where one common binary tree "AlignDecoder"
804
is used for all posSlot values.
805
806
If (posSlot < 4), the "dist" value is equal to posSlot value.
807
808
If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of
809
the high bits of "dist" value and the number of the low bits.
810
811
If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders.
812
(one separated bit tree decoder per one posSlot value) and "Reverse" scheme.
813
In this implementation we use one CProb array "PosDecoders" that contains
814
all CProb variables for all these bit decoders.
815
816
if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct
817
bits from RangeDecoder and the low 4 bits are decoded with a bit tree
818
decoder "AlignDecoder" with "Reverse" scheme.
819
820
The code to decode zero-based match distance:
821
822
unsigned DecodeDistance(unsigned len)
823
{
824
unsigned lenState = len;
825
if (lenState > kNumLenToPosStates - 1)
826
lenState = kNumLenToPosStates - 1;
827
828
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
829
if (posSlot < 4)
830
return posSlot;
831
832
unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);
833
UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);
834
if (posSlot < kEndPosModelIndex)
835
dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);
836
else
837
{
838
dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;
839
dist += AlignDecoder.ReverseDecode(&RangeDec);
840
}
841
return dist;
842
}
843
844
845
846
LZMA Decoding modes
847
-------------------
848
849
There are 2 types of LZMA streams:
850
851
1) The stream with "End of stream" marker.
852
2) The stream without "End of stream" marker.
853
854
And the LZMA Decoder supports 3 modes of decoding:
855
856
1) The unpack size is undefined. The LZMA decoder stops decoding after
857
getting "End of stream" marker.
858
The input variables for that case:
859
860
markerIsMandatory = true
861
unpackSizeDefined = false
862
unpackSize contains any value
863
864
2) The unpack size is defined and LZMA decoder supports both variants,
865
where the stream can contain "End of stream" marker or the stream is
866
finished without "End of stream" marker. The LZMA decoder must detect
867
any of these situations.
868
The input variables for that case:
869
870
markerIsMandatory = false
871
unpackSizeDefined = true
872
unpackSize contains unpack size
873
874
3) The unpack size is defined and the LZMA stream must contain
875
"End of stream" marker
876
The input variables for that case:
877
878
markerIsMandatory = true
879
unpackSizeDefined = true
880
unpackSize contains unpack size
881
882
883
The main loop of decoder
884
------------------------
885
886
The main loop of LZMA decoder:
887
888
Initialize the LZMA state.
889
loop
890
{
891
// begin of loop
892
Check "end of stream" conditions.
893
Decode Type of MATCH / LITERAL.
894
If it's LITERAL, decode LITERAL value and put the LITERAL to Window.
895
If it's MATCH, decode the length of match and the match distance.
896
Check error conditions, check end of stream conditions and copy
897
the sequence of match bytes from sliding window to current position
898
in window.
899
Go to begin of loop
900
}
901
902
The reference implementation of LZMA decoder uses "unpackSize" variable
903
to keep the number of remaining bytes in output stream. So it reduces
904
"unpackSize" value after each decoded LITERAL or MATCH.
905
906
The following code contains the "end of stream" condition check at the start
907
of the loop:
908
909
if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)
910
if (RangeDec.IsFinishedOK())
911
return LZMA_RES_FINISHED_WITHOUT_MARKER;
912
913
LZMA uses three types of matches:
914
915
1) "Simple Match" - the match with distance value encoded with bit models.
916
917
2) "Rep Match" - the match that uses the distance from distance
918
history table.
919
920
3) "Short Rep Match" - the match of single byte length, that uses the latest
921
distance from distance history table.
922
923
The LZMA decoder keeps the history of latest 4 match distances that were used
924
by decoder. That set of 4 variables contains zero-based match distances and
925
these variables are initialized with zero values:
926
927
UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;
928
929
The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
930
931
#define kNumStates 12
932
#define kNumPosBitsMax 4
933
934
CProb IsMatch[kNumStates << kNumPosBitsMax];
935
CProb IsRep[kNumStates];
936
CProb IsRepG0[kNumStates];
937
CProb IsRepG1[kNumStates];
938
CProb IsRepG2[kNumStates];
939
CProb IsRep0Long[kNumStates << kNumPosBitsMax];
940
941
The decoder uses "state" variable value to select exact variable
942
from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.
943
The "state" variable can get the value from 0 to 11.
944
Initial value for "state" variable is zero:
945
946
unsigned state = 0;
947
948
The "state" variable is updated after each LITERAL or MATCH with one of the
949
following functions:
950
951
unsigned UpdateState_Literal(unsigned state)
952
{
953
if (state < 4) return 0;
954
else if (state < 10) return state - 3;
955
else return state - 6;
956
}
957
unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; }
958
unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; }
959
unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }
960
961
The decoder calculates "state2" variable value to select exact variable from
962
"IsMatch" and "IsRep0Long" arrays:
963
964
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
965
unsigned state2 = (state << kNumPosBitsMax) + posState;
966
967
The decoder uses the following code flow scheme to select exact
968
type of LITERAL or MATCH:
969
970
IsMatch[state2] decode
971
0 - the Literal
972
1 - the Match
973
IsRep[state] decode
974
0 - Simple Match
975
1 - Rep Match
976
IsRepG0[state] decode
977
0 - the distance is rep0
978
IsRep0Long[state2] decode
979
0 - Short Rep Match
980
1 - Rep Match 0
981
1 -
982
IsRepG1[state] decode
983
0 - Rep Match 1
984
1 -
985
IsRepG2[state] decode
986
0 - Rep Match 2
987
1 - Rep Match 3
988
989
990
LITERAL symbol
991
--------------
992
If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.
993
994
At first the LZMA decoder must check that it doesn't exceed
995
specified uncompressed size:
996
997
if (unpackSizeDefined && unpackSize == 0)
998
return LZMA_RES_ERROR;
999
1000
Then it decodes literal value and puts it to sliding window:
1001
1002
DecodeLiteral(state, rep0);
1003
1004
Then the decoder must update the "state" value and "unpackSize" value;
1005
1006
state = UpdateState_Literal(state);
1007
unpackSize--;
1008
1009
Then the decoder must go to the begin of main loop to decode next Match or Literal.
1010
1011
1012
Simple Match
1013
------------
1014
1015
If the value "1" was decoded with IsMatch[state2] decoding,
1016
we have the "Simple Match" type.
1017
1018
The distance history table is updated with the following scheme:
1019
1020
rep3 = rep2;
1021
rep2 = rep1;
1022
rep1 = rep0;
1023
1024
The zero-based length is decoded with "LenDecoder":
1025
1026
len = LenDecoder.Decode(&RangeDec, posState);
1027
1028
The state is update with UpdateState_Match function:
1029
1030
state = UpdateState_Match(state);
1031
1032
and the new "rep0" value is decoded with DecodeDistance:
1033
1034
rep0 = DecodeDistance(len);
1035
1036
That "rep0" will be used as zero-based distance for current match.
1037
1038
If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have
1039
"End of stream" marker, so we can stop decoding and check finishing
1040
condition in Range Decoder:
1041
1042
if (rep0 == 0xFFFFFFFF)
1043
return RangeDec.IsFinishedOK() ?
1044
LZMA_RES_FINISHED_WITH_MARKER :
1045
LZMA_RES_ERROR;
1046
1047
If uncompressed size is defined, LZMA decoder must check that it doesn't
1048
exceed that specified uncompressed size:
1049
1050
if (unpackSizeDefined && unpackSize == 0)
1051
return LZMA_RES_ERROR;
1052
1053
Also the decoder must check that "rep0" value is not larger than dictionary size
1054
and is not larger than the number of already decoded bytes:
1055
1056
if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))
1057
return LZMA_RES_ERROR;
1058
1059
Then the decoder must copy match bytes as described in
1060
"The match symbols copying" section.
1061
1062
1063
Rep Match
1064
---------
1065
1066
If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
1067
we have "Rep Match" type.
1068
1069
At first the LZMA decoder must check that it doesn't exceed
1070
specified uncompressed size:
1071
1072
if (unpackSizeDefined && unpackSize == 0)
1073
return LZMA_RES_ERROR;
1074
1075
Also the decoder must return error, if the LZ window is empty:
1076
1077
if (OutWindow.IsEmpty())
1078
return LZMA_RES_ERROR;
1079
1080
If the match type is "Rep Match", the decoder uses one of the 4 variables of
1081
distance history table to get the value of distance for current match.
1082
And there are 4 corresponding ways of decoding flow.
1083
1084
The decoder updates the distance history with the following scheme
1085
depending from type of match:
1086
1087
- "Rep Match 0" or "Short Rep Match":
1088
; LZMA doesn't update the distance history
1089
1090
- "Rep Match 1":
1091
UInt32 dist = rep1;
1092
rep1 = rep0;
1093
rep0 = dist;
1094
1095
- "Rep Match 2":
1096
UInt32 dist = rep2;
1097
rep2 = rep1;
1098
rep1 = rep0;
1099
rep0 = dist;
1100
1101
- "Rep Match 3":
1102
UInt32 dist = rep3;
1103
rep3 = rep2;
1104
rep2 = rep1;
1105
rep1 = rep0;
1106
rep0 = dist;
1107
1108
Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long",
1109
"IsRepG1", "IsRepG2".
1110
1111
If the subtype is "Short Rep Match", the decoder updates the state, puts
1112
the one byte from window to current position in window and goes to next
1113
MATCH/LITERAL symbol (the begin of main loop):
1114
1115
state = UpdateState_ShortRep(state);
1116
OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
1117
unpackSize--;
1118
continue;
1119
1120
In other cases (Rep Match 0/1/2/3), it decodes the zero-based
1121
length of match with "RepLenDecoder" decoder:
1122
1123
len = RepLenDecoder.Decode(&RangeDec, posState);
1124
1125
Then it updates the state:
1126
1127
state = UpdateState_Rep(state);
1128
1129
Then the decoder must copy match bytes as described in
1130
"The Match symbols copying" section.
1131
1132
1133
The match symbols copying
1134
-------------------------
1135
1136
If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must
1137
copy the sequence of bytes with calculated match distance and match length.
1138
If uncompressed size is defined, LZMA decoder must check that it doesn't
1139
exceed that specified uncompressed size:
1140
1141
len += kMatchMinLen;
1142
bool isError = false;
1143
if (unpackSizeDefined && unpackSize < len)
1144
{
1145
len = (unsigned)unpackSize;
1146
isError = true;
1147
}
1148
OutWindow.CopyMatch(rep0 + 1, len);
1149
unpackSize -= len;
1150
if (isError)
1151
return LZMA_RES_ERROR;
1152
1153
Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL.
1154
1155
1156
1157
NOTES
1158
-----
1159
1160
This specification doesn't describe the variant of decoder implementation
1161
that supports partial decoding. Such partial decoding case can require some
1162
changes in "end of stream" condition checks code. Also such code
1163
can use additional status codes, returned by decoder.
1164
1165
This specification uses C++ code with templates to simplify describing.
1166
The optimized version of LZMA decoder doesn't need templates.
1167
Such optimized version can use just two arrays of CProb variables:
1168
1) The dynamic array of CProb variables allocated for the Literal Decoder.
1169
2) The one common array that contains all other CProb variables.
1170
1171
1172
References:
1173
1174
1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy
1175
from a digitized message, Video & Data Recording Conference,
1176
Southampton, UK, July 24-27, 1979.
1177
1178