Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
sagemath
GitHub Repository: sagemath/sagelib
Path: blob/master/sage/coding/source_coding/huffman.py
4038 views
1
r"""
2
Huffman Encoding
3
4
This module implements functionalities relating to Huffman encoding and
5
decoding.
6
7
AUTHOR:
8
9
- Nathann Cohen (2010-05): initial version.
10
11
12
Classes and functions
13
=====================
14
"""
15
16
###########################################################################
17
# Copyright (c) 2010 Nathann Cohen <[email protected]>
18
#
19
# This program is free software; you can redistribute it and/or modify
20
# it under the terms of the GNU General Public License as published by
21
# the Free Software Foundation; either version 2 of the License, or
22
# (at your option) any later version.
23
#
24
# This program is distributed in the hope that it will be useful,
25
# but WITHOUT ANY WARRANTY; without even the implied warranty of
26
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
27
# GNU General Public License for more details.
28
#
29
# http://www.gnu.org/licenses/
30
###########################################################################
31
32
from sage.structure.sage_object import SageObject
33
34
###########################################################################
35
#
36
# Helper functions
37
#
38
###########################################################################
39
40
def frequency_table(string):
41
r"""
42
Return the frequency table corresponding to the given string.
43
44
INPUT:
45
46
- ``string`` -- a string of symbols over some alphabet.
47
48
OUTPUT:
49
50
- A table of frequency of each unique symbol in ``string``. If ``string``
51
is an empty string, return an empty table.
52
53
EXAMPLES:
54
55
The frequency table of a non-empty string::
56
57
sage: from sage.coding.source_coding.huffman import frequency_table
58
sage: str = "Stop counting my characters!"
59
sage: T = sorted(frequency_table(str).items())
60
sage: for symbol, code in T:
61
... print symbol, code
62
...
63
3
64
! 1
65
S 1
66
a 2
67
c 3
68
e 1
69
g 1
70
h 1
71
i 1
72
m 1
73
n 2
74
o 2
75
p 1
76
r 2
77
s 1
78
t 3
79
u 1
80
y 1
81
82
The frequency of an empty string::
83
84
sage: frequency_table("")
85
{}
86
"""
87
d = {}
88
for s in string:
89
d[s] = d.get(s, 0) + 1
90
return d
91
92
class Huffman(SageObject):
93
r"""
94
This class implements the basic functionalities of Huffman codes.
95
96
It can build a Huffman code from a given string, or from the information
97
of a dictionary associating to each key (the elements of the alphabet) a
98
weight (most of the time, a probability value or a number of occurrences).
99
100
INPUT:
101
102
- ``source`` -- can be either
103
104
- A string from which the Huffman encoding should be created.
105
106
- A dictionary that associates to each symbol of an alphabet a numeric
107
value. If we consider the frequency of each alphabetic symbol, then
108
``source`` is considered as the frequency table of the alphabet with
109
each numeric (non-negative integer) value being the number of
110
occurrences of a symbol. The numeric values can also represent weights
111
of the symbols. In that case, the numeric values are not necessarily
112
integers, but can be real numbers.
113
114
In order to construct a Huffman code for an alphabet, we use exactly one of
115
the following methods:
116
117
#. Let ``source`` be a string of symbols over an alphabet and feed
118
``source`` to the constructor of this class. Based on the input string, a
119
frequency table is constructed that contains the frequency of each unique
120
symbol in ``source``. The alphabet in question is then all the unique
121
symbols in ``source``. A significant implication of this is that any
122
subsequent string that we want to encode must contain only symbols that
123
can be found in ``source``.
124
125
#. Let ``source`` be the frequency table of an alphabet. We can feed this
126
table to the constructor of this class. The table ``source`` can be a
127
table of frequencies or a table of weights.
128
129
Examples::
130
131
sage: from sage.coding.source_coding.huffman import Huffman, frequency_table
132
sage: h1 = Huffman("There once was a french fry")
133
sage: for letter, code in h1.encoding_table().iteritems():
134
... print "'"+ letter + "' : " + code
135
'a' : 0111
136
' ' : 00
137
'c' : 1010
138
'e' : 100
139
'f' : 1011
140
'h' : 1100
141
'o' : 11100
142
'n' : 1101
143
's' : 11101
144
'r' : 010
145
'T' : 11110
146
'w' : 11111
147
'y' : 0110
148
149
We can obtain the same result by "training" the Huffman code with the
150
following table of frequency::
151
152
sage: ft = frequency_table("There once was a french fry"); ft
153
{'a': 2, ' ': 5, 'c': 2, 'e': 4, 'f': 2, 'h': 2, 'o': 1, 'n': 2, 's': 1, 'r': 3, 'T': 1, 'w': 1, 'y': 1}
154
sage: h2 = Huffman(ft)
155
156
Once ``h1`` has been trained, and hence possesses an encoding table,
157
it is possible to obtain the Huffman encoding of any string
158
(possibly the same) using this code::
159
160
sage: encoded = h1.encode("There once was a french fry"); encoded
161
'11110110010001010000111001101101010000111110111111010001110010110101001101101011000010110100110'
162
163
We can decode the above encoded string in the following way::
164
165
sage: h1.decode(encoded)
166
'There once was a french fry'
167
168
Obviously, if we try to decode a string using a Huffman instance which
169
has been trained on a different sample (and hence has a different encoding
170
table), we are likely to get some random-looking string::
171
172
sage: h3 = Huffman("There once were two french fries")
173
sage: h3.decode(encoded)
174
' wehnefetrhft ne ewrowrirTc'
175
176
This does not look like our original string.
177
178
Instead of using frequency, we can assign weights to each alphabetic
179
symbol::
180
181
sage: from sage.coding.source_coding.huffman import Huffman
182
sage: T = {"a":45, "b":13, "c":12, "d":16, "e":9, "f":5}
183
sage: H = Huffman(T)
184
sage: L = ["deaf", "bead", "fab", "bee"]
185
sage: E = []
186
sage: for e in L:
187
... E.append(H.encode(e))
188
... print E[-1]
189
...
190
111110101100
191
10111010111
192
11000101
193
10111011101
194
sage: D = []
195
sage: for e in E:
196
... D.append(H.decode(e))
197
... print D[-1]
198
...
199
deaf
200
bead
201
fab
202
bee
203
sage: D == L
204
True
205
"""
206
207
def __init__(self, source):
208
r"""
209
Constructor for Huffman.
210
211
See the docstring of this class for full documentation.
212
213
EXAMPLES::
214
215
sage: from sage.coding.source_coding.huffman import Huffman
216
sage: str = "Sage is my most favorite general purpose computer algebra system"
217
sage: h = Huffman(str)
218
219
TESTS:
220
221
Feeding anything else than a string or a dictionary::
222
223
sage: Huffman(Graph())
224
Traceback (most recent call last):
225
...
226
ValueError: Input must be either a string or a dictionary.
227
"""
228
229
# alphabetic symbol to Huffman encoding translation table
230
self._character_to_code = []
231
# Huffman binary tree
232
self._tree = None
233
# index of each alphabetic symbol
234
self._index = None
235
236
if isinstance(source,basestring):
237
self._build_code(frequency_table(source))
238
elif isinstance(source, dict):
239
self._build_code(source)
240
else:
241
raise ValueError("Input must be either a string or a dictionary.")
242
243
def _build_code_from_tree(self, tree, d, prefix):
244
r"""
245
Builds the Huffman code corresponding to a given tree and prefix.
246
247
INPUT:
248
249
- ``tree`` -- integer, or list of size `2`
250
251
- ``d`` -- the dictionary to fill
252
253
- ``prefix`` (string) -- binary string which is the prefix
254
of any element of the tree
255
256
EXAMPLES::
257
258
sage: from sage.coding.source_coding.huffman import Huffman
259
sage: str = "Sage is my most favorite general purpose computer algebra system"
260
sage: h = Huffman(str)
261
sage: d = {}
262
sage: h._build_code_from_tree(h._tree, d, prefix="")
263
"""
264
# This is really a recursive construction of a Huffman code. By
265
# feeding this class a sufficiently large alphabet, it is possible to
266
# exceed the maximum recursion depth and hence result in a RuntimeError.
267
try:
268
self._build_code_from_tree(tree[0],
269
d,
270
prefix="".join([prefix, "0"]))
271
self._build_code_from_tree(tree[1],
272
d,
273
prefix="".join([prefix, "1"]))
274
except TypeError:
275
d[tree] = prefix
276
277
def _build_code(self, dic):
278
r"""
279
Constructs a Huffman code corresponding to an alphabet with the given
280
weight table.
281
282
INPUT:
283
284
- ``dic`` -- a dictionary that associates to each symbol of an alphabet
285
a numeric value. If we consider the frequency of each alphabetic
286
symbol, then ``dic`` is considered as the frequency table of the
287
alphabet with each numeric (non-negative integer) value being the
288
number of occurrences of a symbol. The numeric values can also
289
represent weights of the symbols. In that case, the numeric values
290
are not necessarily integers, but can be real numbers. In general,
291
we refer to ``dic`` as a weight table.
292
293
EXAMPLE::
294
295
sage: from sage.coding.source_coding.huffman import Huffman, frequency_table
296
sage: str = "Sage is my most favorite general purpose computer algebra system"
297
sage: h = Huffman(str)
298
sage: d = {}
299
sage: h._build_code(frequency_table(str))
300
"""
301
from heapq import heappush, heappop
302
heap = []
303
# Each alphabetic symbol is now represented by an element with
304
# weight w and index i.
305
for i, (s, w) in enumerate(dic.items()):
306
heappush(heap, (w, i))
307
for i in range(1, len(dic)):
308
weight_a, node_a = heappop(heap)
309
weight_b, node_b = heappop(heap)
310
heappush(heap, (weight_a + weight_b, [node_a, node_b]))
311
# dictionary of symbol to Huffman encoding
312
d = {}
313
self._tree = heap[0][1]
314
# Build the binary tree of a Huffman code, where the root of the tree
315
# is associated with the empty string.
316
self._build_code_from_tree(self._tree, d, prefix="")
317
self._index = dict((i, s) for i, (s, w) in enumerate(dic.items()))
318
self._character_to_code = dict(
319
(s, d[i]) for i, (s, w) in enumerate(dic.items()))
320
321
def encode(self, string):
322
r"""
323
Encode the given string based on the current encoding table.
324
325
INPUT:
326
327
- ``string`` -- a string of symbols over an alphabet.
328
329
OUTPUT:
330
331
- A Huffman encoding of ``string``.
332
333
EXAMPLES:
334
335
This is how a string is encoded and then decoded::
336
337
sage: from sage.coding.source_coding.huffman import Huffman
338
sage: str = "Sage is my most favorite general purpose computer algebra system"
339
sage: h = Huffman(str)
340
sage: encoded = h.encode(str); encoded
341
'00000110100010101011000011101010011100101010011011011100111101110010110100001011011111000001110101010001010110011010111111011001110100101000111110010011011100101011100000110001100101000101110101111101110110011000101011000111111101101111010010111001110100011'
342
sage: h.decode(encoded)
343
'Sage is my most favorite general purpose computer algebra system'
344
"""
345
if self._character_to_code:
346
return "".join(map(lambda x: self._character_to_code[x], string))
347
348
def decode(self, string):
349
r"""
350
Decode the given string using the current encoding table.
351
352
INPUT:
353
354
- ``string`` -- a string of Huffman encodings.
355
356
OUTPUT:
357
358
- The Huffman decoding of ``string``.
359
360
EXAMPLES:
361
362
This is how a string is encoded and then decoded::
363
364
sage: from sage.coding.source_coding.huffman import Huffman
365
sage: str = "Sage is my most favorite general purpose computer algebra system"
366
sage: h = Huffman(str)
367
sage: encoded = h.encode(str); encoded
368
'00000110100010101011000011101010011100101010011011011100111101110010110100001011011111000001110101010001010110011010111111011001110100101000111110010011011100101011100000110001100101000101110101111101110110011000101011000111111101101111010010111001110100011'
369
sage: h.decode(encoded)
370
'Sage is my most favorite general purpose computer algebra system'
371
372
TESTS:
373
374
Of course, the string one tries to decode has to be a binary one. If
375
not, an exception is raised::
376
377
sage: h.decode('I clearly am not a binary string')
378
Traceback (most recent call last):
379
...
380
ValueError: Input must be a binary string.
381
"""
382
# This traverses the whole Huffman binary tree in order to work out
383
# the symbol represented by a stream of binaries. This method of
384
# decoding is really slow. A faster method is needed.
385
# TODO: faster decoding implementation
386
chars = []
387
tree = self._tree
388
index = self._index
389
for i in string:
390
if i == "0":
391
tree = tree[0]
392
elif i == "1":
393
tree = tree[1]
394
else:
395
raise ValueError("Input must be a binary string.")
396
if not isinstance(tree, list):
397
chars.append(index[tree])
398
tree = self._tree
399
return "".join(chars)
400
401
def encoding_table(self):
402
r"""
403
Returns the current encoding table.
404
405
INPUT:
406
407
- None.
408
409
OUTPUT:
410
411
- A dictionary associating an alphabetic symbol to a Huffman encoding.
412
413
EXAMPLES::
414
415
sage: from sage.coding.source_coding.huffman import Huffman
416
sage: str = "Sage is my most favorite general purpose computer algebra system"
417
sage: h = Huffman(str)
418
sage: T = sorted(h.encoding_table().items())
419
sage: for symbol, code in T:
420
... print symbol, code
421
...
422
101
423
S 00000
424
a 1101
425
b 110001
426
c 110000
427
e 010
428
f 110010
429
g 0001
430
i 10000
431
l 10011
432
m 0011
433
n 110011
434
o 0110
435
p 0010
436
r 1111
437
s 1110
438
t 0111
439
u 10001
440
v 00001
441
y 10010
442
"""
443
return self._character_to_code.copy()
444
445
def tree(self):
446
r"""
447
Returns the Huffman tree corresponding to the current encoding.
448
449
INPUT:
450
451
- None.
452
453
OUTPUT:
454
455
- The binary tree representing a Huffman code.
456
457
EXAMPLES::
458
459
sage: from sage.coding.source_coding.huffman import Huffman
460
sage: str = "Sage is my most favorite general purpose computer algebra system"
461
sage: h = Huffman(str)
462
sage: T = h.tree(); T
463
Digraph on 39 vertices
464
sage: T.show(figsize=[20,20])
465
<BLANKLINE>
466
"""
467
from sage.graphs.digraph import DiGraph
468
g = DiGraph()
469
g.add_edges(self._generate_edges(self._tree))
470
return g
471
472
def _generate_edges(self, tree, parent="", bit=""):
473
"""
474
Generate the edges of the given Huffman tree.
475
476
INPUT:
477
478
- ``tree`` -- a Huffman binary tree.
479
480
- ``parent`` -- (default: empty string) a parent vertex with exactly
481
two children.
482
483
- ``bit`` -- (default: empty string) the bit signifying either the
484
left or right branch. The bit "0" denotes the left branch and "1"
485
denotes the right branch.
486
487
OUTPUT:
488
489
- An edge list of the Huffman binary tree.
490
491
EXAMPLES::
492
493
sage: from sage.coding.source_coding.huffman import Huffman
494
sage: H = Huffman("Sage")
495
sage: T = H.tree()
496
sage: T.edges(labels=None)
497
[('0', 'S: 01'), ('0', 'a: 00'), ('1', 'e: 10'), ('1', 'g: 11'), ('root', '0'), ('root', '1')]
498
"""
499
if parent == "":
500
u = "root"
501
else:
502
u = parent
503
s = "".join([parent, bit])
504
try:
505
left = self._generate_edges(tree[0], parent=s, bit="0")
506
right = self._generate_edges(tree[1], parent=s, bit="1")
507
L = [(u, s)] if s != "" else []
508
return left + right + L
509
except TypeError:
510
return [(u, "".join([self.decode(s), ": ", s]))]
511
512