Huffman Text Compression Tool using Tree and Priority Queue
What Is Huffman Coding?
Huffman coding is a powerful algorithm for compressing data without losing information, also called "lossless compression". It replaces frequently recurring characters with shorter binary codes, and rare characters with longer ones, making files and data smaller in size. This technique is used in many modern compression formats.
Why Use Huffman Coding?
Traditional encoding (like ASCII) gives every character the same number of bits (usually 8). But if some characters appear way more often, there's an opportunity for optimisation. Huffman coding finds the most efficient way to assign variable-length codes so the overall data size is minimised.
How Does Huffman Coding Work?
Let’s break the process down:
1. Frequency Analysis
First, count the frequency of each character in the text. More frequent characters are “cheaper” to encode.
Example:
For the string ABRACADABRA:
A: 5
B: 2
R: 2
C: 1
D: 1
2. Initialise Priority Queue
Put each character and its frequency into a priority queue (min-heap). The queue will help efficiently find the lowest frequencies.
3. Build the Huffman Tree
Repeat until the queue has just one tree left:
i) Remove (pop) the two nodes with the lowest frequency.
ii) Combine them into a new node where the frequency is the sum.
iii) Insert this node back into the queue.
iv) The tree grows as you combine nodes, and each merge operation records the structure needed for encoding.
4. Generate Codes with Tree Traversal
Assign binary codes by traversing from the root of the tree:
i) Moving left: add a 0
ii) Moving right: add a 1
This way, every character gets a unique code, and no code is the prefix of another (known as the “prefix property”), which enables clear, unambiguous decoding.
# Worked Example
Let’s revisit ABRACADABRA:
Frequencies
|Char|Freq|
|----|----|
|A|5|
|B|2|
|R|2|
|C|1|
|D|1|
Steps in Tree-Building:
1. Combine C(1) + D(1) =[6]
2. Combine B(2) + R(2) =[7]
3. Merge (C&D) + (B&R) =[8][6][7]
4. Merge A(5) + =[9][8]
Assigning Codes
|Char|Code|
|----|----|
|A|0|
|R|10|
|B|111|
|C|1100|
|D|1101|
Here, codes are shorter for common characters (A) and longer for rarer ones (C, D). Compression is achieved!
Where Are Trees and Priority Queues Used?
i) Tree: The structure for assigning codes and enabling decoding.
ii) Priority Queue: Enforces the merging order, always selecting the smallest weights first to guarantee optimality.
Advantages and Properties
i) Lossless:Original data can be recovered exactly.
- Optimal (for given data): Always produces the smallest possible average code length for the input.
- Prefix Property: Ensures that no code is a prefix of another, so decoding works instantly and unambiguously.
Real-World Applications
i) File compression standards (ZIP, GZIP)
ii) Image formats (PNG)
iii) Transmission protocols, fax compression, MP3 meta-data, and more.
# Example Python Snippet
python
import heapq
from collections import Counter
class Node:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
# Frequency analysis
text = "ABRACADABRA"
freq = Counter(text)
# Priority Queue
queue = [Node(ch, freq[ch]) for ch in freq]
heapq.heapify(queue)
# Huffman tree construction
while len(queue) > 1:
node1 = heapq.heappop(queue)
node2 = heapq.heappop(queue)
merged = Node(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(queue, merged)
This example demonstrates both the priority queue (heapq) and the tree (Node structure).
Conclusion
Huffman coding is a classic and essential algorithm that shows how data structures (trees, priority queues) can work together for real-world applications. It's not just theory you’re looking at the foundation of much of the file compression running every day on devices worldwide.
Comments
Post a Comment