Huffman Text Compression Tool using Tree and Priority Queue

What Is Huffman Coding?



Huffman coding is a powerful algorithm for compressing data without losing information, also called "lossless compression". It replaces frequently recurring characters with shorter binary codes, and rare characters with longer ones, making files and data smaller in size. This technique is used in many modern compression formats.


Why Use Huffman Coding?



Traditional encoding (like ASCII) gives every character the same number of bits (usually 8). But if some characters appear way more often, there's an opportunity for optimisation. Huffman coding finds the most efficient way to assign variable-length codes so the overall data size is minimised.


How Does Huffman Coding Work?

Let’s break the process down:



1. Frequency Analysis

First, count the frequency of each character in the text. More frequent characters are “cheaper” to encode.

Example:

For the string ABRACADABRA:  

A: 5

B: 2

R: 2

C: 1

D: 1


2. Initialise Priority Queue

Put each character and its frequency into a priority queue (min-heap). The queue will help efficiently find the lowest frequencies.


3. Build the Huffman Tree

Repeat until the queue has just one tree left:

i) Remove (pop) the two nodes with the lowest frequency.

ii) Combine them into a new node where the frequency is the sum.

iii) Insert this node back into the queue.

iv) The tree grows as you combine nodes, and each merge operation records the structure needed for encoding.


4. Generate Codes with Tree Traversal

Assign binary codes by traversing from the root of the tree:

i) Moving left: add a 0

ii) Moving right: add a 1

This way, every character gets a unique code, and no code is the prefix of another (known as the “prefix property”), which enables clear, unambiguous decoding.


# Worked Example

Let’s revisit ABRACADABRA:

Frequencies

|Char|Freq|

|----|----|

|A|5|

|B|2|

|R|2|

|C|1|

|D|1|


Steps in Tree-Building:

1. Combine C(1) + D(1) =[6]

2. Combine B(2) + R(2) =[7]

3. Merge (C&D) + (B&R) =[8][6][7]

4. Merge A(5) + =[9][8]


Assigning Codes

|Char|Code|

|----|----|

|A|0|

|R|10|

|B|111|

|C|1100|

|D|1101|

Here, codes are shorter for common characters (A) and longer for rarer ones (C, D). Compression is achieved!


Where Are Trees and Priority Queues Used?



i) Tree: The structure for assigning codes and enabling decoding.

ii) Priority Queue: Enforces the merging order, always selecting the smallest weights first to guarantee optimality.

Advantages and Properties

i) Lossless:Original data can be recovered exactly.

  • Optimal (for given data): Always produces the smallest possible average code length for the input.
  • Prefix Property: Ensures that no code is a prefix of another, so decoding works instantly and unambiguously.


Real-World Applications


i) File compression standards (ZIP, GZIP)

ii) Image formats (PNG)

iii) Transmission protocols, fax compression, MP3 meta-data, and more.


# Example Python Snippet

python

import heapq

from collections import Counter


class Node:

    def __init__(self, char, freq):

        self.char = char

        self.freq = freq

        self.left = None

        self.right = None

    def __lt__(self, other):

        return self.freq < other.freq

# Frequency analysis

text = "ABRACADABRA"

freq = Counter(text)

# Priority Queue

queue = [Node(ch, freq[ch]) for ch in freq]

heapq.heapify(queue)

# Huffman tree construction

while len(queue) > 1:

    node1 = heapq.heappop(queue)

    node2 = heapq.heappop(queue)

    merged = Node(None, node1.freq + node2.freq)

    merged.left = node1

    merged.right = node2

    heapq.heappush(queue, merged)

This example demonstrates both the priority queue (heapq) and the tree (Node structure).


Conclusion

Huffman coding is a classic and essential algorithm that shows how data structures (trees, priority queues) can work together for real-world applications. It's not just theory you’re looking at the foundation of much of the file compression running every day on devices worldwide.

Comments

Popular posts from this blog