数据结构实验 - Huffman编码

项目说明

本次实验项目是一个简单的Huffman编码工具，作了两种不同的实现。
由于在前面的课程中实现了Huffman树的C++类，所以这个项目实现起来相当容易，对应hufftree.cpp中的代码，但是该方法并没有用到书上的Huffman Table，而是直接利用指针的形式构建出了一颗树，因此没有去完成项目要求的输出Table的终态。
因此，为完成上述的要求，又撰写了一份与书上代码更接近的版本放在hufftable.cpp中，实现了Table的输出。

Huffman Tree

对于链式的Huffman树，我们可以有以下的节点定义：

struct Node {
    char symbol;
    int frequency;
    Node* left;
    Node* right;

    Node(char sym, int freq) : symbol(sym), frequency(freq), left(nullptr), right(nullptr) {}
};

Huffman树的定义：

class HuffmanTree {
private:
    Node* root;
    // Huffman 编码表
    unordered_map<char, string> codes;

	// 构造函数
    void buildCodes(Node* node, string code="") {}

public:
    HuffmanTree(const vector<pair<char, int>>& frequencies) {}

    // encode : 在编码表中查找
    string encode(const string& text) {}

    // decode : 遍历树并解码
    string decode(const string& encoded) {}

    void showCodes() {}

private:
    // 频率比较
    struct CompareFrequency {};
};

接下来，主要是重点分析一些比较重要的功能实现：

树的构造

HuffmanTree(const vector<pair<char, int>>& frequencies) {
    // 利用优先队列构建Huffman树
    priority_queue<Node*, vector<Node*>, CompareFrequency> pq;

    for (const auto& pair : frequencies) {
        pq.push(new Node(pair.first, pair.second));
    }

    while (pq.size() > 1) {
        Node* left = pq.top(); pq.pop();
        Node* right = pq.top(); pq.pop();

        Node* parent = new Node('\0', left->frequency + right->frequency);
        parent->left = left;
        parent->right = right;

        pq.push(parent);
    }

    root = pq.top();
    buildCodes(root);
}

在树的构造中，由于涉及到获取最小的节点，我在这里使用了优先队列可以方便的完成排序的操作，进而可以很容易的完成树的构建，不过使用优先队列需要自己定义一个CompareFrequency类，用于比较两个节点的大小。

编码

可以看到在树的构造函数中，最后一步是调用buildCodes()函数，在codes中存储的是所涉及到各个字符的编码，这样就不必在树中回溯来完成编码。

void buildCodes(Node* node, string code="") {
        if (!node) return;

        if (node->symbol != '\0') {
            codes[node->symbol] = code;
        } else {
            buildCodes(node->left, code + "0");
            buildCodes(node->right, code + "1");
        }
    }

直接利用递归就可以完成编码表的构建啦～

那么编码的实现非常简单，只需要遍历输入的文本，然后根据编码表找到对应的编码，然后拼接起来即可。

string encode(const string& text) {
    string encoded;
    for (char c : text) {
        encoded += codes[c];
    }
    return encoded;
}

解码

解码的实现非常简单，只需要遍历输入的编码，然后根据树的结构找到对应的字符即可。

string decode(const string& encoded) {
    string decoded;
    Node* curr = root;
    for (char bit : encoded) {
        // 0表示左子树，1表示右子树
        if (bit == '0') {
            curr = curr->left;
        } else {
            curr = curr->right;
        }

        if (curr->symbol != '\0') {
            decoded += curr->symbol;
            curr = root;
            // 到叶子节点就返回根节点
        }
    }
    return decoded;
}

主函数

在主函数之前，我们还要编写一个buildFrequencies()函数，用于读取输入的文本，然后统计各个字符出现的频率。
这个函数的算法思路就是构建一个unordered_map，然后遍历输入的文本，如果字符已经出现过，则将其频率加一，否则将其加入到map中。

vector<pair<char, int>> buildFrequencies(const string& str) {
    vector<pair<char, int>> freq;
    unordered_map<char, int> count;

    // Default value for int is 0
    for (char c : str) {
        count[c]++;
    }

    for (const auto& pair : count) {
        freq.push_back(pair);
    }

    return freq;
}

那么接下来我们的主函数就很容易实现了：

int main() {
    string input;
    while(true) {
        cout << "Enter a string (0 to exit): ";
        getline(cin, input);
        if (input == "0") break;

        vector<pair<char, int>> frequencies = buildFrequencies(input);
        HuffmanTree huffman(frequencies);

        cout << "Code Map: " << endl;
        huffman.showCodes();

        string encoded = huffman.encode(input);
        cout << "Encoded text: " << encoded << endl;

        string decoded = huffman.decode(encoded);
        cout << "Decoded text: " << decoded << endl;
    }

    return 0;
}

Huffman Table

Huffman Table的节点和表的定义要稍作修改，但是大体上的接口都可以保持一致，以方便开发。

struct Node {
    char symbol;
    int frequency;
    int parent;
    int left;
    int right;

    Node(char sym, int freq) : 
    	symbol(sym), 
        frequency(freq), 
        parent(-1), 
        left(-1), 
        right(-1) 
    {}
};

class HuffmanTable {
private:
    vector<Node> nodes;
    unordered_map<char, string> codes;

    pair<int, int> getMinTwo() {}

    void buildCodes(Node node, string code="") {}

public:
    HuffmanTable(const vector<pair<char, int>>& frequencies) {}

    // encode : Look up in the codes table
    string encode(const string& text) {}

    // decode : Traverse the tree and decode the encoded text
    string decode(const string& encoded) {}

    void showTable() {}

    void showCodes() {}
};

树的构造

树的构造和之前一样，不过这里需要使用一个辅助函数getMinTwo()来获取最小的两个节点。

pair<int, int> getMinTwo() {
    // min1是最小，min2是第二小
    int min1 = INT_MAX, min2 = INT_MAX;
    int min1_index = -1, min2_index = -1;
    for (int i = 0; i < nodes.size(); i++) {
        if (nodes[i].frequency < min1 && nodes[i].parent == -1) {
            min2 = min1;
            min2_index = min1_index;
            min1 = nodes[i].frequency;
            min1_index = i;
        } else if (nodes[i].frequency < min2 && nodes[i].parent == -1) {
            min2 = nodes[i].frequency;
            min2_index = i;
        }
    }
    return make_pair(min1_index, min2_index);
}

构造函数如下：

HuffmanTable(const vector<pair<char, int>>& frequencies) {
    int n = frequencies.size();
    for (int i = 0; i < n; i++) {
        nodes.push_back(Node(frequencies[i].first, frequencies[i].second));
    }
    for(int i=0; i<n-1; i++) {
        pair<int, int> minTwo = getMinTwo();
        Node parent = Node(
                        '\0', 
                        nodes[minTwo.first].frequency + nodes[minTwo.second].frequency
                    );
        parent.left = minTwo.first;
        parent.right = minTwo.second;
        nodes.push_back(parent);
        nodes[minTwo.first].parent = nodes.size()-1;
        nodes[minTwo.second].parent = nodes.size()-1;
    }
    buildCodes(nodes[nodes.size()-1]);
}

核心的改变就是parent left right全都变成了int类型，需要一些改动。所有原先的node->left node->right node->parent都变成了nodes[nodes[i].right] nodes[node[i].parent]。根节点则是nodes[nodes.size()-1]（最后一个节点）。

所以除开这些改动，Huffman Table的实现和之前一样。所以对于编码解码等操作只要按照上述修改即可。

测试程序

我们进行以下测试：

aaabbc
Hello World! Good morning!
你好世界！你好！

我们运行程序可以得到以下输出：

Enter a string (0 to exit): aaabbc          
Frequencies: 
c: 1
b: 2
a: 3
Huffman Table: 
c 1 3 -1 -1
b 2 3 -1 -1
a 3 4 -1 -1
# 3 4 0 1
# 6 -1 2 3
Code Map: 
b: 11
c: 10
a: 0
Encoded text: 000111110
Decoded text: aaabbc

我们可以看到，程序正确的统计了各个字符出现的频率，并构建了Huffman树，然后正确地编码和解码了输入的文本。

样例二：

Enter a string (0 to exit): Hello World! Good morning!
Frequencies: 
g: 1
o: 5
!: 2
H: 1
e: 1
r: 2
l: 3
 : 3
n: 2
W: 1
G: 1
d: 2
m: 1
i: 1
Huffman Table: 
g 1 14 -1 -1
o 5 24 -1 -1
! 2 17 -1 -1
H 1 14 -1 -1
e 1 15 -1 -1
r 2 18 -1 -1
l 3 21 -1 -1
  3 21 -1 -1
n 2 18 -1 -1
W 1 15 -1 -1
G 1 16 -1 -1
d 2 19 -1 -1
m 1 16 -1 -1
i 1 17 -1 -1
# 2 19 0 3
# 2 20 4 9
# 2 20 10 12
# 3 22 13 2
# 4 22 5 8
# 4 23 11 14
# 4 23 15 16
# 6 24 6 7
# 7 25 17 18
# 8 25 19 20
# 11 26 1 21
# 15 26 22 23
# 26 -1 24 25
Code Map: 
m: 11111
o: 00
!: 1001
l: 010
 : 011
n: 1011
i: 1000
r: 1010
H: 11011
e: 11100
G: 11110
d: 1100
W: 11101
g: 11010
Encoded text: 1101111100010010000111110100101001011001001011111100000110001111111001010101110001011110101001
Decoded text: Hello World! Good morning!

可以看到我们的程序可以正确的处理大小写字母，空格与标点符号。

样例三：

Enter a string (0 to exit): 你好世界！你好！
Frequencies: 
�: 2
�: 2
�: 2
�: 1
�: 1
�: 1
�: 1
�: 2
�: 2
�: 2
�: 1
�: 4
�: 3
Huffman Table: 
� 2 15 -1 -1
� 2 16 -1 -1
� 2 16 -1 -1
� 1 13 -1 -1
� 1 13 -1 -1
� 1 14 -1 -1
� 1 14 -1 -1
� 2 17 -1 -1
� 2 17 -1 -1
� 2 18 -1 -1
� 1 15 -1 -1
� 4 20 -1 -1
� 3 19 -1 -1
# 2 18 3 4
# 2 19 5 6
# 3 20 10 0
# 4 21 1 2
# 4 21 7 8
# 4 22 9 13
# 5 22 14 12
# 7 23 15 11
# 8 23 16 17
# 9 24 18 19
# 15 24 20 21
# 24 -1 22 23
Code Map: 
�: 1111
�: 1110
�: 1100
�: 1001
�: 101
�: 1000
�: 011
�: 0101
�: 0100
�: 0011
�: 1101
�: 0010
�: 000
Encoded text: 011101000111111101010110101100001000010001111001101100101110100011111110101110011011001
Decoded text: 你好世界！你好！

可以看到向中文这样的多字节编码，我们的程序由于每次处理的都是4-bit的char，最后拼接出来的结果仍然会是正确的。