哈夫曼树应用-压缩文件

发表于 2022-09-26 更新于 2025-10-26 分类于 CS ，数据结构 Disqus：

大家好~时隔好几天，我来给大家讲哈夫曼树和它的应用了！
在此前，我们很少谈及c/cpp语言对字节或者bit的操作，或者说是二进制文件的操作。因为今天我内容涉及到bit以及字节的操作，所以我们不得不介绍一些c/cpp与字节的知识。
我们将从哈夫曼树开始，构造一棵哈夫曼树，补充一些c/cpp的操作，实现一个简陋的无损压缩&&解压的core。

即使操作bit很吃力，请你坚持实现，不管用什么思路，这将会让你对计算机，有更清楚的认识。甚至有时，你在编码时，脑海里会是一小片内存。

哈夫曼树

有关哈夫曼树，网上有太多的讲解。他们的讲解都十分的出色。为了节约时间，请移步如下学习。

哈夫曼树的理论，不难，希望大家能过一遍理论，再去实践。

构造哈夫曼树

在这些讲解中，普遍存在一个致命的缺陷–避重就轻，屏蔽了代码实现的最难部分。

根据身边同学的反馈。

以下是本次构造的难点

选择两个最小结点。
获取哈夫曼编码。

那么在构造这棵树之前，我们需要定义树结点的结构，以及哈夫曼树的一些操作。

我想我们应该从实践出发，因为我们要做一个压缩文件的程序，首要的就是得到哈夫曼编码。

假设我们有一棵哈夫曼树，我们应该怎么样去获取每个结点的编码呢？

请回忆我们之前所学的前序遍历。


 getCode(root,"");
void HuffmanTree::getCode(node *root, std::string s) {
    if(root== nullptr)return;

    if(root->lChild== nullptr&&root->rChild== nullptr){
        root->code=s;
    }
    getCode(root->lChild,s+"0");
    getCode(root->rChild,s+"1");
}

我们只需要一次前序遍历。在访问左节点时，s+“0”，右结点时s+“1”，当到叶子结点时，令该结点的code等于s。

眼下，我们需要构造一棵哈夫曼树。

我们很容易定义如下的结点结构。

class node{
public:
    unsigned char data;
    node* lChild{nullptr};
    node* rChild{nullptr};
    unsigned long long wight;
    int parent{-1};
    std::string code;

    node()=default;
    node(int t_data):data(t_data){}
};

我们还需要一个临时的数组，自下而上的构造哈夫曼树。

但结点是n是，我们只需要一张长度位2xn-1的表就行了，因此我们采用一个指针，动态分配2xn-1空间。

n个结点的构造，我们需要这些结点的权重，一张统计文件过各个字符的数目的哈希表来初始化这些结点，再好不过了。

class HuffmanTree{
public:
    node* huffmanTable; //哈夫曼树构造表
    node* root;//根节点

    HuffmanTree()=default;
    ~HuffmanTree();
    HuffmanTree(std::unordered_map<unsigned char ,unsigned long long>&wightTable );//初始化结点

    void forEachPre() const; //树的前序遍历，可以检测是否成功构造


};

首先是通过哈希表初始化结点

HuffmanTree::HuffmanTree(std::unordered_map<unsigned char, unsigned long long>&wightTable) {
    create(wightTable);
}

void HuffmanTree::create(std::unordered_map<unsigned char, unsigned long long> &wightTable) {
    huffmanTable = new node[2*wightTable.size()-1];

    //insert wight node
    int index{0}; //this is a huffmanTable point.
    for(auto &w:wightTable){
        huffmanTable[index].data = w.first;
        huffmanTable[index].wight = w.second;
        index++;
    }


    //important!
    //find two min wight node.

    for(;index<2*wightTable.size()-1;index++){
        connectNode(index);
    }


    //return root node;
    this->root = &huffmanTable[2*wightTable.size()-2];
    getCode();
    //std::cout<<"huffman tree create successful!"<<std::endl;
}

当初始完毕后，就可以构造哈夫曼树了，上述代码不难，但我们需要实现，选择两个最小结点，这点很重要。

利用贪心算法，在排除存在父结点的结点中挑选，合成结点并插入在huffmanTable中的insert_index位置上。

判断是否存在父结点，可以用parent是否等于-1来判断。

void HuffmanTree::connectNode(int insert_index) {
    //find two min wight node in huffmanTable.
    //lchild and rchild;
    node* l;
    node* r;
    unsigned long long min_temp{0};
    //找到第一个没有父节点的结点
    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            break;
        }
    }
	//贪心1
    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].wight<=min_temp&&huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            l = &huffmanTable[i];
        }
    }
    l->parent = insert_index;
//找到第一个没有父节点的结点
    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            break;
        }
    }
	//贪心2
    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].wight<=min_temp&&huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            r = &huffmanTable[i];
        }
    }
    r->parent=insert_index;

    //insert node
    huffmanTable[insert_index].lChild = l;
    huffmanTable[insert_index].rChild = r;
    huffmanTable[insert_index].wight = l->wight+r->wight;
}

以上操作就可以构造一棵哈夫曼树。

以下是完整代码

huffmanTree.h

//
// Created by zhujiyuan on 2022/9/23.
//

#ifndef HUFFMANZIP_HUFFMANTREE_H
#define HUFFMANZIP_HUFFMANTREE_H
#include <string>
#include <unordered_map>
class node{
public:
    unsigned char data;
    node* lChild{nullptr};
    node* rChild{nullptr};
    unsigned long long wight;
    int parent{-1};
    std::string code;

    node()=default;
    node(int t_data):data(t_data){}
};

class HuffmanTree{
public:
    node* huffmanTable;
    node* root;

    HuffmanTree()=default;
    ~HuffmanTree();
    HuffmanTree(std::unordered_map<unsigned char ,unsigned long long>&wightTable );

    void forEachPre() const;

private:
    void create(std::unordered_map<unsigned char ,unsigned long long>&wightTable );
    //insert connect node
    void connectNode(int insert_index);
    void forEachPre(node *root) const;
    void getCode();
    void getCode(node *root,std::string s);
};

#endif //HUFFMANZIP_HUFFMANTREE_H

huffmanTree.cpp

//
// Created by zhujiyuan on 2022/9/23.
//

#include "huffmanTree.h"
#include <iostream>


HuffmanTree::HuffmanTree(std::unordered_map<unsigned char, unsigned long long>&wightTable) {
    create(wightTable);
}

//create huffman tree with wight table.
//new node
void HuffmanTree::create(std::unordered_map<unsigned char, unsigned long long> &wightTable) {
    huffmanTable = new node[2*wightTable.size()-1];

    //insert wight node
    int index{0}; //this is a huffmanTable point.
    for(auto &w:wightTable){
        huffmanTable[index].data = w.first;
        huffmanTable[index].wight = w.second;
        index++;
    }


    //important!
    //find two min wight node.

    for(;index<2*wightTable.size()-1;index++){
        connectNode(index);
    }


    //return root node;
    this->root = &huffmanTable[2*wightTable.size()-2];
    getCode();
    //std::cout<<"huffman tree create successful!"<<std::endl;
}

void HuffmanTree::connectNode(int insert_index) {
    //find two min wight node in huffmanTable.
    //lchild and rchild;
    node* l;
    node* r;
    unsigned long long min_temp{0};
    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            break;
        }
    }

    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].wight<=min_temp&&huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            l = &huffmanTable[i];
        }
    }
    l->parent = insert_index;

    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            break;
        }
    }

    for(int i=0;i<insert_index;i++){
        if(huffmanTable[i].wight<=min_temp&&huffmanTable[i].parent==-1){
            min_temp = huffmanTable[i].wight;
            r = &huffmanTable[i];
        }
    }
    r->parent=insert_index;

    //insert node
    huffmanTable[insert_index].lChild = l;
    huffmanTable[insert_index].rChild = r;
    huffmanTable[insert_index].wight = l->wight+r->wight;
}

void HuffmanTree::forEachPre(node *root) const {
    if(root== nullptr)return;

    if(root->lChild== nullptr&&root->rChild== nullptr)
        std::cout<<char(root->data)<<" ";
    forEachPre(root->lChild);
    forEachPre(root->rChild);
}

void HuffmanTree::forEachPre() const {
    forEachPre(root);
}

void HuffmanTree::getCode(node *root, std::string s) {
    if(root== nullptr)return;

    if(root->lChild== nullptr&&root->rChild== nullptr){
        root->code=s;
    }
    getCode(root->lChild,s+"0");
    getCode(root->rChild,s+"1");
}

void HuffmanTree::getCode() {
    getCode(this->root,"");
}

HuffmanTree::~HuffmanTree() {
    if(huffmanTable!= nullptr){
        delete[] huffmanTable;
        huffmanTable = nullptr;
    }
}

压缩与解压

压缩难在，操作字节。

我们先不去考虑这个最难的部分。

每个文件的大小都是字节的整数倍，我们可以字节作为结点的数据，统计这个字节出现过多少次，来构造哈夫曼树。

一个字节0-255，所以最多有256个结点。

构造完哈夫曼树之后，开始编码。

逐个字节读取源文件，把这个字节转成哈夫曼编码。

[用一个字符串存储每个字节的哈夫曼编码]{.red}，这样每个8个字符，写入一个字节就行啦。

要注意的是，最后可能不满8个字节，需要在后面补0，补足8位。

这就是压缩的过程啦。

解压的过程，可以通过逐位读取压缩文件

0访问左子树
1访问右子树
叶子结点，把结点的字节，写入解压文件中。

思路不一，以下是这个思路的实现代码。

compress.h

//
// Created by zhujiyuan on 2022/9/25.
//

#ifndef HUFFMANZIP_COMPRESS_H
#define HUFFMANZIP_COMPRESS_H
#include "huffmanTree.h"
#include <unordered_map>
class FileIO{
public:
    unsigned char*buf;
    long buf_size;
    FileIO()=default;
    void readIntoBuf(const char* fileName);
};

class Compress:public FileIO{
public:
    HuffmanTree *h_tree;
    int endBit;//add 0 number
    std::unordered_map<unsigned char,unsigned long long>hash;
    Compress()=default;
    Compress(const char* filename);

    void compressFile();
    void decompressFile();
private:
    void init(const char* filename);
    inline unsigned char strBinToUchar(const std::string&s);
    inline std::string ucharToStrBin(unsigned char s);
};


inline unsigned char Compress::strBinToUchar(const std::string &s) {
    unsigned char sum{0};
    for(const char&a:s){
        sum = sum*2 + (a-'0');
    }
    return sum;
}

inline std::string Compress::ucharToStrBin(unsigned char s){
    std::string ans = "";
    while(s!=0){
        if(s%2!=0){
            ans = '1'+ans;
        }else{
            ans = '0'+ans;
        }
        s/=2;
    }
    while(ans.length()!=8){
        ans = '0'+ans;
    }
    return ans;
}
#endif //HUFFMANZIP_COMPRESS_H

compress.cpp

//
// Created by zhujiyuan on 2022/9/25.
//

#include "compress.h"
#include <unordered_map>
#include <iostream>
#include <string.h>
#include "processBar.h"
void FileIO::readIntoBuf(const char *fileName) {

    FILE * q = fopen(fileName,"rb");
    //get file size
    fseek(q,0,SEEK_END);
    this->buf_size = ftell(q);
    //create file size buf
    this->buf = new unsigned char[buf_size];
    fseek(q,0,SEEK_SET);
    fread(this->buf,1,buf_size,q);
    fclose(q);

}

Compress::Compress(const char *filename) {
    init(filename);
}

void Compress::init(const char *filename) {
    readIntoBuf(filename);

    for(long i{0};i<buf_size;i++){
        hash[buf[i]]++;
    }
    h_tree = new HuffmanTree(hash);
}

void Compress::compressFile() {
    //create find table,key is node's data,value is node's code.
    std::unordered_map<unsigned char,std::string> quickTable;
    for(int i{0};i<hash.size();i++){
        quickTable[this->h_tree->huffmanTable[i].data]=this->h_tree->huffmanTable[i].code;
    }

    FILE *q = fopen("test.hip","wb");
    //start compress!!!
    std::string byteBuf="";

    unsigned char streamBuf[1048576];
    memset(streamBuf,0,sizeof(streamBuf));

    int cnt{0};
    for(long i{0};i<buf_size;i++){
        byteBuf.append(quickTable[buf[i]]);
    }
    //add 0 to bytrBuf
    endBit=0;
    while(byteBuf.length()%8!=0){
        byteBuf.push_back('0');
        endBit++;
    }
    //delete buf
    if(buf!= nullptr){
        delete[]buf;
        buf = nullptr;
    }
    //write into stream buf.
    std::cout<<"start compress file."<<std::endl;
    ProcessBar bar(byteBuf.length());
    for(long i{0};i<byteBuf.length();i+=8){
        streamBuf[cnt++] = strBinToUchar(byteBuf.substr(i,8));
        if(cnt%1048576==0){
            cnt=0;
            fwrite(streamBuf,sizeof(streamBuf),1,q);
        }
        bar.show(8);
    }
    if(cnt!=0){
        fwrite(streamBuf,sizeof(unsigned char)*cnt,1,q);
    }
    fclose(q);
    std::cout<<std::endl<<"compress over!"<<std::endl;
}

void Compress::decompressFile() {
    node* sign = this->h_tree->root;
    unsigned char streamBuf[1048576];
    memset(streamBuf,0,sizeof(streamBuf));
    FILE *q = fopen("test.hip","rb");
    FILE *uq = fopen("test.unhip","wb");
    int cnt{0};

    std::string codeTxt="";
    unsigned char t;

    while(fread(&t,sizeof(unsigned char),1,q)!=0){
        codeTxt.append(ucharToStrBin(t));
    }
    while(endBit!=0){
        codeTxt.pop_back();
        endBit--;
    }
    //std::cout<<codeTxt<<std::endl;
    fclose(q);
    std::cout<<"start decompress file."<<std::endl;
    ProcessBar bar(codeTxt.length());
    for(const char&s:codeTxt){
        if(s=='0'){
            sign=sign->lChild;
        }else{
            sign=sign->rChild;
        }
        if(sign->rChild== nullptr&&sign->lChild== nullptr){
            streamBuf[cnt++]=sign->data;
            sign=this->h_tree->root;
            if(cnt%1048576==0){
                fwrite(streamBuf,sizeof (streamBuf),1,uq);
                cnt=0;
            }
        }
        bar.show();
    }
    if (cnt!=0){
        fwrite(streamBuf,sizeof (unsigned char )*cnt,1,uq);
    }
    fclose(uq);
    std::cout<<std::endl<<"decompress over!"<<std::endl;
}

有关c语言的知识，可以参考如下内容

文件操作函数

完整代码=>github

后续可能会优化压缩代码。逃~