IRResolver(STL set重叠区间以及异构查询)

01 Mar 2020 | C/CPP, STL, Algorithm | | ˚C

代码见 https://github.com/jiangzhuti/IRResolver

问题

有一些区间数据:

$ head data.txt 
3640-3610(s, sh)    O-H stretch, free hydroxyl     alcohols, phenols
3500-3200(s,b)      O–H stretch, H–bonded          alcohols, phenols
3400-3250(m)        N–H stretch                    1°, 2° amines, amides
3300-2500(m)        O–H stretch                    carboxylic acids
3330-3270(n, s)     –C≡C–H: C–H stretch            alkynes (terminal)
3100-3000(s)        C–H stretch                    aromatics
3100-3000(m)        =C–H stretch                   alkenes
3000-2850(m)        C–H stretch                    alkanes
2830-2695(m)        H–C=O: C–H stretch             aldehydes
2260-2210(v)        C≡N stretch                    nitriles

要求给出一个数值,输出所有包含该数值的区间信息。

分析

利用STL set构建查询树。需要注意两点:

  1. 支持按区间插入,按值查询(异构查询)
  2. 处理区间重叠

解决

异构查询

这可以说是一个“很有用,却一直没有实现”的功能,即用户提供自定义比较函数,用来在关联容器中查询一个和Key不同类型的参数。好在从c++14开始,std::less做了void特化,能够十分方便的支持这个功能,在之前只能用boost的侵入式容器来实现,当然不嫌麻烦的话也可以手撸一个查找树。cppreference上有一个简单的例子:https://en.cppreference.com/w/cpp/container/set/find 。

利用这一特性,我们可以做出定义:

struct IRFatKey
{
    unsigned min_freq;
    unsigned max_freq;
    std::vector<std::shared_ptr<std::string>> records;

    IRFatKey() = default;

    IRFatKey(const IRFatKey& other) = default;
};
friend bool operator<(const IRFatKey& k1, const IRFatKey& k2)
{
    return k1.max_freq < k2.min_freq;
}
friend bool operator<(const IRFatKey& k1, const unsigned& k2)
{
    return k1.max_freq < k2;
}
friend bool operator<(const unsigned& k1, const IRFatKey& k2)
{
    return k1 < k2.min_freq;
}

std::set<IRFatKey, std::less<>> ir_set;

处理区间重叠

思路就是每一次插入新的区间之前,先find查找一下,没有查到,则直接插入;如果有结果,说明待插入的数据和查找结果之间存在重叠,然后取出查找结果,跟待插入的区间取交集,分为(最大情况)三个区间;然后三个区间依次递归进行上述插入操作:

void IRResolver::do_insert(const IRFatKey& fat_key)
{
    auto iter = ir_set.find(fat_key);
    if (iter != ir_set.end()) {
        auto k(*iter);
        ir_set.erase(iter);
        std::cout << "erased: " << k.min_freq << "-" << k.max_freq << "\n";
        IRFatKey left, intersection, right;
        intersection.min_freq = (fat_key.min_freq < k.min_freq) ?
            k.min_freq : fat_key.min_freq;
        intersection.max_freq = (fat_key.max_freq < k.max_freq) ?
            fat_key.max_freq : k.max_freq;
        for (const auto& r : fat_key.records) {
            intersection.records.push_back(r);
        }
        for (const auto& r : k.records) {
            intersection.records.push_back(r);
        }
        const IRFatKey* l = nullptr;
        const IRFatKey* r = nullptr;
        if (fat_key.min_freq < intersection.min_freq) {
            l = &fat_key;
        } else if (k.min_freq < intersection.min_freq) {
            l = &k;
        }
        if (fat_key.max_freq > intersection.max_freq) {
            r = &fat_key;
        } else if (k.max_freq > intersection.max_freq) {
            r = &k;
        }
        if (l != nullptr) {
            left.min_freq = l->min_freq;
            left.max_freq = intersection.min_freq - 1;
            for (const auto& r : l->records) {
                left.records.push_back(r);
            }
            do_insert(left);
        }
        if (r != nullptr) {
            right.min_freq = intersection.max_freq + 1;
            right.max_freq = r->max_freq;
            for (const auto& r : r->records) {
                right.records.push_back(r);
            }
            do_insert(right);
        }
        do_insert(intersection);

    } else {
        if (fat_key.records.empty()) {
            std::cout << "!!! " << fat_key.min_freq << " " << fat_key.max_freq << std::endl;
        }
        auto ret = ir_set.insert(fat_key);
        if (!ret.second) {
                std::cerr << "fat_key failed!\n";
                std::cerr << "fat_key.min = " << fat_key.min_freq << " "
                          << "fat_key.max = " << fat_key.max_freq << "\n";
                exit(0);
        }
        std::cout << "inserted: " << fat_key.min_freq << "-" << fat_key.max_freq << "\n";
    }    
}


Older · View Archive (18)

图染色(K-Coloring)问题

Newer

[译]HAT-trie,缓存敏感字典树