c++string hash unordered-map unordered-multimap

Collisions in unordered_multimap when iterating through bucket via local_it

In the code below, I have a number of strings (DNA sequences) that I am storing in a vector. I have a struct, read_tag that I use to identify each string; read_tag.read_id is the string identifier. I take 30 character substrings of each string and use it as a key in an unordered_multimap with the read_tag as a value; the aim is to group strings that share a 30 character sequence. Naturally, identical string will hash to the same value, and end up in the same bucket in the multi map. Offset is used to give the "shift" from index zero of the 30 character tag.

However, when I run this code, iterating through each bucket; I find that there are multiple different sequences in the same bucket. I thought collisions were resolved in unordered_mutlimap, and therefore in a bucket, their should just be one key (string). I understand that collision can happen, but I thought that either chaining, probing etc was implemented in unordered_mutlimap. You should be able to run and check output to see where I am getting confused.

I also std::hash the each key, one in a bucket, and I find that the keys in the "collisions" have a different hash value.

So, it's as if collisions are happening, resulting in values with diff. keys in the same bucket, but in contradiction, the keys hashed to different vals. Is their a way to avoid this, and differentiate values based on keys within buckets? Or do I need to do implement this?

#include <iostream>                                                                                   
#include <string>                                                                                     
#include <unordered_map>                                                                              
#include <vector>                                                                                     
#include <functional>                                                                                 

using namespace std;                                                                                  


int main() {                                                                                          


  vector<string>  reads;                                                                              

  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");

  struct read_tag{                                                                                    
    unsigned int read_id;    // unique string identifier                                                                          
    int offset;              // shift of 30 character substring represented by tag                                                                                                                                            
  };                                                                                                  

  unordered_multimap<string, read_tag> mutation_grouper;                                              

  for(int read_id=0; read_id < reads.size(); read_id++) {                                             
    string read = reads[read_id];                                                                                              
    for(int i=0; i < read.size()-30; i++) {                                                                                                                            
      string sub_read = read.substr(i, 30);                                                           
      read_tag next_tag;                                                                              
      pair<string, read_tag> key_val;                                                                 

      next_tag.read_id = read_id;                                                                     
      next_tag.offset = i;                                                                                                                                             

      key_val.first = sub_read;                                                                       
      key_val.second = next_tag;                                                                      

      mutation_grouper.insert(key_val);                                                               
    }                                                                                                 
  }                                                                                                   

  cout << "mutation_grouper buckets" << endl;                                                         
  std::hash<std::string> hash_er;                                                                     

  for(unsigned int bucket = 0;  bucket < mutation_grouper.bucket_count(); bucket++) {

    cout << "Bucket: " << bucket << endl;                                                    
    for( auto local_it = mutation_grouper.begin(bucket);                                     
     local_it != mutation_grouper.end(bucket); ++local_it) {                             

      cout << local_it->first << " : " << local_it->second.read_id                           
      << ", " << local_it->second.offset << ", " << endl;                                               

      cout << "hash value: " << local_it->first <<"::: " << hash_er(local_it->first) << endl;

     }                                                                                        
     cout << endl << endl;                                                                    
   }                                                                                          
 }

Solution

Yes, your are correct. It is not guaranteed, that two different items land in two different buckets. You know only, that two same items land in the same bucket.

The solution to your problem is simply to avoid buckets. The class unordered_multimap (as well as multimap) has the method equal_range, which gives you the range of elements with a specific key. So you only have to iterate over all keys, and use equal_range to iterate over all values. Sadly there is no method, that allows you to iterate over the keys, so you have to be a little tricky. The following code should give you the desired output:

// iterate through all elements in the multimap
// don't worry, we'll skip a bunch
for (auto it = mutation_grouper.begin(); it != mutation_grouper.end(); )
{
    // Get the range of the current key
    auto range = mutation_grouper.equal_range(it->first);

    // Print all elements of the range
    cout << it->first << endl;
    for (auto local_it = range.first; local_it != range.second; ++local_it)
        std::cout << "   " << local_it->second.read_id << " " << local_it->second.offset << '\n';

    // Step to the end of the range
    it = range.second;
}