google-bigquerysimilaritytrigram

How to perform trigram operations in Google BigQuery?


I do use the pg_trgm module in PostgreSQL to calculate similarity between two strings using trigrams. Particularly I use:

similarity(text, text)

Which returns returns a number that indicates how similar the two arguments are (between 0 and 1).

How can I perform similarity function (or equivalent) on Google BigQuery?


Solution

  • Try below. At least as a blueprint for enhancing

    SELECT text1, text2, similarity FROM 
    JS(
    // input table
    (
      SELECT * FROM 
      (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
      (SELECT 'mikhail' AS text1, 'mike' AS text2),
      (SELECT 'mikhail' AS text1, 'michael' AS text2),
      (SELECT 'mikhail' AS text1, 'javier' AS text2),
      (SELECT 'mikhail' AS text1, 'thomas' AS text2)
    ) ,
    // input columns
    text1, text2,
    // output schema
    "[{name: 'text1', type:'string'},
      {name: 'text2', type:'string'},
      {name: 'similarity', type:'float'}]
    ",
    // function
    "function(r, emit) {
    
      var _extend = function(dst) {
        var sources = Array.prototype.slice.call(arguments, 1);
        for (var i=0; i<sources.length; ++i) {
          var src = sources[i];
          for (var p in src) {
            if (src.hasOwnProperty(p)) dst[p] = src[p];
          }
        }
        return dst;
      };
    
      var Levenshtein = {
        /**
         * Calculate levenshtein distance of the two strings.
         *
         * @param str1 String the first string.
         * @param str2 String the second string.
         * @return Integer the levenshtein distance (0 and above).
         */
        get: function(str1, str2) {
          // base cases
          if (str1 === str2) return 0;
          if (str1.length === 0) return str2.length;
          if (str2.length === 0) return str1.length;
    
          // two rows
          var prevRow  = new Array(str2.length + 1),
              curCol, nextCol, i, j, tmp;
    
          // initialise previous row
          for (i=0; i<prevRow.length; ++i) {
            prevRow[i] = i;
          }
    
          // calculate current row distance from previous row
          for (i=0; i<str1.length; ++i) {
            nextCol = i + 1;
    
            for (j=0; j<str2.length; ++j) {
              curCol = nextCol;
    
              // substution
              nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
              // insertion
              tmp = curCol + 1;
              if (nextCol > tmp) {
                nextCol = tmp;
              }
              // deletion
              tmp = prevRow[j + 1] + 1;
              if (nextCol > tmp) {
                nextCol = tmp;
              }
    
              // copy current col value into previous (in preparation for next iteration)
              prevRow[j] = curCol;
            }
    
            // copy last col value into previous (in preparation for next iteration)
            prevRow[j] = nextCol;
          }
    
          return nextCol;
        }
    
      };
    
      var the_text1;
    
      try {
        the_text1 = decodeURI(r.text1).toLowerCase();
      } catch (ex) {
        the_text1 = r.text1.toLowerCase();
      }
    
      try {
        the_text2 = decodeURI(r.text2).toLowerCase();
      } catch (ex) {
        the_text2 = r.text2.toLowerCase();
      }
    
      emit({text1: the_text1, text2: the_text2,
            similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});
    
      }"
    )
    ORDER BY similarity DESC
    

    This is light modification based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark