stringalgorithmdata-structurestrie

How to find all words that "contain" or "unscramble" an input string using a Trie?


Here is a trie made to work on alphabets/scripts from any language.

class TrieNode {
  constructor(key) {
    // the "key" value will be the character in sequence
    this.key = key;

    // we keep a reference to parent
    this.parent = null;

    // we have hash of children
    this.children = {};

    // check to see if the node is at the end
    this.end = false;
  }

  getWord() {
    let output = [];
    let node = this;

    while (node !== null) {
      output.unshift(node.key)
      node = node.parent
    }

    return output.join('')
  }
}

class Trie {
  constructor() {
    this.base = new TrieNode(null)
  }

  insert(word) {
    let node = this.base

    const points = Array.from(word)

    for (const i in points) {
      const point = points[i]
      if (!node.children[point]) {
        const child = node.children[point] = new TrieNode(point)
        child.parent = node
      }

      node = node.children[point]

      if (i == word.length - 1) {
        node.end = true
      }
    }
  }

  contains(word) {
    let node = this.base

    const points = Array.from(word)

    for (const i in points) {
      const point = points[i]

      if (node.children[point]) {
        node = node.children[point]
      } else {
        return false
      }
    }

    return node.end;
  }

  find(prefix) {
    let node = this.base
    let output = []

    const points = Array.from(prefix)

    for (const i in points) {
      const point = points[i]

      // make sure prefix actually has words
      if (node.children[point]) {
        node = node.children[point]
      } else {
        // there's none. just return it.
        return output
      }
    }

    const stack = [node]
    while (stack.length) {
      node = stack.shift()
      // base case, if node is at a word, push to output
      if (node.end) {
        output.unshift(node.getWord())
      }

      // iterate through each children, call recursive findAllWords
      for (var child in node.children) {
        stack.push(node.children[child])
      }
    }

    return output
  }
}

After asking How to efficiently store 1 million words and query them by starts_with, contains, or ends_with? and getting some answers, I am wondering how the "contains" and "unscrambles" parts can be implemented as a trie. The prefix ("starts with") search is easily handled by the trie.

const fs = require('fs')
const Trie = require('./Trie')

const words = fs.readFileSync('tmp/scrabble.csv', 'utf-8')
  .trim()
  .split(/\n+/)
  .map(x => x.trim())

const trie = new Trie()

words.forEach(word => trie.insert(word))

console.log(trie.find('zy'))
[
  'zygodactylous', 'zygomorphies', 'zygapophysis',
  'zygapophyses',  'zygomorphic',  'zymologies',
  'zygospores',    'zygosities',   'zygomorphy',
  'zygodactyl',    'zymurgies',    'zymograms',
  'zymogenes',     'zygotenes',    'zygospore',
  'zygomatic',     'zyzzyvas',     'zymosans',
  'zymology',      'zymogram',     'zymogens',
  'zymogene',      'zygotene',     'zygosity',
  'zygomata',      'zyzzyva',      'zymurgy',
  'zymotic',       'zymosis',      'zymoses',
  'zymosan',       'zymogen',      'zymases',
  'zygotic',       'zygotes',      'zygosis',
  'zygoses',       'zygomas',      'zydecos',
  'zymase',        'zygote',       'zygose',
  'zygoma',        'zygoid',       'zydeco',
  'zymes',         'zyme'
]

Here is the tmp/scrabble.csv I used.

The question is, can a trie be used to find all the words which "contain" some input string, or find all the words which "unscramble" the input string? I am curious how to accomplish this with a Trie. If a trie cannot do this efficiently, then knowing why not, and potentially what I should be looking at instead, will be a great answer as well.

From my initial thought-attempts at solving "contains", it seems I would have to create a trie which maps all possible combinations of substrings to the final word, but that seems like it will be an explosion of memory, so not sure how to better reason about this.

For "unscrambles", where you put in "caldku" and it finds "duck", amongst other possible words, I think possibly, similar to the linked answer to the other question, maybe sorting "duck" into "cdku", and then storing that in the trie, and then sorting the input to "acdklu", and then searching for "contains" using the previous algorithm, but hmm, no that will break in a few cases. So maybe a trie is not the right approach for these two problems? If it is, roughly how would you do it (you don't need to provide a full implementation, unless you'd like and it's straightforward enough).


Solution

  • Here is an answer to unscrambles that ALSO handles pagination. It actually returns [words, startOfNextPage]. So to find the next page you call it starting at startOfNextPage.

    The idea is to normalize words by sorting the letters, and then storing the word in the trie once. When searching we can dig into the tree both for the next letter we want, and any letter before it (because the letter we want may come after). So we also store a lookup to know what letters might be found, so that we can break out early without having to look at a lot of bad options. And, of course, we include counts so that we can calculate how much of the trie we have explored and/or skipped over.

    class Trie {
        constructor(depth=0) {
            this.depth = depth;
            this.children = {};
            this.words = [];
            this.count = 0;
            this.isSorted = false;
            this.hasChildWith = {};
        }
    
        insert (word) {
            if (word != undefined) {
                this._insert(Array.from(word).sort(), word);
            }
        }
    
        _insert (key, word) {
            this.count++;
            this.isSorted = false;
    
            if (key.length == this.depth) {
                this.words.push(word);
            }
            else {
                for (let i = this.depth+1; i < key.length; i++) {
                    this.hasChildWith[key[i]] = true;
                }
                if (! this.children[key[this.depth]] )  {
                    this.children[key[this.depth]] = new Trie(this.depth+1);
                }
                this.children[key[this.depth]]._insert(key, word);
            }
        }
    
        sort () {
            if (! this.isSorted) {
                this.words.sort();
                let letters = Object.keys(this.children);
                letters.sort();
    
                // Keys come out of a hash in insertion order.
                let orderedChildren = {};
                for (let i = 0; i < letters.length; i++) {
                    orderedChildren[letters[i]] = this.children[letters[i]];
                }
                this.children = orderedChildren;
            }
        }
    
        find (letters, startAt=0, maxCount=null) {
            return this._find(Array.from(letters).sort(), 0, startAt, maxCount || this.count, 0);
        }
    
        _find (key, keyPos, startAt, maxCount, curPos) {
            if (curPos + this.count < startAt) {
                return [[], curPos + this.count];
            }
    
            this.sort(); // Make sure we are sorted.
    
            let answer = [];
            if (keyPos < key.length) {
                // We have not yet found all the letters.
    
                // tmpPos will track how much we looked at in detail
                let tmpPos = curPos;
                tmpPos += this.words.length;
                for (const [k, v] of Object.entries(this.children)) {
                    if (k < key[keyPos]) {
                        // The next letter we want can be deeper in the trie?
                        if (v.hasChildWith[key[keyPos]]) {
                            // It is!  Let's find it.
                            let result = v._find(key, keyPos, startAt, maxCount - answer.length, tmpPos);
                            answer = answer.concat(result[0]);
                            if (maxCount <= answer.length) {
                                // We finished our answer, return it and what we found.
                                return [answer, result[1]];
                            }
                        }
                        tmpPos += v.count; // Didn't find it, but track that we've seen this.
                    }
                    else if (k == key[keyPos]) {
                        // We found our next letter! Search deeper.
                        let result = v._find(key, keyPos+1, startAt, maxCount - answer.length, tmpPos);
                        answer = answer.concat(result[0]);
                        if (maxCount <= answer.length) {
                            // We finished the search.
                            return [answer, result[1]];
                        }
                        else {
                            // No other letter can match.
                            break;
                        }
                    }
                    else {
                        // Neither this or any later letter can match.
                        break;
                    }
                }
                // Return our partial answer and mark that we went
                // through this whole node.
                return [answer, curPos + this.count];
            }
            else {
                // We have our letters and are recursively finding words.
                if (startAt <= curPos + this.words.length) {
                    answer = this.words.slice(startAt - curPos);
                    if (maxCount <= answer.length) {
                        // No need to search deeper, we're done.
                        answer = answer.slice(0, maxCount);
                        return [answer, curPos + answer.length];
                    }
                    curPos += answer.length;
                }
                for (const child of Object.values(this.children)) {
                    let result = child._find(key, keyPos, startAt, maxCount - answer.length, curPos);
                    answer = answer.concat(result[0])
                    if (maxCount <= answer.length) {
                        return [answer, result[1]];
                    }
                    else {
                        curPos += child.count;
                    }
                }
                return [answer, curPos];
            }
        }
    }
    
    exports.Trie = Trie
    

    And here is some code to test it with.

    const fs = require('fs')
    const AnagramTrie = require('./AnagramTrie')
    
    const words = fs.readFileSync('tmp/scrabble.csv', 'utf-8')
      .trim()
      .split(/\n+/)
      .map(x => x.trim())
    
    const trie = new AnagramTrie.Trie()
    
    words.forEach(word => trie.insert(word))
    
    console.log(trie.find('b', 0, 12))
    
    console.log(trie.find('b', 0, 6))
    console.log(trie.find('b', 11, 6))
    
    console.log(trie.find('zx', 0, 12))
    console.log(trie.find('zx', 0, 6))
    console.log(trie.find('zx', 60875, 6))
    

    It will return words in the order of their sorted anagram, followed by the word itself. So if you search for admired you'll find unadmired first because the n in un comes before the r in admired. And you'll find that disembarrassed comes before either because it has 2 a's in it.