powerbipowerquerym

Counting matching pairs of letters in lists of bigrams in Power Query M


Having had great success with my previous question about bigram decomposition in M, I am now encountering a new problem.

I am trying to count the matching pairs between two lists of bigrams and both methods I have tried sometimes undercount and I cannot put my finger on why exactly. As an example, the matches between {"He","el","ll","lo"} and {"He","el","lo"} correctly counts 3 matches (well, 6, but I am intentionally doubling the count within the functions), whereas {"He","el","ll","lo"} and {"Hi","il","lo"} incorrectly counts 0 matches instead of 1.

I am using List.Sort() for inputs prior to implementation, though the issue is invariant to the sorting.

Both my functions are based on the counter within the java implementation of the Dice similarity page in wikibooks

snippet here:

int matches = 0, i = 0, j = 0;
    while (i < n && j < m)
    {
        if (sPairs[i] == tPairs[j])
        {
            matches += 2;
            i++;
            j++;
        }
        else if (sPairs[i] < tPairs[j])
            i++;
        else
            j++;
    }

My initial crack at it resulted in a recursive function:

(x as list, y as list, i as number, j as number, matches) as number => 
let 
    matcher = if x{i} = y{j} then matches + 2 else matches,
    ineq = if x{i} < y{j} then 1 else 0,
    Check = if i = List.Count(x) - 1 or j = List.Count(y) - 1 then matcher
            else if matcher > matches then @Counter1(x,y,i+1,j+1,matcher)
            else if ineq = 1 then @Counter1(x,y,i+1,j,matcher)
            else @Counter1(x,y,i,j+1,matcher)
in Check

In order to avoid the recursion due to potential speed issues, I also had copilot write me a function using list.accumulate

(x as list, y as list) as number => 
let 
    n = List.Count(x),
    m = List.Count(y),
    matches = List.Accumulate(
        {0..n-1},
        [i = 0, j = 0, matches = 0],
        (state, current) =>
            if state[i] < n and state[j] < m then
                if x{state[i]} = y{state[j]} then
                    [i = state[i] + 1, j = state[j] + 1, matches = state[matches] + 2]
                else if x{state[i]} < y{state[j]} then
                    [i = state[i] + 1, j = state[j], matches = state[matches]]
                else
                    [i = state[i], j = state[j] + 1, matches = state[matches]]
            else
                state
    )[matches]
in
    matches

From what I can see, both functions give identical outputs, and the second definitely feels faster, though this means they also both give that undercounting issue.

The only thing that comes to mind is that the java code I adapted makes use of the binary representation of the letter pairs to compare them whereas I'm unsure as to how M is comparing the letter pairs in my functions.

Any help would be greatly appreciated!


Solution

  • let list1= {"He","el","ll","lo"}, //alternatively Source[Column1]
    list2={"He","el","lo"} , //alternatively Source[Column2]
    matching_list = List.Intersect({list1,list2}), // list of matches
    count_matching=List.Count(List.Intersect({list1,list2}))  // count of matches
    in count_matching