Having had great success with my previous question about bigram decomposition in M, I am now encountering a new problem.
I am trying to count the matching pairs between two lists of bigrams and both methods I have tried sometimes undercount and I cannot put my finger on why exactly. As an example, the matches between {"He","el","ll","lo"} and {"He","el","lo"} correctly counts 3 matches (well, 6, but I am intentionally doubling the count within the functions), whereas {"He","el","ll","lo"} and {"Hi","il","lo"} incorrectly counts 0 matches instead of 1.
I am using List.Sort() for inputs prior to implementation, though the issue is invariant to the sorting.
Both my functions are based on the counter within the java implementation of the Dice similarity page in wikibooks
snippet here:
int matches = 0, i = 0, j = 0;
while (i < n && j < m)
{
if (sPairs[i] == tPairs[j])
{
matches += 2;
i++;
j++;
}
else if (sPairs[i] < tPairs[j])
i++;
else
j++;
}
My initial crack at it resulted in a recursive function:
(x as list, y as list, i as number, j as number, matches) as number =>
let
matcher = if x{i} = y{j} then matches + 2 else matches,
ineq = if x{i} < y{j} then 1 else 0,
Check = if i = List.Count(x) - 1 or j = List.Count(y) - 1 then matcher
else if matcher > matches then @Counter1(x,y,i+1,j+1,matcher)
else if ineq = 1 then @Counter1(x,y,i+1,j,matcher)
else @Counter1(x,y,i,j+1,matcher)
in Check
In order to avoid the recursion due to potential speed issues, I also had copilot write me a function using list.accumulate
(x as list, y as list) as number =>
let
n = List.Count(x),
m = List.Count(y),
matches = List.Accumulate(
{0..n-1},
[i = 0, j = 0, matches = 0],
(state, current) =>
if state[i] < n and state[j] < m then
if x{state[i]} = y{state[j]} then
[i = state[i] + 1, j = state[j] + 1, matches = state[matches] + 2]
else if x{state[i]} < y{state[j]} then
[i = state[i] + 1, j = state[j], matches = state[matches]]
else
[i = state[i], j = state[j] + 1, matches = state[matches]]
else
state
)[matches]
in
matches
From what I can see, both functions give identical outputs, and the second definitely feels faster, though this means they also both give that undercounting issue.
The only thing that comes to mind is that the java code I adapted makes use of the binary representation of the letter pairs to compare them whereas I'm unsure as to how M is comparing the letter pairs in my functions.
Any help would be greatly appreciated!
let list1= {"He","el","ll","lo"}, //alternatively Source[Column1]
list2={"He","el","lo"} , //alternatively Source[Column2]
matching_list = List.Intersect({list1,list2}), // list of matches
count_matching=List.Count(List.Intersect({list1,list2})) // count of matches
in count_matching