performancematlabfor-loop

MATLAB - Comparing each row of two matrices using ismember without loop


Say I have two matrices:

A = [1;
     2;
     3;
     4;
     5]

and

B = [1 1 2 1;
     2 3 3 4;
     5 5 5 5;
     1 4 4 4;
     5 5 1 2]

I need a resulting matrix showing how many times an item from A has appeared on the respective row in B, like this:

C = [3;
     1;
     0;
     3;
     2]

Currently, I am doing that using ismember inside of a for loop and adding the result per row. But it is taking forever because of the size of my matrices. Is there any way to do that without a loop?

Editing to show my current code:

for i=1:1:length(A)
    C(i) = sum(ismember(B(i,:),A(i)),2);
end

But, in reality, my matrices have over 500k lines, and I would love to make this code more efficient by removing the loop.


Solution

  • Short Answer

    You can just use ==, which will use implicit expansion to compare each element of A to the entire row of B and output a logical array of matches, then sum across each row to count the matches

    C = sum( A == B, 2 );
    

    Depending slightly on how A and B are generated, it's often a good idea to avoid direct equality comparisons on floating point numbers (see here), so it might be more robust to make sure A and B are within some small tolerance, i.e.

    tol = 1e-8;
    C = sum( abs(A-B) < tol, 2 );
    

    The subtraction here will use implicit expansion the same as == in the first example to propagate A across the columns of B.


    Benchmarking

    Out of interest I've done some benchmarking between four cases

    1. Code exactly as posted in your question, which is a for loop without pre-allocating the output array C so it grows with the loop.
    2. The same code but pre-allocating C as an array of NaN the correct size, this is typically a good way to speed up loops. Speed up vs (1) of ~18%.
    3. The same pre-allocated loop but using == instead of ismember, since you're comparing integers I would expect == to be faster than ismember. Speed up vs (1) of ~62%.
    4. The above code without using loops at all. The == or abs(..)<tol approaches are both similarly fast so I've only shown the latter. Speed up vs (1) of ~94%.

    Results for test case with length(A) = 1e6 and 100 columns in B:

    With loop and ismember:              0.65sec
    With preallocated loop and ismember: 0.53sec
    With preallocated loop and ==:       0.25sec
    With no loop and ==:                 0.04sec
    

    Your mileage may vary on these time savings depending on input matrix sizes, expected number of matches, etc. Full code below for you to test other cases.

    Code:

    N = 1e6;
    A = (1:N).';
    B = randi([1,N],N,1e2);
    
    t = [ 
        timeit( @() withloop(A,B) )
        timeit( @() withloop_prealloc(A,B) )
        timeit( @() withloop_prealloceq(A,B) )
        timeit( @() noloop(A,B) )
        ];
    
    fprintf( 'With loop and ismember:              %.2fsec\n', t(1) );
    fprintf( 'With preallocated loop and ismember: %.2fsec\n', t(2) );
    fprintf( 'With preallocated loop and ==:       %.2fsec\n', t(3) );
    fprintf( 'With no loop and ==:                 %.2fsec\n', t(4) );
    
    function withloop(A,B)
        for i=1:1:length(A)
            C(i) = sum(ismember(B(i,:),A(i)),2);
        end
    end
    function withloop_prealloc(A,B)
        C = NaN(length(A),1);
        for i=1:1:length(A)
            C(i) = sum(ismember(B(i,:),A(i)),2);
        end
    end
    function withloop_prealloceq(A,B)
        C = NaN(length(A),1);
        for i=1:1:length(A)
            C(i) = sum(B(i,:) == A(i),2);
        end
    end
    function noloop(A,B)
        C = sum( abs(A-B)<1e-8, 2 );
    end