Tags: matlab, sparse-matrix, term-document-matrix, nmf

MATLAB nnmf() - large term-document matrix - memory and speed issue


I have a large term-document matrix and want to factorize it with the non-negative matrix factorization function that MATLAB offers, nnmf(). The problem is that after the first iteration the memory usage rises rapidly and hits the limit (my system has 6 GB of RAM), while the CPU usage drops very low (about 1%-5%). The whole system behaves as if it has crashed, and only after waiting for ages does the second iteration finish. (Note that many more iterations are needed to get good results.)
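
For reference, here is a minimal sketch of the kind of call involved, assuming k = 4 factors (matching the sizes of w and h reported further down); the 'Algorithm' and 'Options' name-value pairs are as documented for the toolbox nnmf(), but check them against your release:

k   = 4;                                            % number of factors
opt = statset('Display', 'iter', 'MaxIter', 50);    % report progress each iteration
[w, h] = nnmf(spa, k, 'Algorithm', 'mult', 'Options', opt);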

Question:

If anyone has experience with this, or has run nnmf() on even larger matrices than mine, I would really like to know how they actually overcame the problem described above.

Also: I have done this with a smaller matrix (about 7000x1800) and had no problems. I use sparse matrices because a term-document matrix has mostly zero elements, which reduces the storage required. For example, in my case the term-document matrix has 14608 * 18828 = 275039424 elements, of which sum(sum(spa~=0)) = 1312582 are non-zero:

>> whos
Name          Size                    Bytes  Class     Attributes

full      14608x18828            2200315392  double              
spa       14608x18828              21151944  double    sparse    
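
For what it's worth, the same figures can be read off with the built-in nnz(), which is equivalent to the sum(sum(spa~=0)) expression above (a variable named 'full', as in the listing, shadows MATLAB's built-in full() function, so it is avoided in this sketch):

nnz(spa)                      % 1312582 non-zero entries
numel(spa)                    % 14608 * 18828 = 275039424 entries in total
nnz(spa) / numel(spa)         % density: roughly 0.5 %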

Solution

  • Something that finally worked:

    I checked the nnmf.m file (the algorithm implementation shipped with MATLAB) and tried to understand the code. There is a variable called 'd' that is computed as d = a - w*h; it is a full matrix with the same dimensions as 'a' (i.e. the large term-document matrix); a small illustration of why d comes out full follows the listing below:

    Name             Size                    Bytes  Class      Attributes
    a            14608x18828              21151944  double     sparse    
    d            14608x18828            2200315392  double               
    ...
    h                4x18828                602496  double               
    h0               4x18828                602496  double               
    ...
    w            14608x4                    467456  double               
    w0           14608x4                    467456  double   
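
    The reason d ends up dense is that w*h is a full matrix, and subtracting a full matrix from a sparse one yields a full result. A tiny illustration with made-up sizes (not the nnmf code itself):

    a = sprand(5, 5, 0.1);      % sparse matrix, ~10 % non-zeros
    w = rand(5, 2);
    h = rand(2, 5);
    d = a - w*h;                % w*h is full, so d is full as well
    issparse(d)                 % returns logical 0 (false)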
    

    To save some memory, I used clear to remove this matrix as soon as it is no longer needed. This part of the original nnmf.m file:

    d = a - w*h;
    dnorm = sqrt(sum(sum(d.^2))/nm);
    dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
    dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
    delta = max(dw,dh);
    

    was replaced with this new one:

    d = a - w*h;
    dnorm = sqrt(sum(sum(d.^2))/nm);
    clear d;
    dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
    dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
    delta = max(dw,dh);
    

    clear d was added there because d is not used after that point. For the term-document matrix described above, this ran without causing memory problems.
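
    An alternative worth mentioning (not part of the shipped nnmf.m; just a sketch under the assumption that d is only ever used to compute dnorm) is to avoid materialising the residual at all, via the expansion ||A - W*H||_F^2 = ||A||_F^2 - 2*<A, W*H> + ||W*H||_F^2:

    normA2 = sum(nonzeros(a).^2);            % ||A||_F^2, touching only the non-zeros
    crossT = sum(sum((w'*a) .* h));          % <A, W*H> via a k-by-n intermediate
    wh2    = sum(sum((w'*w) .* (h*h')));     % ||W*H||_F^2 from k-by-k matrices
    dnorm  = sqrt(max(normA2 - 2*crossT + wh2, 0) / nm);   % guard against round-off

    With k = 4, the largest intermediate here is the 4x18828 product w'*a, so the peak memory stays close to that of the sparse matrix itself.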