I have a large term-document matrix and want to factorize it with nnmf(), the non-negative matrix factorization function MATLAB offers. The problem is that after the first iteration memory usage climbs rapidly until it hits the limit (my system has 6 GB), while CPU usage drops very low (about 1%-5%). The whole system behaves as if it has crashed, and the second iteration finishes only after a very long wait. (Note that many more iterations are needed to get good results.)
Question:
If anyone has experience with this, or has run nnmf() on even larger matrices than mine, I would really like to know how they overcame this problem.
Also: I have done this with a smaller matrix (about 7000x1800) and had no problems. I use sparse matrices because a term-document matrix has many zero elements, and sparse storage reduces the space it needs. For example, in my case the term-document matrix has 14608 * 18828 = 275039424 elements, of which sum(sum(spa~=0)) = 1312582 are nonzero:
>> whos
  Name      Size              Bytes         Class     Attributes
  full      14608x18828       2200315392    double
  spa       14608x18828       21151944      double    sparse
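The byte counts in the whos output can be reproduced by hand. The sketch below assumes 64-bit MATLAB, where a sparse double matrix stores 8 bytes for the value and 8 bytes for the row index of each nonzero, plus one 8-byte column pointer per column and one extra; the variable names are only for illustration.

```matlab
% Dimensions and nonzero count taken from the question.
m  = 14608;
n  = 18828;
nz = 1312582;

% Full double matrix: 8 bytes per element.
fullBytes = m * n * 8;

% Sparse double matrix (assumed 64-bit layout):
% 8 bytes value + 8 bytes row index per nonzero, plus n+1 column pointers.
sparseBytes = nz * (8 + 8) + (n + 1) * 8;

fprintf('full:    %d bytes\n', fullBytes);
fprintf('sparse:  %d bytes\n', sparseBytes);
fprintf('density: %.4f%%\n', 100 * nz / (m * n));
```

Both figures agree with the Bytes column above, so the sparse copy is roughly a hundred times smaller than the full one.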
Something that finally worked:
I checked the nnmf.m file (the algorithm implementation provided by MATLAB) and tried to understand the code. There is a variable called 'd', computed as d = a - w*h;, which is a full matrix with the same dimensions as 'a' (i.e. the large term-document matrix):
  Name      Size              Bytes         Class     Attributes
  a         14608x18828       21151944      double    sparse
  d         14608x18828       2200315392    double
  ...
  h         4x18828           602496        double
  h0        4x18828           602496        double
  ...
  w         14608x4           467456        double
  w0        14608x4           467456        double
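The reason 'd' ends up full can be seen on a small stand-in: w*h is a dense matrix, and subtracting a dense matrix from a sparse one gives a dense result. This is a minimal sketch with made-up sizes and random data, not the matrices from the question.

```matlab
% Small random stand-ins; sizes and density are arbitrary.
a = sprand(1000, 1200, 0.005);   % sparse, like the term-document matrix
w = rand(1000, 4);               % dense factor
h = rand(4, 1200);               % dense factor

d = a - w*h;                     % sparse minus dense

info_a = whos('a');
info_d = whos('d');
fprintf('a: sparse=%d, %d bytes\n', issparse(a), info_a.bytes);
fprintf('d: sparse=%d, %d bytes\n', issparse(d), info_d.bytes);
```

Here 'd' takes 8 bytes for every one of its 1000*1200 elements, no matter how sparse 'a' is.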
To save memory, I used clear to remove this matrix as soon as it is no longer needed. Part of the old nnmf.m file:
d = a - w*h;
dnorm = sqrt(sum(sum(d.^2))/nm);
dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
delta = max(dw,dh);
was replaced with this new one:
d = a - w*h;
dnorm = sqrt(sum(sum(d.^2))/nm);
clear d;
dw = max(max(abs(w-w0) / (sqrteps+max(max(abs(w0))))));
dh = max(max(abs(h-h0) / (sqrteps+max(max(abs(h0))))));
delta = max(dw,dh);
clear d was added there because d is never used after that point. For the term-document matrix I was using, this worked without causing memory problems.
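A possible further step, which I have not tried inside nnmf.m itself, is to compute dnorm without ever building the full matrix d. It relies on the identity ||a - w*h||_F^2 = ||a||_F^2 - 2*sum((w'*a).*h) + sum((w'*w).*(h*h')), where every intermediate is at most k-by-m (k = 4 here). The sketch below checks this on small random stand-in data; the names a, w, h and nm mirror those in nnmf.m, everything else is hypothetical.

```matlab
% Small random stand-ins for the variables used in nnmf.m.
rng(0);
a  = sprand(200, 300, 0.01);
w  = rand(200, 4);
h  = rand(4, 300);
nm = numel(a);

% Reference value: the dense residual, as in the original code.
d = a - w*h;
dnorm_dense = sqrt(sum(sum(d.^2)) / nm);
clear d;

% Same quantity from small intermediates only.
wta = full(w' * a);                        % k-by-m
sq  = full(sum(sum(a.^2))) ...             % ||a||_F^2
    - 2 * sum(sum(wta .* h)) ...           % cross term
    + sum(sum((w' * w) .* (h * h')));      % ||w*h||_F^2
dnorm_lowmem = sqrt(max(sq, 0) / nm);      % clamp guards against round-off

fprintf('dense: %.12f  low-memory: %.12f\n', dnorm_dense, dnorm_lowmem);
```

The two values should agree up to floating-point round-off. Because the formula subtracts large, nearly equal numbers when the fit is very good, it may lose some precision in that case, which is why the result is clamped at zero before taking the square root.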