
How to pre-allocate a table with non-scalar sized variables?


I was playing around with tables as a replacement for regular numerical arrays for various reasons, when I came across the following challenge: how to (pre-)allocate a table with non-scalar variables?

Given a loop like so:

function A = myfun(...)
N = 1e6;   % placeholder: some large number
A = zeros(N,4);

for i = 1:N
   % do stuff that computes scalar (1x1) and vector (1x3)
   A(i,:) = [scalar, vector];
end

I want to instead return a table with named variables.

I could simply rewrite it to say:

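function T = myfun2(...)
N = 1e6;   % placeholder: some large number
A = zeros(N,4);

for i = 1:N
   % do stuff that computes scalar (1x1) and vector (1x3)
   A(i,:) = [scalar, vector];
end
% wrap the finished array once: column 1 is 'scalar', columns 2:end are 'vector'
T = table(A(:,1), A(:,2:end),'VariableNames',{'scalar','vector'});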

which obviously yields a table with the format:

T =

  N×2 table

    scalar      vector   
    ______    ___________

      0       0    0    0
      0       0    0    0
      0       0    0    0
     ...          ...

Now, if I instead wanted to pre-allocate the output table and update it for every iteration I would try something along the lines of:

function T = myfun3(...)
N = 1e6;   % placeholder: some large number
T = table('Size',[N,2],...
       'VariableTypes',{'double','double'},...
       'VariableNames',{'scalar', 'vector'});

for i = 1:N
   % do stuff that computes scalar (1x1) and vector (1x3)
   T(i,:) = {scalar, vector};
end

The problem with myfun3 is that the format of T is:

T =

  N×2 table

    scalar    vector
    ______    ______

      0         0   
      0         0   
      0         0   

So the variable 'vector' is now a scalar column instead of an array/vector. From the table documentation, it does not seem that the 'Size' pre-allocation syntax can take per-variable array sizes.

Q1: How does one go about pre-allocating a table with non-scalar variables?

Q2: If A in myfun2 is large, is the overhead bad or is this an acceptable solution?

I am concerned that the overhead of indexing into and out of a table is so large compared to a numerical array that it will adversely affect performance-critical code.

======= EDIT =======

I contacted MathWorks and they confirmed that as of MATLAB R2019b there is no way of achieving Q1 with the 'Size' parameter.


Solution

  • You can create the table before the for-loop, then access it by column names:

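    function T = myfun2(...)
    N = 1e6;   % placeholder: some large number
    A = zeros(N,4);
    % build the table once, before the loop; 'vector' is an N-by-3 variable
    T = table(A(:,1), A(:,2:end),'VariableNames',{'scalar','vector'});
    for i = 1:N
       % do stuff that computes scalar_i (1x1) and vector_i (1x3)
       T.scalar(i,:) = scalar_i;
       T.vector(i,:) = vector_i;
       % or in one line: T(i,:) = table(scalar_i, vector_i);
    end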
    

    I am not sure that creating a small temporary table on every iteration is efficient, so it is probably better to assign to one column at a time.
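
    As a minimal, self-contained sketch of the column-wise approach (the size N = 5 and the values written in the loop are placeholders chosen for illustration, not part of the original problem), the table can be built from zero-filled arrays of the desired shapes and then filled row by row. The second variant is an assumption about what might be acceptable: it uses the 'Size' syntax and then replaces the scalar column with an N-by-3 matrix, which still allocates a throwaway column first.

    ```matlab
    % Sketch: pre-allocate a table whose second variable is N-by-3.
    N = 5;                                   % small illustrative size (placeholder)
    T = table(zeros(N,1), zeros(N,3), 'VariableNames', {'scalar','vector'});

    for i = 1:N
        scalar_i = i;                        % placeholder for the real computation
        vector_i = i * [1 2 3];              % placeholder 1-by-3 result
        T.scalar(i)   = scalar_i;            % write into one variable at a time
        T.vector(i,:) = vector_i;
    end

    % Variant: start from the 'Size' syntax, then overwrite the whole variable
    % with an N-by-3 array (the row count must match the table height).
    T2 = table('Size', [N 2], ...
               'VariableTypes', {'double','double'}, ...
               'VariableNames', {'scalar','vector'});
    T2.vector = zeros(N,3);
    ```

    In both cases size(T.vector) is [N 3], so T.vector(i,:) can be assigned inside the loop without the table growing.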

    NOTE

    As Juhl pointed out in the comments, building a table from temporary arrays might allocate the data twice, whereas with the 'Size' argument you would expect a single chunk of data to be allocated.

    So let's check this. On my computer, using MATLAB R2019a, memory reports:

    >> memory
    Maximum possible array:       56239 MB (5.897e+10 bytes) *
    

    So I can allocate about 5.897e10 / 8 ≈ 7.37e9 elements in a single array (a double takes 8 bytes). Let's say that I want to create a table with one column holding more than half of this (more than about 3.69e9 elements), for example 4e9:

    >> T = table(zeros(4e9,1));
    >> memory
    Maximum possible array:       33644 MB (3.528e+10 bytes)
    

    It takes a long time to allocate, but finishes. With 'Size', it is exactly the same:

    >> T = table('Size', [4e9, 1], 'VariableTypes',{'double'});
    >> memory
    Maximum possible array:       33677 MB (3.531e+10 bytes) *
    

    So it appears that neither approach allocates the data twice.

    There is one fun fact: T initially takes less memory than expected (4e9 doubles should need 3.2e10 bytes). If I modify the last element of my table, the memory consumption rises to roughly the expected size:

    >> T.Var1(end) = 1;
    >> memory
    Maximum possible array:       27574 MB (2.891e+10 bytes)
    

    DISCLOSURE

    Please note that modifying this kind of table takes time:

    >> tic; T.Var1(end) = 1; toc
    Elapsed time is 33.286967 seconds.
    

    So my conclusion is: work with normal arrays, which are A LOT faster:

    >> tic; T = table('Size', [4e9, 1], 'VariableTypes',{'double'}); toc
    Elapsed time is 15.997680 seconds.
    >> tic; T.Var1(end) = 1; toc
    Elapsed time is 33.286967 seconds.
    >> clear T;
    
    >> tic; A = zeros(4e9,1); toc
    Elapsed time is 0.043366 seconds.
    >> tic; A(end) = 1; toc
    Elapsed time is 0.002430 seconds.
    >> clear A;
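
    To check the same trade-off at a size that fits on any machine, here is a rough timing sketch. It assumes a modest N = 1e4 and placeholder row values, and compares filling a numeric array and wrapping it in a table once against writing into the table variables inside the loop. The absolute timings will vary by machine and MATLAB release, so no numbers are claimed here.

    ```matlab
    % Sketch: array-then-table versus table indexing inside the loop.
    N = 1e4;                                 % modest size (placeholder)

    tic;
    A = zeros(N,4);
    for i = 1:N
        A(i,:) = [i, i*[1 2 3]];             % placeholder for the real computation
    end
    T1 = table(A(:,1), A(:,2:end), 'VariableNames', {'scalar','vector'});
    tArray = toc;

    tic;
    T2 = table(zeros(N,1), zeros(N,3), 'VariableNames', {'scalar','vector'});
    for i = 1:N
        T2.scalar(i)   = i;                  % same placeholder values as above
        T2.vector(i,:) = i*[1 2 3];
    end
    tTable = toc;

    fprintf('array then table: %.4f s, table in loop: %.4f s\n', tArray, tTable);
    ```

    Both paths produce the same table, so if the loop is the hot spot, filling the array and converting once at the end (as in myfun2 from the question) is a reasonable default.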