I was playing around with tables as a replacement for regular numerical arrays for various reasons, when I came across the following challenge: how do you (pre-)allocate a table with non-scalar variables?
Given a loop like so:
function A = myfun(...)
    N = ...;  % some large number
    A = zeros(N,4);
    for i = 1:N
        % do stuff
        A(i,:) = [scalar, vector];
    end
end
I want to instead return a table with named variables.
I could simply rewrite it to say:
function T = myfun2(...)
    N = ...;  % some large number
    A = zeros(N,4);
    for i = 1:N
        % do stuff
        A(i,:) = [scalar, vector];
    end
    T = table(A(:,1), A(:,2:end), 'VariableNames', {'scalar','vector'});
end
which obviously yields a table with the format:
T =

  N×2 table

    scalar        vector
    ______    _____________

      0        0    0    0
      0        0    0    0
      0        0    0    0
     ...           ...
Now, if I instead wanted to pre-allocate the output table and update it for every iteration I would try something along the lines of:
function T = myfun3(...)
    N = ...;  % some large number
    T = table('Size', [N,2], ...
              'VariableTypes', {'double','double'}, ...
              'VariableNames', {'scalar','vector'});
    for i = 1:N
        % do stuff
        T(i,:) = {scalar, vector};
    end
end
The problem with myfun3 is that the format of T is:
T =

  N×2 table

    scalar    vector
    ______    ______

      0         0
      0         0
      0         0
So clearly the variable 'vector' is now a scalar instead of an array/vector. Reading the table documentation, it does not seem like the 'Size'-based pre-allocation can take array sizes for individual variables.

Q1: How does one pre-allocate a table with non-scalar variables?
Q2: If A in myfun2 is large, is the overhead bad, or is this an acceptable solution?

My concern is that the overhead of indexing into/out of a table, compared to a numerical array, is so large that it will adversely affect performance-critical code.
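For concreteness, here is the kind of tic/toc sketch (dummy data and sizes of my own choosing, so only the relative timings matter) I have in mind when I worry about that overhead:

% Rough overhead check (dummy data; absolute timings will vary by machine).
N = 1e5;

A = zeros(N, 4);
tic;
for i = 1:N
    A(i,:) = [i, 1, 2, 3];          % plain numeric indexing
end
toc

T = table(zeros(N,1), zeros(N,3), 'VariableNames', {'scalar','vector'});
tic;
for i = 1:N
    T.scalar(i)   = i;              % table dot-indexing
    T.vector(i,:) = [1, 2, 3];
end
toc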
======= EDIT =======
I contacted MathWorks and they confirmed that, as of MATLAB R2019b, there is no way of achieving Q1 with the 'Size' parameter.
You can create the table before the for-loop, then access it by variable name:
function T = myfun2(...)
    N = ...;  % some large number
    A = zeros(N,4);
    T = table(A(:,1), A(:,2:end), 'VariableNames', {'scalar','vector'});
    for i = 1:N
        % do stuff
        T.scalar(i)   = scalar_i;
        T.vector(i,:) = vector_i;
        % or in one line: T(i,:) = table(scalar_i, vector_i);
    end
end
I am not sure that creating a little one-row table on each iteration is efficient, so you may prefer assigning one variable at a time.
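To check that, here is a rough comparison of the two update styles (dummy data of my own; only the relative timings matter). I would expect the whole-row form to be slower, since it has to build a temporary container on every pass:

% Rough comparison of the two update styles (dummy data; timings vary).
N = 1e4;
T = table(zeros(N,1), zeros(N,3), 'VariableNames', {'scalar','vector'});

tic;
for i = 1:N
    T(i,:) = {i, [1, 2, 3]};        % whole-row assignment, builds a temporary
end
toc

tic;
for i = 1:N
    T.scalar(i)   = i;              % one variable at a time, no temporary row
    T.vector(i,:) = [1, 2, 3];
end
toc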
NOTE
As Juhl pointed out in the comments, building a table from temporary objects may incur double allocation, whereas with the 'Size' argument you would expect only one chunk of data to be allocated.

So let's check this. On my computer, running MATLAB R2019a, memory reports:
>> memory
Maximum possible array: 56239 MB (5.897e+10 bytes) *
So I can allocate 56.239e9 / 8 = 7.0299e9 elements in a single array (a double takes 8 bytes). Let's say I want to create a table with one column holding more than half of that, rounding 3.51e9 up to 4e9 elements:
>> T = table(zeros(4e9,1));
>> memory
Maximum possible array: 33644 MB (3.528e+10 bytes)
It takes a long time to allocate, but it finishes. With 'Size', it is exactly the same:
>> T = table('Size', [4e9, 1], 'VariableTypes', {'double'});
>> memory
Maximum possible array: 33677 MB (3.531e+10 bytes) *
So it appears that we don't have double allocation.
One fun fact: the memory taken by T is less than expected. If I then modify the last element of the table, memory consumption grows up to the expected size:
>> T.Var1(end) = 1;
>> memory
Maximum possible array: 27574 MB (2.891e+10 bytes)
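If you prefer scripting the check instead of reading the memory printouts by hand, something along these lines works (memory is Windows-only, and I use a smaller N of my own choosing so it runs on an ordinary machine):

% Scripted version of the checks above ('memory' is Windows-only;
% N scaled down so the test runs comfortably).
N  = 1e8;                                       % 1e8 doubles ~ 0.8 GB
m0 = memory;
T  = table('Size', [N, 1], 'VariableTypes', {'double'});
m1 = memory;
T.Var1(end) = 1;                                % force the zeros to be committed
m2 = memory;
fprintf('free before:       %.3e bytes\n', m0.MaxPossibleArrayBytes);
fprintf('free after create: %.3e bytes\n', m1.MaxPossibleArrayBytes);
fprintf('free after write:  %.3e bytes\n', m2.MaxPossibleArrayBytes);

Whether the lazy-allocation effect is still visible at this smaller size will depend on the operating system.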
DISCLOSURE
Please note that modifying this kind of table takes time:
>> tic; T.Var1(end) = 1; toc
Elapsed time is 33.286967 seconds.
So my conclusion is: work with normal arrays; they are A LOT faster:
>> tic; T = table('Size', [4e9, 1], 'VariableTypes',{'double'}); toc
Elapsed time is 15.997680 seconds.
>> tic; T.Var1(end) = 1; toc
Elapsed time is 33.286967 seconds.
>> clear T;
>> tic; A = zeros(4e9,1); toc
Elapsed time is 0.043366 seconds.
>> tic; A(end) = 1; toc
Elapsed time is 0.002430 seconds.
>> clear A;
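So the takeaway, as a sketch with dummy data (essentially myfun2 from the question): fill a plain numeric array in the hot loop, and pay the table-construction cost once at the end:

% Fill a numeric array in the loop, convert to a table once afterwards.
N = 1e6;
A = zeros(N, 4);
for i = 1:N
    A(i,:) = [i, 1, 2, 3];          % "do stuff" placeholder
end
T = table(A(:,1), A(:,2:end), 'VariableNames', {'scalar','vector'});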