I need to read a large CSV file in MATLAB which contains rows like this:
1, 0, 1, 0, 1
1, 0, 1, 0, 1, 0, 1, 0, 1
1, 0, 1
1, 0, 1
1, 0, 1, 0, 1
0, 1, 0, 1, 0, 1, 0, 1, 0
For reading big files I'm using textscan, however I have to define the number of expected fields in each line of the text file.
Using csvread helps, but it is too slow and seems inefficient.
Are there any methods to use textscan with an unknown number of values in each line? Or do you have any other suggestion for this situation?
Since you said "Numerical matrix padded with zeros would be good", there is a solution using textscan which can give you that. The catch, however, is that you have to know the maximum number of elements a line can have (i.e. the longest line in your file).
Provided you know that, a combination of additional parameters for textscan allows you to read incomplete lines:
If you set the parameter 'EndOfLine','\r\n', the documentation explains:
If there are missing values and an end-of-line sequence at the end of the last line in a file, then textscan returns empty values for those fields. This ensures that individual cells in output cell array, C, are the same size.
So with the example data in your question saved as differentRows.txt, the following code:
% be sure about this, better to overestimate than underestimate
maxNumberOfElementPerLine = 10 ;
% build a reading format which can accommodate the longest line
readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;
fidcsv = fopen('differentRows.txt','r') ;
M = textscan( fidcsv , readFormat , Inf ,...
'delimiter',',',...
'EndOfLine','\r\n',...
'CollectOutput',true) ;
fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix
will return:
>> M
M =
1 0 1 0 1 NaN NaN NaN NaN NaN
1 0 1 0 1 0 1 0 1 NaN
1 0 1 NaN NaN NaN NaN NaN NaN NaN
1 0 1 NaN NaN NaN NaN NaN NaN NaN
1 0 1 0 1 NaN NaN NaN NaN NaN
0 1 0 1 0 1 0 1 0 NaN
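If the longest line is not known in advance, one extra pass over the file can count it before calling textscan. The following is only a minimal sketch: it assumes the same differentRows.txt file, a plain comma delimiter with no quoted fields, and it will add some reading time on a very large file.

```matlab
% Sketch: count the fields of every line to find the longest one.
% Assumes comma-separated values without quoted fields.
fid = fopen('differentRows.txt','r') ;
if fid == -1
    error('Could not open differentRows.txt') ;
end
maxNumberOfElementPerLine = 0 ;
tline = fgetl(fid) ;
while ischar(tline)                 % fgetl returns -1 (not char) at end of file
    nFields = sum(tline == ',') + 1 ;   % number of fields = number of commas + 1
    maxNumberOfElementPerLine = max(maxNumberOfElementPerLine, nFields) ;
    tline = fgetl(fid) ;
end
fclose(fid) ;
```

The result can then be used in place of the hard-coded value when building readFormat.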
As an alternative, if it makes a significant speed difference, you could import your data as integers instead of doubles. The trouble with that is that NaN is not defined for integers, so you have 2 options:
1) Let the empty entries default to 0
Just replace the line which defines the format specifier with:
% build a reading format which can accommodate the longest line
readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
This will return:
>> M
M =
1 0 1 0 1 0 0 0 0 0
1 0 1 0 1 0 1 0 1 0
1 0 1 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
1 0 1 0 1 0 0 0 0 0
0 1 0 1 0 1 0 1 0 0
2) Replace the empty entries with a placeholder (for example 99)
Define a value which you are sure you'll never have in your original data (for quick identification of empty cells), then use the EmptyValue parameter of the textscan function:
readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
DefaultEmptyValue = 99 ; % placeholder for "empty values"
fidcsv = fopen('differentRows.txt','r') ;
M = textscan( fidcsv , readFormat , Inf ,...
'delimiter',',',...
'EndOfLine','\r\n',...
'CollectOutput',true,...
'EmptyValue',DefaultEmptyValue) ;
fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical (integer) matrix
will yield:
>> M
M =
1 0 1 0 1 99 99 99 99 99
1 0 1 0 1 0 1 0 1 99
1 0 1 99 99 99 99 99 99 99
1 0 1 99 99 99 99 99 99 99
1 0 1 0 1 99 99 99 99 99
0 1 0 1 0 1 0 1 0 99
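Once the matrix is padded with a placeholder, the real length of each row can be recovered with a logical mask. A minimal sketch, with M typed in by hand from the output above so the snippet runs on its own; the names isPadding, nValid and firstRow are illustrative only:

```matlab
% Sketch: locate the padding and recover the original rows.
DefaultEmptyValue = 99 ;
M = int32([ 1 0 1 0 1 99 99 99 99 99 ;
            1 0 1 0 1  0  1  0  1 99 ;
            1 0 1 99 99 99 99 99 99 99 ;
            1 0 1 99 99 99 99 99 99 99 ;
            1 0 1 0 1 99 99 99 99 99 ;
            0 1 0 1 0  1  0  1  0 99 ]) ;

isPadding = (M == DefaultEmptyValue) ;  % true where a field was missing
nValid    = sum(~isPadding, 2) ;        % number of real values in each row
firstRow  = M(1, 1:nValid(1)) ;         % first line without its padding
```

This only works if the placeholder can never appear as a genuine value in the data, which is the same assumption the EmptyValue approach relies on.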