matlab csv import-from-csv textscan

Read a CSV file in MATLAB with rows of different sizes


I need to read a big CSV file in MATLAB which contains rows like this:

1, 0, 1, 0, 1
1, 0, 1, 0, 1, 0, 1, 0, 1
1, 0, 1
1, 0, 1
1, 0, 1, 0, 1
0, 1, 0, 1, 0, 1, 0, 1, 0

For reading big files I'm using textscan; however, it requires me to define the number of expected values in each line of the text file.

Using csvread works, but it is too slow and seems inefficient. Is there a way to use textscan with an unknown number of values in each line? Or do you have any other suggestion for this situation?


Solution

  • Since you said "Numerical matrix padded with zeros would be good", there is a solution using textscan which can give you that. The catch, however, is that you have to know the maximum number of elements a line can have (i.e. the length of the longest line in your file).

    Provided you know that, a combination of additional textscan parameters allows you to read incomplete lines:

    If you set the parameter 'EndOfLine','\r\n', the documentation explains:

    If there are missing values and an end-of-line sequence at the end of the last line in a file, then textscan returns empty values for those fields. This ensures that individual cells in output cell array, C, are the same size.

    So with the example data in your question saved as differentRows.txt, the following code:

    % be sure about this, better to overestimate than underestimate
    maxNumberOfElementPerLine = 10 ;
    
    % build a reading format which can accommodate the longest line
    readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;
    
    fidcsv = fopen('differentRows.txt','r') ;
    
    M = textscan( fidcsv , readFormat , Inf ,...
        'delimiter',',',...
        'EndOfLine','\r\n',...
        'CollectOutput',true) ;
    
    fclose(fidcsv) ;
    M = cell2mat(M) ; % convert to numerical matrix
    

    will return:

    >> M
    M =
         1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
         1     0     1     0     1     0     1     0     1   NaN
         1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
         1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
         1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
         0     1     0     1     0     1     0     1     0   NaN
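
    Since the request was for zero padding rather than NaN, the missing entries can be replaced after reading. The following is only a minimal sketch: it uses a small stand-in matrix instead of the real textscan output, it assumes NaN never occurs as genuine data in the file, and the variable name isPadding is made up for illustration:

    ```matlab
    % Stand-in for the matrix returned by textscan (NaN marks missing fields)
    M = [1 0 1 NaN NaN ; 0 1 0 1 0] ;

    % Logical mask of the padded positions, kept so that real zeros can
    % still be told apart from padding afterwards
    isPadding = isnan(M) ;

    % Replace the padding with zeros
    M(isPadding) = 0 ;
    ```

    Keeping the mask is optional, but once the NaN values are overwritten there is no other way to distinguish a padded 0 from a 0 that was actually in the file.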
    

    As an alternative, if it makes a significant speed difference, you could import your data as integers instead of doubles. The trouble is that NaN is not defined for integers, so you have 2 options:

    The first is to accept the default fill value of 0: just replace the line which defines the format specifier with:

    % build a reading format which can accommodate the longest line
    readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
    

    This will return:

    >> M
    M =
    1   0   1   0   1   0   0   0   0   0
    1   0   1   0   1   0   1   0   1   0
    1   0   1   0   0   0   0   0   0   0
    1   0   1   0   0   0   0   0   0   0
    1   0   1   0   1   0   0   0   0   0
    0   1   0   1   0   1   0   1   0   0
    

    The second option is to define a value which you are sure will never appear in your original data (for quick identification of empty cells), then pass it through the EmptyValue parameter of the textscan function:

    readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
    DefaultEmptyValue = 99 ; % placeholder for "empty values"
    
    fidcsv = fopen('differentRows.txt','r') ;
    M = textscan( fidcsv , readFormat , Inf ,...
        'delimiter',',',...
        'EndOfLine','\r\n',...
        'CollectOutput',true,...
        'EmptyValue',DefaultEmptyValue) ;
    
    fclose(fidcsv) ;
    M = cell2mat(M) ; % convert to numerical (int32) matrix
    

    will yield:

    >> M
    M =
    1   0   1   0   1   99  99  99  99  99
    1   0   1   0   1   0   1   0   1   99
    1   0   1   99  99  99  99  99  99  99
    1   0   1   99  99  99  99  99  99  99
    1   0   1   0   1   99  99  99  99  99
    0   1   0   1   0   1   0   1   0   99
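
    All of the above relies on knowing the longest line in advance. If you do not, one possible approach is a preliminary pass over the file that counts the delimiters on each line. This is only a sketch under some assumptions: fields are separated by a single comma, no field contains a quoted comma, and the extra pass is affordable for your file size. The temporary file and the name sampleFile exist only to make the example self-contained:

    ```matlab
    % Write a few sample rows to a temporary file (illustration only)
    sampleFile = [tempname '.txt'] ;
    fid = fopen(sampleFile,'w') ;
    fprintf(fid,'%s\n', '1, 0, 1, 0, 1', '1, 0, 1, 0, 1, 0, 1, 0, 1', '1, 0, 1') ;
    fclose(fid) ;

    % Preliminary pass: read line by line and count the commas
    maxNumberOfElementPerLine = 0 ;
    fid = fopen(sampleFile,'r') ;
    currentLine = fgetl(fid) ;
    while ischar(currentLine)   % fgetl returns -1 (not a char) at end of file
        nFields = sum(currentLine == ',') + 1 ;   % fields = commas + 1
        maxNumberOfElementPerLine = max(maxNumberOfElementPerLine , nFields) ;
        currentLine = fgetl(fid) ;
    end
    fclose(fid) ;

    delete(sampleFile) ;   % clean up the temporary file
    ```

    The resulting maxNumberOfElementPerLine can then be used to build readFormat with repmat as shown earlier. Reading the file twice has a cost, so if you already have a safe upper bound it is probably simpler to overestimate it.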