matlabcsvscanffopenlow-level-io

How do I import CSV data in to Matlab only using low level I/O commands


I am really struggling to work out how I can import CSV data which contains 9 columns and around 400 rows of data into a table in the Matlab workspace. This would be easy if I was allowed to use the built-in toolboxes Matlab has to offer, but instead, I need to try and complete the task using only commands such as fscanf, fopen, etc. The data itself is in mixed formats for each column. Decimal, float, string, etc. I am also allowed to use CSVread, but I have not managed to get that to work since from my understanding CSVread only works with numeric values.

Here is my code:

>> filename = 'datatext.csv'; %The file in CSV format
>> fid = fopen(filename); 
>> headers = fgetl(fid); %using fgetl to remove the headers and put them in a variable

I have used fgetl to skip the headers line of the file and add them to their own variable, however I am not sure where to go from this point in creating the table. Basically I want to achieve a 400 row by 9 columns table in the workspace of Matlab.

Here is a sample of a few lines of what the text file looks like:

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2

I assume I will have to use built-in conversion functions of some of the cells, which I am allowed to do. I have probably missed some important information in order to receive the correct help, but any help anyone has will be much appreciated. Thankyou.


Solution

  • The lowest level functions to read a file are, (as described in Import Text Data Files with Low-Level I/O):


    In your case, the input file is ascii, not binary, so right away we can drop the last option (fread).

    You're left with fgetl/fgets (used to read a file line by line then parse each line) and fscanf.

    You already have two answers using the line by line approach, so I won't detail this one and rather show you how you can use fscanf (since your data are suitable, they are indeed organised in a formated pattern).

    The advantage of using fscanf, is that provided you use the right formatSpec parameter, the function will be able to read the whole file in one go, instead of iterating line by line. That is true for all the numeric data in your file. We will have to do a second pass for the text elements in your last column.

    Defining the format specifiers:

    first lets define your format specifications. We will use a different format for each pass. The first pass will read all the numeric data, but will skip the text fields, while the second pass will do the opposite, ignore all numeric data and read only the text fields. The '*' character is really useful for that when defining a format specifier:

    DataFormatSpec = repmat('%d,',1,8) ;
    DataFormatSpec = [DataFormatSpec '%*s'] ; % yield:  '%d,%d,%d,%d,%d,%d,%d,%d,%*s'
    
    TextFormatSpec = repmat('%*d,',1,8) ;
    TextFormatSpec = [TextFormatSpec '%s'] ; % yield:  '%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%s'
    

    I've used %d for all columns because in your sample data I did not see the variety of numeric types you mentionned. You can easily substitute this for %f if the data requires it, and you can mix both types without problem in the same specifier (as long as they are all numeric). Just use what makes sense for each column.

    Armed with that, let's get to the file. The data are in a formated pattern after the headerline, so we first need to get past the header line before calling fscanf. We'll do that as you've done:

    %% Open file and retrieve header line
    fid = fopen( filein,'r') ;
    hdr = fgetl(fid) ;              % Retrieve the header line
    DataStartIndex = ftell(fid) ;   % save the starting index of data section
    

    The call to ftell allows us to save the position of the file pointer. After we read the header, the pointer is located right at the beginning of the data section. We save it to be able to rewind the file pointer to the same point for the second reading pass.

    Reading the numeric data:

    This is done extremely rapidly by fscanf in a simple call:

    %% First pass, read the "numeric values"
    dataArray = fscanf( fid , DataFormatSpec , [8 Inf]).' ;
    

    Notice the transpose operator .' at the end of the line. This is because fscanf populate the values it reads in a column major order, but the data in the text file are read in a line major order. The transpose operation at the end is just to have an output array the same dimension than it was in the text file.

    Now dataArray contains all your numeric data:

    >> dataArray
    dataArray =
              18           8         318         150        3436          11          70           1
              16           8         304         150        3433          12          70           1
    

    Reading the text data:

    This is were it gets slightly more complex. fscanf automatically convert text characters to their ascii value. It's easy enough to convert that back to actual characters (using the function char()). The biggest hurdle is that if we read all the texts fields in one go, they will all appear as a sucession of numbers, but there will be no way to know where each string stops and where the next starts. To overcome that, we'll do a reading line by line, but still using fscanf:

    %% Second pass, read the "text" values
    fseek(fid,DataStartIndex,'bof') ;   % Rewind file pointer to the start of data section
    nRows = size(dataArray,1) ;         % How many lines we'll need to read
    textArray = cell(nRows,1) ;         % pre allocate a cell array to receive the text column elements
    
    for iline=1:nRows
        textArray{iline,1} = char( fscanf(fid,TextFormatSpec,1).' ) ;
    end
    
    fclose(fid) ;   % Close file
    

    Notice again the use of the transpose operator .', and the use of char(). Now textArray is a cell array containing all your text fields:

    >> textArray
    textArray = 
        'sampletext'
        'sampletext2'
    

    Regrouping data set:

    Personnally, I would keep the two arrays separate as they are the most optimized containers for each type of data (double array for numeric data, and cell array for an array of strings). However, if you need to regroup them in a single data structure, you can either use a cell array:

    %% Optional, merge data into cell array
    FullArray = [num2cell(dataArray) textArray]
    FullArray = 
        [18]    [8]    [318]    [150]    [3436]    [11]    [70]    [1]    'sampletext' 
        [16]    [8]    [304]    [150]    [3433]    [12]    [70]    [1]    'sampletext2'
    

    Or you could use a table:

    %% Optional, merge data into a table
    T = array2table(dataArray) ;
    T.text = textArray ;
    T.Properties.VariableNames = [cellstr(reshape(sprintf('v%d',1:8),2,[]).') ; {'text'}] ;
    

    Which gives:

    T = 
        v1    v2    v3     v4      v5     v6    v7    v8        text     
        __    __    ___    ___    ____    __    __    __    _____________
        18    8     318    150    3436    11    70    1     'sampletext' 
        16    8     304    150    3433    12    70    1     'sampletext2'
    

    Obviously if you choose the table version, use the variable names you parsed from the header instead of the automatically generated one I used in this example.