matlaboctavedelimitertextscan

How to read comma-delimited data with some values using commas between quotes


I have a data file that includes comma-delimited data that I am trying to read into Octave. Most of the data is fine, but some includes numbers between double quotes that use commas between the quotes. Here's a sample section of data:

.123,4.2,"4,123",700,12pie
.34,4.23,602,701,23dj
.4345,4.6,"3,623,234",700,134nfg
.951,68.5,45,699,4lkj

I've been using textscan to read the data (since there's a mix of number and strings), specifying comma delimiters, and that works most of the time, but occasionally the file contains these bigger integers in quotes scattered through that column. I was able to get around one of these quoted numbers earlier in the data file because I knew where it would be, but it wasn't pretty:

sclose = textscan(fid, '%n %n', 1, 'delimiter', ',');
junk = fgetl(fid, 1);
junk = textscan(fid, '%s', 1, 'delimiter', '"');
junk = fgetl(fid, 1);
sopen = textscan(fid, '%n %s', 1, 'delimiter', ',');

I don't care about the data in that column, but because it changes size and sometimes contains the quoted with extra commas that I want to ignore, I'm struggling with how to read/skip it. Any suggestions on how to handle it?

Here's my current (ugly) approach that reads the column as a string, then uses strfind to check for a " within the string. If it's present then it reads another comma-delimited string and repeats the check until the closing " is found and then resumes reading the data.

fid = fopen('sample.txt', 'r');
for k=1:4
  expdata1(k, :) = textscan(fid, '%n %n %s', 1, 'delimiter', ',');  #read first 3 data pts
  qcheck = char(expdata1(k,3));
  idx = strfind(qcheck, '"');  #look for "
  dloc = ftell(fid);
  for l=1:4
    if isempty(idx) #if no " present, continue reading data
      break
    endif
    dloc = ftell(fid);  #save location so can return to next data point
    expdata1(k, 3) = textscan(fid, '%s', 1, 'delimiter', ',');  #if " present, read next comma segment and check for "
    qcheck = char(expdata1(k,3));
    idx = strfind(qcheck, '"');
  endfor
  fseek(fid, dloc);
  expdata2(k, :) = textscan(fid, '%n %s', 1, 'delimiter', ',');
endfor
fclose(fid);

There's gotta be a better way...


Solution

  • I see this has a matlab tag on it, are you using matlab textscan or octave?

    If in matlab, I would suggest using either readmatrix or readtable.

    Also note, the format specifier for quoted string is %q. This should be applicable to both languages even for textscan.

    Putting your sample data in data.csv, the following is possible:

    >> readtable("data.csv", 'Format','%f%f%q%d%s');
    ans =
    
      4×5 table
    
         Var1     Var2        Var3         Var4       Var5   
        ______    ____    _____________    ____    __________
    
         0.123     4.2    {'4,123'    }    700     {'12pie' }
          0.34    4.23    {'602'      }    701     {'23dj'  }
        0.4345     4.6    {'3,623,234'}    700     {'134nfg'}
         0.951    68.5    {'45'       }    699     {'4lkj'  }