I am really struggling to work out how I can import CSV data which contains 9 columns and around 400 rows of data into a table in the Matlab workspace. This would be easy if I was allowed to use the built-in toolboxes Matlab has to offer, but instead, I need to try and complete the task using only commands such as fscanf, fopen, etc. The data itself is in mixed formats for each column. Decimal, float, string, etc. I am also allowed to use CSVread, but I have not managed to get that to work since from my understanding CSVread only works with numeric values.
Here is my code:
>> filename = 'datatext.csv'; %The file in CSV format
>> fid = fopen(filename);
>> headers = fgetl(fid); %using fgetl to remove the headers and put them in a variable
I have used fgetl to skip the headers line of the file and add them to their own variable, however I am not sure where to go from this point in creating the table. Basically I want to achieve a 400 row by 9 columns table in the workspace of Matlab.
Here is a sample of a few lines of what the text file looks like:
18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2
I assume I will have to use built-in conversion functions of some of the cells, which I am allowed to do. I have probably missed some important information in order to receive the correct help, but any help anyone has will be much appreciated. Thankyou.
The lowest level functions to read a file are, (as described in Import Text Data Files with Low-Level I/O):
fscanf, which reads formatted data in a text or ASCII file; that is, a file you can view in a text editor. For more information, see Reading Data in a Formatted Pattern.
fgetl and fgets, which read one line of a file at a time, where a newline character separates each line. For more information, see Reading Data Line-by-Line.
fread, which reads a stream of data at the byte or bit level. For more information, see Import Binary Data with Low-Level I/O.
In your case, the input file is ascii, not binary, so right away we can drop the last option (fread
).
You're left with fgetl/fgets
(used to read a file line by line then parse each line) and fscanf
.
You already have two answers using the line by line approach, so I won't detail this one and rather show you how you can use fscanf
(since your data are suitable, they are indeed organised in a formated pattern).
The advantage of using fscanf
, is that provided you use the right formatSpec
parameter, the function will be able to read the whole file in one go, instead of iterating line by line.
That is true for all the numeric data in your file. We will have to do a second pass for the text elements in your last column.
first lets define your format specifications. We will use a different format for each pass. The first pass will read all the numeric data, but will skip the text fields, while the second pass will do the opposite, ignore all numeric data and read only the text fields. The '*'
character is really useful for that when defining a format specifier:
DataFormatSpec = repmat('%d,',1,8) ;
DataFormatSpec = [DataFormatSpec '%*s'] ; % yield: '%d,%d,%d,%d,%d,%d,%d,%d,%*s'
TextFormatSpec = repmat('%*d,',1,8) ;
TextFormatSpec = [TextFormatSpec '%s'] ; % yield: '%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%s'
I've used %d
for all columns because in your sample data I did not see the variety of numeric types you mentionned. You can easily substitute this for %f
if the data requires it, and you can mix both types without problem in the same specifier (as long as they are all numeric). Just use what makes sense for each column.
Armed with that, let's get to the file. The data are in a formated pattern after the headerline, so we first need to get past the header line before calling fscanf
. We'll do that as you've done:
%% Open file and retrieve header line
fid = fopen( filein,'r') ;
hdr = fgetl(fid) ; % Retrieve the header line
DataStartIndex = ftell(fid) ; % save the starting index of data section
The call to ftell
allows us to save the position of the file pointer. After we read the header, the pointer is located right at the beginning of the data section. We save it to be able to rewind the file pointer to the same point for the second reading pass.
This is done extremely rapidly by fscanf
in a simple call:
%% First pass, read the "numeric values"
dataArray = fscanf( fid , DataFormatSpec , [8 Inf]).' ;
Notice the transpose operator .'
at the end of the line. This is because fscanf
populate the values it reads in a column major order, but the data in the text file are read in a line major order. The transpose operation at the end is just to have an output array the same dimension than it was in the text file.
Now dataArray
contains all your numeric data:
>> dataArray
dataArray =
18 8 318 150 3436 11 70 1
16 8 304 150 3433 12 70 1
This is were it gets slightly more complex. fscanf
automatically convert text characters to their ascii value. It's easy enough to convert that back to actual characters (using the function char()
). The biggest hurdle is that if we read all the texts fields in one go, they will all appear as a sucession of numbers, but there will be no way to know where each string stops and where the next starts. To overcome that, we'll do a reading line by line, but still using fscanf
:
%% Second pass, read the "text" values
fseek(fid,DataStartIndex,'bof') ; % Rewind file pointer to the start of data section
nRows = size(dataArray,1) ; % How many lines we'll need to read
textArray = cell(nRows,1) ; % pre allocate a cell array to receive the text column elements
for iline=1:nRows
textArray{iline,1} = char( fscanf(fid,TextFormatSpec,1).' ) ;
end
fclose(fid) ; % Close file
Notice again the use of the transpose operator .'
, and the use of char()
. Now textArray
is a cell array containing all your text fields:
>> textArray
textArray =
'sampletext'
'sampletext2'
Personnally, I would keep the two arrays separate as they are the most optimized containers for each type of data (double
array for numeric data, and cell
array for an array of strings). However, if you need to regroup them in a single data structure, you can either use a cell array:
%% Optional, merge data into cell array
FullArray = [num2cell(dataArray) textArray]
FullArray =
[18] [8] [318] [150] [3436] [11] [70] [1] 'sampletext'
[16] [8] [304] [150] [3433] [12] [70] [1] 'sampletext2'
Or you could use a table
:
%% Optional, merge data into a table
T = array2table(dataArray) ;
T.text = textArray ;
T.Properties.VariableNames = [cellstr(reshape(sprintf('v%d',1:8),2,[]).') ; {'text'}] ;
Which gives:
T =
v1 v2 v3 v4 v5 v6 v7 v8 text
__ __ ___ ___ ____ __ __ __ _____________
18 8 318 150 3436 11 70 1 'sampletext'
16 8 304 150 3433 12 70 1 'sampletext2'
Obviously if you choose the table version, use the variable names you parsed from the header instead of the automatically generated one I used in this example.