python numpy file-import

Excluding certain rows while importing data with Numpy


I am generating data sets from experiments. I end up with CSV data sets that are typically n x 4 (n rows, with n > 1000, and 4 columns). However, due to an artifact of the data-collection process, the first couple of rows and the last couple of rows typically have only 2 or 3 columns. So a data set looks like:

8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
....
....
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,

As you can see, the first two rows and the last three don't have the 4 columns that I expect. When I try importing this file using np.loadtxt(filename, delimiter = ','), I get an error. Once I remove the rows which have fewer than 4 columns (first 2 rows, and last 3 rows, in this case), the import works fine.

Two questions:

  1. Why doesn't the usual import work? I am not sure what exactly the error is. In other words, why is it a problem when the rows don't all have the same number of columns?

  2. As a workaround, I know how to ignore the first two rows while importing the file with np.loadtxt(filename, skiprows=2), but is there a simple way to also ignore a fixed number of rows at the bottom?

Note: This is NOT about finding unique rows in a numpy array. It's more about importing CSV data in which the number of columns varies from row to row.


Solution

  • Your question is similar to (possibly a duplicate of) "Using genfromtxt to import csv data with missing values in numpy".

    1) I'm not sure about why this is the default behavior.

    2) Use numpy's genfromtxt. For this you'll need to know the correct number of columns in advance.

    import numpy
    data = numpy.genfromtxt('data.csv', delimiter=',', usecols=[0,1,2,3], invalid_raise=False)
    

    Hope this helps!
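
On question 1, as far as I can tell the short version is that loadtxt builds a regular 2-D array, so every row has to supply the same number of values. Here is a rough sketch of two ways around that, using a shortened copy of the sample from the question. The names SAMPLE, load_with_genfromtxt, load_prefiltered and n_cols are made up for illustration. The first function repeats the genfromtxt call above and then drops any row that still contains NaN, because a line like "6,-2,4096," still splits into four fields and so can get through with a missing last value. The second filters the text lines before handing them to loadtxt, which also means you don't need to know how many bad rows sit at the top or bottom.

```python
import io
import warnings

import numpy as np

# Shortened stand-in for the file in the question (the "...." rows are omitted).
SAMPLE = """\
8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
14,-1,4094,14723
7,
6,-2,4096,
3,2,
"""


def load_with_genfromtxt(source, n_cols=4):
    """Call genfromtxt as in the answer, then drop rows that still contain NaN."""
    with warnings.catch_warnings():
        # invalid_raise=False reports the skipped lines as warnings; silence them here.
        warnings.simplefilter("ignore")
        raw = np.genfromtxt(source, delimiter=",",
                            usecols=list(range(n_cols)), invalid_raise=False)
    raw = np.atleast_2d(raw)
    # An empty trailing field is read as NaN, so remove any row that has one.
    return raw[~np.isnan(raw).any(axis=1)]


def load_prefiltered(source, n_cols=4):
    """Keep only lines with n_cols non-empty fields, then parse them with loadtxt."""
    good_lines = []
    for line in source:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == n_cols and all(fields):
            good_lines.append(",".join(fields))
    # loadtxt accepts a list of strings as well as a filename.
    return np.loadtxt(good_lines, delimiter=",", ndmin=2)


a = load_with_genfromtxt(io.StringIO(SAMPLE))
b = load_prefiltered(io.StringIO(SAMPLE))
print(a.shape, b.shape)
```

For a real file you would pass an open file object instead of the StringIO, e.g. `with open('data.csv') as f: data = load_prefiltered(f)`. Note that both versions silently drop incomplete rows wherever they occur, not only at the start and end, so check that this is what you want for your data.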