python file-import

Only importing rows with a specific number of columns from file in Python


I'm trying to import multiple files to my code in a for loop for analysis, but the files aren't all formatted exactly the same way (and there are too many to manually edit).

The data I need is the same in every file - 13 columns that I import as strings. Below is an example of a file:

could not open XWindow display
could not open XWindow display

No graphics display available for this session.
Graphics tasks that attempt to plot to an interactive screen will fail.

/data/poohbah/2/asassn/be/F0041-70_2645
###  JD        HJD            UT_date             IMAGE    FWHM  Diff Limit      mag    mag_err       counts   counts_err   flux(mJy)     flux_err
2456784.50841  2456784.50816  2014-05-07.0072681  interp_bf002339_coadd 2.61 -2.65 17.031      15.543  0.093          526.82        44.57   2.328        0.197       
2456789.45407  2456789.45347  2014-05-11.9529421  interp_be003585_coadd 2.26 -2.31 16.869      15.383  0.093          834.50        70.78   2.695        0.229       
2456790.47441  2456790.47419  2014-05-12.9732922  interp_bf004070_coadd 1.72 -2.25 17.246      15.721  0.090          645.67        52.82   1.974        0.162       
...
(data continues)
...
2457895.45745  2457895.45919  2017-05-21.9587133  interp_bf305499_coadd 1.71 -2.45 17.299      15.482  0.068          673.31        42.10   2.461        0.154       
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:654: RuntimeWarning: invalid value encountered in sqrt
  counts_err_a = np.sqrt( counts_a / options.gain + (area_a * bg_stdev_a **2.0 ) )
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:369: RuntimeWarning: invalid value encountered in less_equal
  no_detected = np.nonzero( (counts <= limit) & (area >= 0.01) )[0]
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:367: RuntimeWarning: divide by zero encountered in log10
  maglimit[notbad] = -2.5 * np.log10(limit[notbad]) + def_zeropt

I only need the data between the '###' line and the '/data' path at the end, and in all of the files this section is formatted exactly the same way, with 13 columns. However, the 'comments' at the beginning and end of any particular file can differ: some do not have the 'could not open XWindow display' lines, and others don't have the paths at the end. I have tried ignoring lines that start with '#' or '/', but this does nothing for the very first lines, or for lines such as ' counts_err_a' at the end of this particular example.

Is there a way to import data into Python and only take the rows that have a specific number of columns in them? In pseudo code it might look like:

open(file_name)
 if column_number = 13
   np.genfromtxt(file_name)
 else skip

Solution

  • You can't know how many columns a line has until you have counted them, so filter the file as you read it: split() each line and keep only the lines that have 13 fields. Something like the code below; you could add further checks if some of the comment lines happen to split into 13 fields as well.

    saved_lines = []
    with open(filename) as f:
        for line in f:
            if len(line.split()) == 13:
                saved_lines.append(line)
    

    Or, equivalently, as a list comprehension:

    with open(filename) as f:
        saved_lines = [line for line in f if len(line.split()) == 13]
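
One caveat with counting fields alone: in the example file, the traceback line beginning with `counts_err_a = np.sqrt(` also happens to split into exactly 13 whitespace-separated tokens, so it would pass the length check. A minimal sketch of one possible extra check is below, assuming the first column (JD) of every real data row parses as a number. The helper names `is_data_row` and `read_data_rows` are made up for illustration and are not part of any library.

```python
def is_data_row(line, n_columns=13):
    """Return True if the line has n_columns fields and a numeric first field."""
    fields = line.split()
    if len(fields) != n_columns:
        return False
    try:
        float(fields[0])  # assumption: the first column (JD) is always numeric
    except ValueError:
        return False
    return True


def read_data_rows(lines, n_columns=13):
    """Keep only the data rows, each split into a list of column strings."""
    return [line.split() for line in lines if is_data_row(line, n_columns)]


# A few lines copied from the example file in the question.
sample = [
    "could not open XWindow display\n",
    "###  JD        HJD            UT_date             IMAGE    FWHM  Diff Limit      mag    mag_err       counts   counts_err   flux(mJy)     flux_err\n",
    "2456784.50841  2456784.50816  2014-05-07.0072681  interp_bf002339_coadd 2.61 -2.65 17.031      15.543  0.093          526.82        44.57   2.328        0.197       \n",
    "  counts_err_a = np.sqrt( counts_a / options.gain + (area_a * bg_stdev_a **2.0 ) )\n",
]

rows = read_data_rows(sample)
print(len(rows), rows[0][:3])
```

To use it on a real file, pass the open file object instead of `sample`, for example `with open(filename) as f: rows = read_data_rows(f)`. If you would rather keep using `np.genfromtxt`, it accepts a list of strings as well as a file name, so you can collect the unsplit lines that pass `is_data_row` and hand that list to it.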