I'm trying to read multiple files into my code in a for loop for analysis, but the files aren't all formatted exactly the same way (and there are too many to edit manually).
The data I need is the same in every file - 13 columns that I import as strings. Below is an example of a file:
could not open XWindow display
could not open XWindow display
No graphics display available for this session.
Graphics tasks that attempt to plot to an interactive screen will fail.
/data/poohbah/2/asassn/be/F0041-70_2645
### JD HJD UT_date IMAGE FWHM Diff Limit mag mag_err counts counts_err flux(mJy) flux_err
2456784.50841 2456784.50816 2014-05-07.0072681 interp_bf002339_coadd 2.61 -2.65 17.031 15.543 0.093 526.82 44.57 2.328 0.197
2456789.45407 2456789.45347 2014-05-11.9529421 interp_be003585_coadd 2.26 -2.31 16.869 15.383 0.093 834.50 70.78 2.695 0.229
2456790.47441 2456790.47419 2014-05-12.9732922 interp_bf004070_coadd 1.72 -2.25 17.246 15.721 0.090 645.67 52.82 1.974 0.162
...
(data continues)
...
2457895.45745 2457895.45919 2017-05-21.9587133 interp_bf305499_coadd 1.71 -2.45 17.299 15.482 0.068 673.31 42.10 2.461 0.154
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:654: RuntimeWarning: invalid value encountered in sqrt
counts_err_a = np.sqrt( counts_a / options.gain + (area_a * bg_stdev_a **2.0 ) )
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:369: RuntimeWarning: invalid value encountered in less_equal
no_detected = np.nonzero( (counts <= limit) & (area >= 0.01) )[0]
/data/poohbah/1/assassin/bin/./ap_phot_im_cal_test.py:367: RuntimeWarning: divide by zero encountered in log10
maglimit[notbad] = -2.5 * np.log10(limit[notbad]) + def_zeropt
I only need the data between the '###' line and the '/data' paths at the end, and in all of the files this section is formatted exactly the same, with 13 columns. However, the 'comments' at the beginning and end of any particular file can differ. Some files do not have the 'could not open XWindow display' lines, and others don't have the paths at the end. I have tried ignoring lines that start with '#' or '/', but this does nothing for the very first lines or for lines such as 'counts_err_a' at the end of this particular example.
Is there a way to import data into Python and only take the rows that have a specific number of columns in them? In pseudo code it might look like:
open(file_name)
if column_number == 13:
    np.genfromtxt(file_name)
else:
    skip
You won't know how many columns a line has until you have counted them, so you will still have to split() each line, but you can filter the file as you read it. Something like the code below should work, and you could add other checks if the files contain a lot of comment lines, for example.
saved_lines = []
with open(filename) as f:
    for line in f:
        if len(line.split()) == 13:
            saved_lines.append(line)
Or, equivalently, as a list comprehension:
with open(filename) as f:
    saved_lines = [line for line in f if len(line.split()) == 13]
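One caveat: a column count alone can let stray lines through. In the example file, the `counts_err_a = np.sqrt( ... )` warning line also happens to split into exactly 13 whitespace-separated tokens. Below is a minimal sketch of one possible extra check, assuming the first column (JD) of every real data row parses as a number. The names `is_data_row` and `read_data_rows` are made up for illustration and are not part of any library.

```python
import io

N_COLUMNS = 13  # number of data columns, as stated in the question


def is_data_row(line, n_columns=N_COLUMNS):
    """Return True if the line has n_columns fields and starts with a number."""
    fields = line.split()
    if len(fields) != n_columns:
        return False
    try:
        float(fields[0])  # assumption: the first column (JD) is always numeric
    except ValueError:
        return False
    return True


def read_data_rows(f):
    """Collect data rows (as lists of strings) from an open file or any iterable of lines."""
    return [line.split() for line in f if is_data_row(line)]


# In-memory sample built from lines shown in the question
sample = io.StringIO(
    "could not open XWindow display\n"
    "### JD HJD UT_date IMAGE FWHM Diff Limit mag mag_err counts counts_err flux(mJy) flux_err\n"
    "2456784.50841 2456784.50816 2014-05-07.0072681 interp_bf002339_coadd "
    "2.61 -2.65 17.031 15.543 0.093 526.82 44.57 2.328 0.197\n"
    "counts_err_a = np.sqrt( counts_a / options.gain + (area_a * bg_stdev_a **2.0 ) )\n"
)
rows = read_data_rows(sample)
print(len(rows))  # 1: only the numeric data row survives both checks
```

With real files you would pass the open file object instead, e.g. `with open(filename) as f: rows = read_data_rows(f)`. Since every surviving row has the same length, `np.array(rows)` should then give a 2-D array of strings.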