I have many texts file with the given format (It won’t be exactly this format line by line; I am showing some parts from one file to understand the general format).
C:\seismo\2008\07\2008-07-03-2055-56S.HP____030
2008 7 3205556 BNJR tc 16.1 f 1.5 s/n 4.0 Q 101 corr -0.89 rms 0.18
2008 7 3205556 BNJR tc 16.1 f 3.0 s/n 2.9 Q 290 corr -0.80 rms 0.20
2008 7 3205556 BNJR tc 16.1 f 8.0 s/n 3.9 Q 695 corr -0.63 rms 0.37
2008 7 3205556 BNJR tc 16.1 f 12.0 s/n 8.1 Q 913 corr -0.67 rms 0.39
2008 7 3205556 BNJR tc 16.1 f 16.0 s/n 5.7 Q 1435 corr -0.58 rms 0.42
C:\seismo\2008\07\2008-07-03-2055-56S.HP____030
2008 7 3205556 BNJR tc 16.1 f 1.5 s/n 7.9 Q 150 corr -0.78 rms 0.19
2008 7 3205556 BNJR tc 16.1 f 3.0 s/n 5.3 Q 190 corr -0.86 rms 0.24
2008 7 3205556 BNJR tc 16.1 f 5.0 s/n 2.3 Q 401 corr -0.64 rms 0.39
2008 7 3205556 BNJR tc 16.1 f 8.0 s/n 3.1 Q 673 corr -0.65 rms 0.37
2008 7 3205556 BNJR tc 16.1 f 16.0 s/n 3.8 Q 1320 corr -0.64 rms 0.39
C:\seismo\2008\07\2008-07-24-1124-44S.HP____012
C:\seismo\2008\07\2008-07-24-1124-44S.HP____012
2008 724112444 BNJR tc 9.3 f 1.5 s/n 2.7 Q 119 corr -0.82 rms 0.21
2008 724112444 BNJR tc 9.3 f 3.0 s/n 2.3 Q 286 corr -0.68 rms 0.29
C:\seismo\2008-10-21-1507-30S.__053
C:\seismo\2008-10-21-1544-56S.__033
C:\seismo\2008-10-21-1544-56S.__033
C:\seismo\2008-10-21-1544-56S.__033
C:\seismo\2008-10-21-1742-39S.NSN___015
C:\seismo\2008-10-21-1742-39S.NSN___015
C:\seismo\2010-11-18-1111-12S.NSN___027
20101118111112 BNJR tc 20.2 f 1.5 s/n 2.6 Q 141 corr -0.79 rms 0.20
20101118111112 BNJR tc 20.2 f 3.0 s/n 6.6 Q 292 corr -0.58 rms 0.37
20101118111112 BNJR tc 20.2 f 5.0 s/n 3.4 Q 894 corr -0.54 rms 0.23
C:\seismo\2011-02-01-2130-40S.NSN___027
C:\seismo\2011-02-04-0333-36S.NSN___027
C:\seismo\2011-02-04-0333-36S.NSN___027
Which is showing the file path of certain files with their content in it, if the file doesn’t have required content, it only shows the path of the file.
The information (variables) I have marked with a red rectangle is the key information I have to search for whether the file is not listed in the above file or not. If it is listed, the content needs to extract too. I am looking for a way to extract the path and its content shown in the file respective to the information I have (red rectangle). While extracting the content I want to specifically extract the columns marked with black rectangle.
I made a function to extract lines/multiple lines with respect to a line containing specific string. Since the content following each path has different number of lines this function seems useless in my problem.
def extract_lines(file,linenumbers,endline=None):
'''Extract a line /multiple lines from a text file
line number should be considered as starting from zero.
'''
with open(file, encoding='utf8') as f:
content = f.readlines()
lines=[]
if ((type(linenumbers) is int) or (all([isinstance(item, int) for item in linenumbers]))):
if type(linenumbers) is list:
for idx,item in enumerate(linenumbers):
lines.append(content[item])
elif ((endline is None) and (type(linenumbers) is int)):
lines.append(content[linenumbers])
elif ((type(endline) is int) and (type(linenumbers) is int)):
for item in np.arange(linenumbers,endline):
lines.append(content[item])
else:
print('Error in linenumbers input')
lines=[s.replace('\t',' ') for s in lines]
lines=[s.strip('\n') for s in lines]
return lines
How to perform with this task using python?
This file has fixed columns, so you need to fetch your data using column numbers.
#0123456789-123456789-123456789-123456789-123456789-123456789-123456789-123456789-
# 2008 7 3205556 BNJR tc 16.1 f 1.5 s/n 4.0 Q 101 corr -0.89 rms 0.18
for ln in open('x.txt'):
# Is this a file line or a data line?
if ln[1] != ' ':
curfile = ln.strip()
else:
# Grab date and time.
dt = ln[2:14].replace(' ','0')
# Grab the 2-digit code.
dc = ln[14:16]
# Grab site code
site = ln[17:21]
# Grab the 'f' code.
f = float(ln[34:39].strip())
# Grab the 'Q' code.
q = int(ln[52:56].strip())
print(f"{dt},{dc},{site},{f},{q}")
Output:
200807032055,56,BNJR,1.5,10
200807032055,56,BNJR,3.0,29
200807032055,56,BNJR,8.0,69
200807032055,56,BNJR,12.0,91
200807032055,56,BNJR,16.0,143
200807032055,56,BNJR,1.5,15
200807032055,56,BNJR,3.0,19
200807032055,56,BNJR,5.0,40
200807032055,56,BNJR,8.0,67
200807032055,56,BNJR,16.0,132
200807241124,44,BNJR,1.5,11
200807241124,44,BNJR,3.0,28
201011181111,12,BNJR,1.5,14
201011181111,12,BNJR,3.0,29
201011181111,12,BNJR,5.0,89