I have a file like the following:
SCN DD1251
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
SCN DD1301
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 A DD1271 T
B
C
D
SCN DD1351
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A DD1251 D
DD1251 B
C
SCN DD1451
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
SCN DD1601
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
D
SCN GA0101
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
B GC4251 D
GC420A C GA127A S
GA127A T
SCN GA0151
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
C GA0401 R G
GA0201 D GC0051 E H
GA0401 B GA0201 W
GC0051 A
The gap between records is a newline character followed by 81 spaces.
I have created the following regular expression using regex101.com, which seems to match the gaps between records:
\s{81}\n
I combined it with the short loop below, which opens the file and then writes each section to a new file:
delimiter_pattern = re.compile(r"\s{81}\n")
with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1
However, instead of producing, say, 2.txt with the expected contents below:
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
It instead seems to produce nothing at all. I have tried modifying the code like so:
with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1
But this instead creates several hundred blank text files.
What is the issue with my code, and how could I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using regex?
You don't need a regex for this: you can detect the gap between blocks easily with the string strip() method, because a line containing only whitespace strips down to an empty string.
input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)

    if output:
        output.close()

print('-fini-')
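As for what went wrong in the original loops: a compiled pattern's match() returns a match object on success and None on failure, never the value False, so the comparison `== False` is never true and the branch that writes lines is never taken. The short sketch below demonstrates this; the two sample lines are made up for illustration, assuming a separator line of exactly 81 spaces as described in the question.

```python
import re

delimiter_pattern = re.compile(r"\s{81}\n")

# Hypothetical sample lines, modelled on the description in the question.
blank_line = " " * 81 + "\n"   # separator line: 81 spaces plus a newline
data_line = "SCN DD1271\n"     # an ordinary record line

# match() gives back a match object or None, never False.
print(delimiter_pattern.match(blank_line))          # a match object
print(delimiter_pattern.match(data_line))           # None
print(delimiter_pattern.match(data_line) == False)  # False: None is not equal to False

# Truthiness tests behave as intended, with or without the regex.
print(not delimiter_pattern.match(data_line))  # True
print(not blank_line.strip())                  # True: strip() leaves an empty string
```

Replacing `== False` with `not ...` (or `is None`) would fix the condition, although the output file would still need to be opened once per record rather than once per line.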
Another, cleaner and more modular way to implement it is to divide the processing into two independent tasks that logically have very little to do with each other: collecting the lines of each record, and writing each record to its own file.
The first can be implemented as a generator function that iteratively collects and yields the group of lines making up a record. It's the one named extract_records() below; the loop that follows it handles the second task.
input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    """Yield each record in the file as a list of its lines."""
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends a record; repeated blanks are skipped.
                yield lines
                lines = []
        if lines:  # Last record, if the file doesn't end with a blank line.
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
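If you prefer to lean on the standard library, the grouping step can also be sketched with itertools.groupby, which clusters consecutive lines by whether they are blank. This is an alternative technique, not the generator above; the function name split_records and the sample lines are hypothetical, and it takes any iterable of lines (an open file works too).

```python
from itertools import groupby

def split_records(lines):
    """Yield lists of consecutive non-blank lines from an iterable of lines."""
    # The key is True for whitespace-only lines, False for record lines.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield list(group)

# Hypothetical sample: two short records separated by a line of 81 spaces.
sample = [
    "SCN DD1251\n",
    "DD1271 C DD1271 R\n",
    " " * 81 + "\n",
    "SCN DD1271\n",
    "DD1301 T DD1301 A\n",
]

records = list(split_records(sample))
print(len(records))  # 2
```

Because a run of blank lines forms a single group that is simply skipped, this version never yields an empty record, whether the blank lines are repeated or sit at the start or end of the input.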