
How to split a text file into smaller files based on a regex pattern?


I have a file like the following:

SCN DD1251       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      C           DD1271    R                                     
        DD1351      D           DD1351    B                                     
                    E                                                           
                                                                                
SCN DD1271       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1301      T           DD1301    A                                     
        DD1251      R           DD1251    C                                     
                                                                                
SCN DD1301       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      A           DD1271    T                                     
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN DD1351       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A           DD1251    D                                     
        DD1251      B                                                           
                    C                                                           
                                                                                
SCN DD1451       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                                                                                
SCN DD1601       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN GA0101       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    B           GC4251    D                                     
        GC420A      C           GA127A    S                                     
        GA127A      T                                                           
                                                                                
SCN GA0151       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    C           GA0401    R                   G                 
        GA0201      D           GC0051    E                   H                 
        GA0401      B           GA0201    W                                     
        GC0051      A                                                           

The gap between each record is a newline character followed by 81 spaces.

Using regex101.com, I have created the following regular expression, which seems to match the gaps between records:

\s{81}\n

I combined it with the short loop below, which opens the file and is meant to write each section to a new file:

delimiter_pattern = re.compile(r"\s{81}\n")

with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1

However, instead of outputting, say, 2.txt as expected below:

SCN DD1271
            UPSTREAM               DOWNSTREAM               FILTER
          NODE     LINK          NODE    LINK                LINK
        DD1301      T           DD1301    A
        DD1251      R           DD1251    C

the code seems to produce nothing at all. I have tried modifying it like so:

with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1

But this instead creates several hundred blank text files.

What is the issue with my code, and how could I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using regex?


Solution

  • You don't need a regex for this: a separator line contains only whitespace, so calling the string strip() method on it returns an empty string, which makes the gap between blocks easy to detect.

    input_file = 'Clean-Junction-Links1.txt'
    
    with open(input_file, 'r') as file:
        i = 0
        output = None
    
        for line in file:
            if not line.strip():  # Blank line?
                if output:
                    output.close()
                output = None
            else:
                if output is None:
                    i += 1
                    print(f'Creating file "{i}.txt"')
                    output = open(f'{i}.txt','w')
                output.write(line)
    
        if output:
            output.close()
    
    print('-fini-')
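
As for what went wrong in the question's code: a compiled pattern's match() method returns either a match object or None, never False, so the test `delimiter_pattern.match(line) == False` is never true and the branch that writes the line is never taken. A minimal sketch of this, using two sample lines that are made up for illustration:

```python
import re

delimiter_pattern = re.compile(r"\s{81}\n")

record_line = "SCN DD1271       \n"   # made-up sample of a record line
blank_line = " " * 81 + "\n"          # made-up separator: 81 spaces plus a newline

for line in (record_line, blank_line):
    result = delimiter_pattern.match(line)
    # match() gives a match object or None; neither compares equal to False.
    print(repr(result), result == False, result is None)
```

Testing `if delimiter_pattern.match(line) is None:` (or `if not delimiter_pattern.match(line):`) would make the comparison behave as intended, although the file handling in the question's loop would still need the fixes shown above.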
    

    A cleaner, more modular way to implement it is to divide the processing into two independent tasks that logically have very little to do with each other:

    1. Reading the file and grouping the lines of each record together.
    2. Writing each group of lines to a separate file.

    The first can be implemented as a generator function which iteratively collects and yields groups of lines comprising a record. It's the one named extract_records() below.

    input_file = 'Clean-Junction-Links1.txt'
    
    def extract_records(filename):
        with open(filename, 'r') as file:
            lines = []
            for line in file:
                if line.strip():  # Not blank?
                    lines.append(line)
                elif lines:  # Blank line ends the current record, if there is one.
                    yield lines
                    lines = []
            if lines:
                yield lines
    
    for i, record in enumerate(extract_records(input_file), start=1):
        print(f'Creating file {i}.txt')
        with open(f'{i}.txt', 'w') as output:
            output.write(''.join(record))
    
    print('-fini-')
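
Since the question started from a regex, here is a hedged sketch of a regex-based variant: read the text in one go and split it on lines that contain only spaces or tabs. The helper name `split_records`, the `BLANK_LINE` pattern, and the shortened sample text are assumptions made for illustration; they are not part of the original question or answer.

```python
import re

# A whole line made up only of spaces/tabs (possibly empty), including its newline.
BLANK_LINE = re.compile(r"^[ \t]*\n", flags=re.MULTILINE)

def split_records(text):
    """Split text into records separated by whitespace-only lines."""
    chunks = BLANK_LINE.split(text)
    # Drop empty chunks caused by leading, trailing, or consecutive blank lines.
    return [chunk for chunk in chunks if chunk.strip()]

# Shortened, made-up sample in the same shape as the file in the question.
sample = (
    "SCN DD1251\n"
    "  DD1271  C\n"
    + " " * 81 + "\n"
    "SCN DD1271\n"
    "  DD1301  T\n"
)

records = split_records(sample)
print(len(records))  # 2
```

Each string in `records` can then be written to its own numbered file with the same enumerate() loop used above. This variant holds the whole file in memory, so the line-by-line versions remain the better fit for very large inputs.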