
How to chunk a big file by size and by a condition


I have a large text file that I split into smaller files of a fixed size. This is the code I am using:

import math
import os

numThread = 4
inputData = r'dir\example.txt'

def chunk_files():
    # Count the lines first so the file can be divided into equal parts.
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        nline = sum(1 for line in file_)
    n_thread = int(numThread)
    chunk_size = math.floor(nline / n_thread)
    j = 0
    out = None
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        for i, line in enumerate(file_):
            # Start a new chunk on the first line and at every chunk_size boundary.
            if i == 0 or (j != n_thread and i + 1 == j * chunk_size):
                if out is not None:
                    out.close()
                chunk_file = '_raw' + str(j) + '.txt'
                if os.path.isfile(chunk_file):
                    break
                out = open(chunk_file, 'w+', encoding='utf-8', errors='ignore')
                j = j + 1
            out.write(line)
            if i % 1000 == 0 and i != 0:
                print('Processing line %i ...' % i)
    # Close the last chunk, which also receives any leftover lines.
    if out is not None:
        out.close()
    print('Done.')

This is an example of the text inside the file:

190219 7:05:30 line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line2 success 
               line2 this is the 1st success process

Because the chunk size is fixed, entries get split in arbitrary places, like this:

190219 7:05:30 line3 success line3 this is the 1st success process

line3 this process need 3sec 200219 9:10:10 line2 success line2 this is the 1st success process

I need every split to fall just before a line that starts with a date and time, as matched by the regex reg = re.compile(r"\b(\d{6})(?=\s\d{1,}:\d{2}:\d{2})\b"), like this:

190219 7:05:30 line3 success line3 this is the 1st success process line3 this process need 3sec

200219 9:10:10 line2 success line2 this is the 1st success process
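
For illustration, here is a minimal check of how that pattern classifies the sample lines. It assumes each line is tested on its own with `reg.match`, which is a choice made for this sketch and not something the code above does yet.

```python
import re

# Pattern from the question: six digits followed by a time such as 7:05:30.
reg = re.compile(r"\b(\d{6})(?=\s\d{1,}:\d{2}:\d{2})\b")

lines = [
    "190219 7:05:30 line3 success ",
    "               line3 this is the 1st success process",
    "200219 9:10:10 line2 success ",
]

# reg.match only tries the start of each line, so indented
# continuation lines are not treated as the start of an entry.
starts = [bool(reg.match(line)) for line in lines]
print(starts)  # [True, False, True]
```

Only the lines reported as `True` are safe places to start a new chunk.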

I've tried the approach from "Python: regex match across file chunk boundaries", but I could not adapt it to my problem.

Can anyone help me put the regex into the chunk_files function? Thanks in advance.


Solution

  • I believe keeping things simpler would help a lot.

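    import re

    # test_str holds the text to split (for example, the contents of the file).
    all_parts = []
    part = []
    for line in test_str.splitlines():
        # A line starting with "<date> <h:mm:ss> " begins a new entry.
        if re.search(r"^\d+\s\d+:\d+:\d+\s", line):
            if part:
                all_parts.append(part)
                part = []
        part.append(line)
    # Flush the last entry once the input is exhausted.
    if part:
        all_parts.append(part)

    print(all_parts)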

    Running this on your test_str gives:

    In [37]: all_parts
    Out[37]: 
    [['190219 7:05:30 line3 success ',
      '               line3 this is the 1st success process',
      '               line3 this process need 3sec'],
     ['200219 9:10:10 line2 success ',
      '               line2 this is the 1st success process'],
     ['190219 7:05:30 line3 success ',
      '               line3 this is the 1st success process',
      '               line3 this process need 3sec'],
     ['200219 9:10:10 line2 success ',
      '               line2 this is the 1st success process'],
     ['200219 9:10:10 line2 success ',
      '               line2 this is the 1st success process',
      '               line2 this is the 1st success process',
      '               line2 this is the 1st success process',
      '               line2 this is the 1st success process',
      '               line2 this is the 1st success process',
      '               line2 this is the 1st success process']]
    

    You could then turn this code into a generator, so that a file of any size can be read line by line and each entry comes back as a list of lines.
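
A minimal sketch of that generator idea follows. The names `iter_entries`, `iter_chunks` and `min_lines` are hypothetical, chosen for this example, and the grouping rule (close a chunk once it holds at least `min_lines` lines, never in the middle of an entry) is one possible way to combine the size limit with the datetime condition.

```python
import re

# Same header pattern as in the loop above: "<digits> <h:mm:ss> ".
HEADER = re.compile(r"^\d+\s\d+:\d+:\d+\s")

def iter_entries(lines):
    """Yield one list of lines per entry; an entry starts at a header line."""
    part = []
    for line in lines:
        if HEADER.search(line) and part:
            yield part
            part = []
        part.append(line)
    if part:
        yield part

def iter_chunks(entries, min_lines):
    """Group whole entries into chunks of at least min_lines lines each."""
    chunk, size = [], 0
    for entry in entries:
        chunk.extend(entry)
        size += len(entry)
        if size >= min_lines:
            yield chunk
            chunk, size = [], 0
    if chunk:
        # Leftover entries form a final, possibly smaller chunk.
        yield chunk

sample = [
    "190219 7:05:30 line3 success \n",
    "               line3 this is the 1st success process\n",
    "               line3 this process need 3sec\n",
    "200219 9:10:10 line2 success \n",
    "               line2 this is the 1st success process\n",
]
entries = list(iter_entries(sample))
chunks = list(iter_chunks(iter_entries(sample), min_lines=2))
```

With a real file, the open file object can be passed as `lines`, and each chunk can be written to its own output file with `out.writelines(chunk)`. Because both functions are generators, only one chunk needs to be held in memory at a time.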