
AWK to Python For Mbox


What would be the most Pythonic way of implementing this awk command in Python?

awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==500){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox

I'm using this now to split up enormous mailbox (mbox format) files.

I'm trying a recursive method right now.

def chunkUp(mbox, chunk=0):
    with open(mbox, 'r') as bigfile:
        msg = 0
        for line in bigfile:
            if msg == 0: 
                with open("./TestChunks/chunks/chunk_"+str(chunk)+".txt", "a+") as cf:
                    if line.startswith("From "): msg += 1
                    cf.write(line)
                    if msg > 20: chunkUp(mbox, chunk+1)

I would love to implement this in Python and be able to resume progress if it is interrupted. I'm working on that bit now.

I'm tying my brain into knots! Cheers!


Solution

  • Your recursive approach is doomed to fail: you may end up having too many open files at once, since the with blocks of the outer calls don't exit until the recursion unwinds, and each call also re-reads the mbox from the start.

    Better to keep a single handle open and write to it, then close it and open a new one whenever a line starting with "From " is encountered.

    Also open your files in write mode, not append. The code below tries to do the minimal number of operations and tests to write each line to a file, and to close the current file and open another one when a "From " line is found (so each file holds one message). At the end, the last file is closed.

    def chunkUp(mbox):
        with open(mbox, 'r') as bigfile:
            handle = None
            chunk = 0
    
            for line in bigfile:
                if line.startswith("From "):
                    # next (or first) file: close the current one, if any
                    chunk += 1
                    if handle is not None:
                        handle.close()
                    handle = None
    
                # file was closed / first file: create a new one
                if handle is None:
                    handle = open("./TestChunks/chunks/chunk_{}.txt".format(chunk), "w")
                # write the line in the current file
                handle.write(line)
    
            # close the last file
            if handle is not None:
                handle.close()
    

    I haven't tested it, but it's simple enough that it should work. If the file doesn't start with a "From " line, all lines before the first one are stored in chunk_0.txt.
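
The code above starts a new file at every "From " line, whereas the awk command in the question groups 500 messages per file. A minimal sketch of that grouping could look like the following. The name chunk_up, the out_dir and msgs_per_chunk parameters, and the choice to read and write in binary mode (so that non-UTF-8 bytes inside a message don't raise a decoding error) are assumptions of this sketch, not part of the original answer, and it hasn't been run against real mailboxes.

```python
import os


def chunk_up(mbox_path, out_dir, msgs_per_chunk=500):
    """Split an mbox file into chunk_N.txt files of at most msgs_per_chunk messages.

    Returns the list of chunk paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk = 0            # index of the chunk file being written
    msgs_in_chunk = 0    # number of "From " lines seen in the current chunk
    handle = None
    paths = []
    try:
        # Binary mode (an assumption of this sketch): bytes are copied unchanged.
        with open(mbox_path, "rb") as bigfile:
            for line in bigfile:
                if line.startswith(b"From "):
                    # The current chunk is full: close it and move to the next one.
                    if handle is not None and msgs_in_chunk == msgs_per_chunk:
                        handle.close()
                        handle = None
                        chunk += 1
                        msgs_in_chunk = 0
                    msgs_in_chunk += 1
                # First line overall, or first line after a chunk was closed.
                if handle is None:
                    path = os.path.join(out_dir, "chunk_{}.txt".format(chunk))
                    handle = open(path, "wb")
                    paths.append(path)
                handle.write(line)
    finally:
        # Close the last chunk even if reading or writing failed part-way.
        if handle is not None:
            handle.close()
    return paths
```

Calling chunk_up("mbox", "./TestChunks/chunks") would write chunk_0.txt, chunk_1.txt, and so on. Unlike the awk one-liner, whose first chunk ends up with 499 messages because its counter resets on the 500th "From " line, this puts exactly msgs_per_chunk messages in every chunk except possibly the last. As in the code above, any lines before the first "From " line go into chunk_0.txt.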