Tags: python, python-2.7, soap, python-requests, webex

Python - split files


I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to use stream=True in my requests.post() statement and write it in chunks.

I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?

----Adding current code----

if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

# stream the response body to a temporary file in chunks
outFile = open('output/tempfile', 'wb')

for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
outFile.close()

# re-read the whole file and split it on the delimiter
f = open('output/tempfile', 'rb').read().split('\r\n\r\n')

# keep only the part we care about, then discard the temp file
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
os.remove('output/tempfile')

Solution

  • Okay, I was bored and wanted to figure out the best way to do this. It turns out that my initial approach in the comments above was overly complicated (unless you're considering some scenario where time is absolutely critical, or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.

    Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, since using a regex would require decoding each block of bytes to a string. The approach below requires no string conversions at all, operating solely on the bytes returned from requests.post() and in turn writing those same bytes to file.

    from pprint import pprint
    
    someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
    
    I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''
    
    n = 16
    # emulate a stream by creating 37 blocks of 16 bytes
    byteBlocks = [bytearray(someString[i:i+n]) for i in range(0, len(someString), n)]
    pprint(byteBlocks)
    
    # this string is present twice, but both times it is split across two bytearrays
    matchBytes = bytearray('requests.post()')
    
    # our buffer
    buff = bytearray()
    
    count = 0
    for bb in byteBlocks:
        buff += bb
        count += 1
    
        # scan every two blocks
        if (count % 2) == 0:

            if count == 2:
                start = 0
            else:
                # back up len(matchBytes) - 1 bytes into the previous window so a
                # match straddling the window boundary is not missed
                start = ((count - 2) * n) - (len(matchBytes) - 1)

            # scan only the new bytes plus that small overlap; a match found in a
            # previous pass always starts before this window, so nothing repeats
            if matchBytes in buff[start:]:
                idx = buff.index(matchBytes, start)
                print('Match starting at index:', idx, 'ending at:', idx + len(matchBytes))

    # if the stream ended on an odd block, scan the remaining tail the same way
    if (count % 2) != 0:
        start = max(((count - 1) * n) - (len(matchBytes) - 1), 0)
        if matchBytes in buff[start:]:
            idx = buff.index(matchBytes, start)
            print('Match starting at index:', idx, 'ending at:', idx + len(matchBytes))
    
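    For what it's worth, the same buffering trick can split the stream into two files as it arrives, instead of just reporting match positions. Below is a rough, untested sketch reusing byteBlocks from above; the '\r\n\r\n' delimiter and the output filenames are assumptions for illustration, and it assumes the delimiter occurs exactly once:

    # sketch: split a stream of byte blocks into two files on a one-off delimiter
    delim = bytearray('\r\n\r\n')
    splitBuff = bytearray()
    found = False

    first = open('output/part1', 'wb')   # hypothetical output names
    second = open('output/part2', 'wb')

    for bb in byteBlocks:
        splitBuff += bb
        if not found and delim in splitBuff:
            # delimiter located: everything before it belongs to file one,
            # everything after it to file two
            found = True
            cut = splitBuff.index(delim)
            first.write(splitBuff[:cut])
            second.write(splitBuff[cut + len(delim):])
            splitBuff = bytearray()
        elif found:
            second.write(splitBuff)
            splitBuff = bytearray()
        elif len(splitBuff) > len(delim) - 1:
            # flush all but a small tail that could still hold a partial delimiter
            keep = len(delim) - 1
            first.write(splitBuff[:-keep])
            splitBuff = splitBuff[-keep:]

    # flush whatever is left over
    (second if found else first).write(splitBuff)
    first.close()
    second.close()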

    Update:

    So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.

    Since you aren't actually working with the stream directly, i.e. you're handed the response object from requests.post(), you don't have to worry about chunks in the networking sense. The "chunks" that requests refers to are really just its way of dishing out the body bytes. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't tell you in advance how many bytes r.raw holds, so you're more or less forced to use the iter_content method.
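    As an aside, if you did want to pull bytes straight off the underlying urllib3 stream, it would look roughly like this (a sketch; note that r.raw hands back the bytes exactly as they came over the wire, so a compressed response would need decode_content=True):

    # sketch: drain the raw urllib3 stream in fixed-size reads
    while True:
        block = memFile.raw.read(1024)
        if not block:  # an empty read signals end of stream
            break
        # ...do something with block...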

    Anyway, this code should copy all the bytes from the response object into a string, and then you can search and split that string as before.

    memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
    
    match = '\r\n\r\n'
    data = ''
    
    for chunk in memFile.iter_content(chunk_size=512):
        if chunk:
            data += chunk
    
    # no temporary file this time, so there is nothing to delete;
    # just write out the piece you want
    f = data.split(match)
    
    arf = open('output/recording.arf', 'wb')
    arf.write(f[3])
    arf.close()
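
    A small refinement worth considering: on a really big response, repeatedly doing data += chunk can degrade to quadratic time, since each concatenation may copy the whole string. Collecting the chunks in a list and joining once avoids that, roughly:

    # same approach, but join once instead of growing the string chunk by chunk
    parts = []
    for chunk in memFile.iter_content(chunk_size=512):
        if chunk:
            parts.append(chunk)

    data = ''.join(parts)
    f = data.split(match)

    arf = open('output/recording.arf', 'wb')
    arf.write(f[3])
    arf.close()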