
How to separate lines of data read from a text file? Customers with their orders


I have this data in a text file. (The real file doesn't have the blank lines; I added them below for clarity.)

I am using Python3:

with open('orders.txt', 'r') as orders:
    lines = orders.readlines()

I need to loop through the lines variable that contains all the lines of the data and separate the CO lines as I've spaced them. CO are customers and the lines below each CO are the orders that customer placed.

Each CO line tells us how many order lines follow it: the count is at indices 7-9 of the CO string (line[7:10] in Python). I illustrate this below.

CO77812002D10212020       <---(002)
125^LO917^11212020        <----line 1
235^IL993^11252020        <----line 2

CO77812002S10212020
125^LO917^11212020
235^IL993^11252020

CO95307005D06092019    <---(005)
194^AF977^06292019    <---line 1 
72^L223^07142019       <---line 2
370^IL993^08022019    <---line 3
258^Y337^07072019     <---line 4
253^O261^06182019     <---line 5

CO30950003D06012019
139^LM485^06272019
113^N669^06192019
249^P530^07112019
CO37501001D05252020
479^IL993^06162020

I have thought of a brute-force way of doing this, but it won't scale to much larger datasets.

Any help would be greatly appreciated!


Solution

  • You can use the standard-library fileinput module to read and modify your file "simultaneously". The in-place functionality that lets you modify a file while parsing it is actually implemented through a second, backup file. As the fileinput documentation states:

    Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (...) by default, the extension is '.bak' and it is deleted when the output file is closed.

    Therefore, you can format your file as specified this way:

    import fileinput

    num_of_orders = 0     # Orders announced by the current customer line
    orders_counter = 0    # Orders seen so far for that customer

    with fileinput.input(files=['orders.txt'], inplace=True) as orders_file:
        for line in orders_file:
            if line.startswith('CO'):    # Detect customer line
                orders_counter = 0
                num_of_orders = int(line[7:10])    # Extract number of orders
            else:
                orders_counter += 1
                # If the last order for this customer has been reached,
                # append a '\n' character to leave a blank line after it
                if orders_counter == num_of_orders:
                    line += '\n'
            # Since standard output is redirected to the file, print writes to the file
            print(line, end='')
    

    Note: this assumes that the orders file is formatted exactly as you described:

    CO...
    (order_1)
    (order_2)
    ...
    (order_i)
    CO...
    (order_1)
    ...
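
If you would rather keep the file untouched and work with the grouped data in memory, the same count-based idea can be sketched as below. This is only an illustration, not part of the approach above: the function name group_orders and the list-of-tuples result are assumptions of mine, and it relies on the order count sitting at line[7:10] as in your sample.

```python
from itertools import islice


def group_orders(lines):
    """Pair each customer (CO) line with the order lines that follow it.

    Hypothetical helper: assumes the order count is at line[7:10].
    """
    # Strip newlines and skip blank lines so spaced-out files also work
    stripped = (ln.strip() for ln in lines)
    it = (ln for ln in stripped if ln)
    groups = []
    for line in it:
        if not line.startswith('CO'):
            raise ValueError(f'expected a CO line, got {line!r}')
        count = int(line[7:10])           # Number of orders for this customer
        orders = list(islice(it, count))  # Consume exactly that many lines
        groups.append((line, orders))
    return groups


# In-memory sample taken from the question's data
sample = [
    'CO77812002D10212020\n',
    '125^LO917^11212020\n',
    '235^IL993^11252020\n',
    'CO37501001D05252020\n',
    '479^IL993^06162020\n',
]
groups = group_orders(sample)
for customer, orders in groups:
    print(customer, orders)
```

A file object can be passed in directly (for example, the object returned by open('orders.txt')), since the function only iterates over lines once. A list of tuples is used instead of a dict so that repeated customer lines are not merged.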