How to avoid creating unnecessary lists?


I keep coming across situations where I pull some information from a file or wherever, then have to massage the data to the final desired form through several steps. For example:

def insight_pull(file):
    with open(file) as in_f:
        lines = in_f.readlines()

        dirty = [line.split('    ') for line in lines]
        clean = [i[1] for i in dirty]
        cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]
        cleanest = [i[0].split() + i[1].split() for i in cleaner]


        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)

As per the example above:

    # Pull raw data from the file, splitting each line on four spaces.
    dirty = [line.split('    ') for line in lines]

    # Select the 2nd element from each nested list.
    clean = [i[1] for i in dirty]

    # Couple every 2nd element with its predecessor into a new list.
    cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]

    # Split each entry in cleaner into the final formatted list.
    cleanest = [i[0].split() + i[1].split() for i in cleaner]

Seeing as I can't put all of the edits into one line or loop (since each edit depends on the edit before it), is there a better way to structure code like this?

Apologies if the question is a bit vague. Any input is much appreciated.


Solution

  • Generator expressions

    You are correct in not wanting to create multiple lists. Each of your list comprehensions builds an entire new list, which wastes memory, and you then loop over every one of those lists again!

    @VPfB's idea of using generators is a good solution if you have other places in your code where the generators can be reused. If you don't need to reuse them, use generator expressions.
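
    If the steps do need to be reused, the generator-function approach might look roughly like the sketch below. The function names (`second_fields`, `pairs`, `merged_words`), the `sep` parameter, and the sample data are made up for illustration and are not taken from @VPfB's answer.

```python
def second_fields(lines, sep='    '):
    """Yield the second sep-delimited field of each line."""
    for line in lines:
        yield line.split(sep)[1]


def pairs(items):
    """Yield consecutive items two at a time: (a, b), (c, d), ..."""
    it = iter(items)
    yield from zip(it, it)


def merged_words(item_pairs):
    """Yield the words of both members of each pair as one list."""
    for first, second in item_pairs:
        yield first.split() + second.split()


# Hypothetical sample lines standing in for the contents of a file.
sample = ['x    a b', 'x    c', 'x    d e', 'x    f']

# Nothing runs until list() pulls values through the whole chain.
result = list(merged_words(pairs(second_fields(sample))))
print(result)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

    Each function can be imported and chained elsewhere, which is the main advantage over inline generator expressions.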

    Generator expressions are lazy, like generators, so when they are chained together, as here, nothing runs until the end: the file is read in a single pass when writelines consumes the chain.

    def insight_pull(file):
        with open(file) as in_f:
            dirty = (line.split('    ') for line in in_f)    # Could be combined with the next step
            clean = (i[1] for i in dirty)
            cleaner = (pair for pair in zip(clean,clean))    # Redundant: zip is already lazy
            cleanest = (i[0].split() + i[1].split() for i in cleaner)
    
            # Don't build a single (possibly huge) string with join
            with open("Output_File.txt", "w") as out_f:
                out_f.writelines(' '.join(i) + '\n' for i in cleanest)
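
    The `zip(clean,clean)` step may look odd, so here is a small sketch of why it pairs consecutive items: both arguments are the same iterator, so each tuple takes two values in a row from it. The `traced` helper and the sample letters are invented for the demonstration, which also shows that nothing is pulled from the source until the chain is consumed.

```python
calls = []


def traced(values):
    """Yield values unchanged, recording each one as it is actually pulled."""
    for v in values:
        calls.append(v)
        yield v


clean = (v.upper() for v in traced(['a', 'b', 'c', 'd', 'e']))
pulled_before = list(calls)      # still empty: building the chain reads nothing

# Both arguments are the SAME iterator, so zip takes two items per tuple.
paired = list(zip(clean, clean))

print(pulled_before)  # []
print(paired)         # [('A', 'B'), ('C', 'D')]
```

    Note one difference in behaviour: with an odd number of items, `zip` silently drops the unpaired last one, whereas the indexed list version in the question raises an IndexError.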
    

    I left the above as is because it directly matches your question, but you can go further:

    def insight_pull(file):
        with open(file) as in_f:
            clean = (line.split('    ')[1] for line in in_f)    # Second field, as in the question
            cleaner = zip(clean,clean)
            cleanest = (i[0].split() + i[1].split() for i in cleaner)
    
            with open("Output_File.txt", "w") as out_f:
                for words in cleanest:
                    # Each item is a list of words, so join it before writing
                    out_f.write(' '.join(words) + '\n')
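
    To try the lazy pipeline end to end, something like the following sketch can be run as is. It assumes each input line has the form `<field><four spaces><text>`; the `out_path` and `sep` parameters, the file names, and the sample contents are invented here so the example is self-contained.

```python
import os
import tempfile


def insight_pull(in_path, out_path, sep='    '):
    """Stream in_path to out_path, merging the second field of each line pair."""
    with open(in_path) as in_f, open(out_path, 'w') as out_f:
        clean = (line.split(sep)[1] for line in in_f)
        # zip over the same generator pairs consecutive lines.
        cleanest = (a.split() + b.split() for a, b in zip(clean, clean))
        out_f.writelines(' '.join(words) + '\n' for words in cleanest)


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'input.txt')
    out_path = os.path.join(tmp, 'output.txt')

    # Hypothetical sample data: an id, four spaces, then the text to keep.
    with open(in_path, 'w') as f:
        f.write('id1    alpha beta\n'
                'id2    gamma\n'
                'id3    delta\n'
                'id4    epsilon zeta\n')

    insight_pull(in_path, out_path)

    with open(out_path) as f:
        output = f.read()

print(output)
# alpha beta gamma
# delta epsilon zeta
```

    Only one input line pair is held in memory at a time, regardless of the size of the file.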