python split keyword

Split a large text file into multiple files based on a list of keywords in Python


I am new to Python and I am stuck on my homework. I am trying to split a 10,000-line text file into multiple files based on a list of keywords.

input.txt looks something like this:

 Name: Apple
 Type: Fruits
 Description:...

 Name: Orange
 Type: Fruits
 Description:...

 Name: Yellow
 Type: Colour
 Description:...

 Name: Apple
 Type: Fruits
 Description:...

 Name: Orange
 Type: Fruits
 Description:...

 Name: Yellow
 Type: Colour
 Description:...
 

Keywords:

Apple
Orange
Yellow

Expected output files:

Apple.txt

 Type: Fruits
 Description:

Orange.txt

 Type: Fruits
 Description:

Yellow.txt

 Type: Colour
 Description:

But my current code is only able to split the file when the key is 'Apple'. I am not sure how to modify it to handle a list of keywords.

key = 'Apple'

outfile = None
fno = 0
lno = 0

with open('input.txt') as infile:
    while line := infile.readline():
        lno += 1
        if outfile is None:
            fno += 1
            outfile = open(f'{fno}.txt', 'w')
        outfile.write(line)
        
        if key in line:
            print(f'"{key}" found in line {lno}')
            outfile.close()
            outfile = None
if outfile:
    outfile.close()

Edit: It should output only the first record for each keyword.


Solution

  • Here is a somewhat more idiomatic version of your code. It does not hardcode a list of keywords; it simply picks up whatever comes after Name: and writes the first record found for each name.

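    seen = set()      # keywords whose first record has already been written
    outfile = None    # file currently being written, if any

    with open('input.txt') as infile:
        for line in infile:
            if line.startswith(' Name: '):
                # The keyword is whatever follows the ' Name: ' prefix
                keyword = line[len(' Name: '):].strip()
                if keyword not in seen:
                    # First record for this keyword: start a new output file
                    outfile = open(f'{keyword}.txt', 'w')
                    seen.add(keyword)
            if outfile is not None:
                if line.strip() == '':
                    # A blank line ends the current record
                    outfile.close()
                    outfile = None
                else:
                    # Copies every line of the record, including the Name line
                    outfile.write(line)
    # Close the last file if the input did not end with a blank line
    if outfile is not None:
        outfile.close()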
    

    You were never doing anything useful with lno, but if you do want line numbers for some reason, the idiomatic way to get them is:

        for lno, line in enumerate(infile, start=1):
    

    Your sample input.txt shows a space at the beginning of each line. If that was a transcription error, adapt the ' Name: ' prefix accordingly.
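
If you do need to restrict the output to a fixed list of keywords, as in the question, the same loop can be adapted. The following is only a minimal sketch under a few assumptions: records start with a "Name: <keyword>" line and end at a blank line, as in the sample input; the Name line is left out of the output files, as in the expected output shown in the question; and the function name split_by_keywords and its out_dir parameter are hypothetical names chosen for illustration.

```python
from pathlib import Path


def split_by_keywords(input_path, keywords, out_dir="."):
    """Write the first record of each listed keyword to <keyword>.txt.

    Returns the sorted list of keywords for which a file was written.
    """
    wanted = set(keywords)
    seen = set()       # keywords already written once
    outfile = None     # file currently being written, if any
    out_dir = Path(out_dir)

    with open(input_path) as infile:
        for line in infile:
            stripped = line.strip()
            if stripped.startswith("Name:"):
                # A new record starts: finish whatever was being written.
                if outfile is not None:
                    outfile.close()
                    outfile = None
                keyword = stripped[len("Name:"):].strip()
                if keyword in wanted and keyword not in seen:
                    seen.add(keyword)
                    outfile = open(out_dir / f"{keyword}.txt", "w")
                continue  # the Name line itself is not copied
            if outfile is not None:
                if stripped == "":
                    # A blank line ends the current record.
                    outfile.close()
                    outfile = None
                else:
                    outfile.write(line)
    # Close the last file if the input did not end with a blank line.
    if outfile is not None:
        outfile.close()
    return sorted(seen)
```

Calling split_by_keywords('input.txt', ['Apple', 'Orange', 'Yellow']) would then create Apple.txt, Orange.txt and Yellow.txt in the current directory, each holding only the first matching record. Because the lines are matched after stripping whitespace, this variant does not depend on the leading space in the sample input.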