
Python: how to print lines from a text file that appear between two tags?


I've got a txt file that essentially reads like this:

line
line
line
<tag>
   info
   info
   info
</tag>
<tag>
   info
   info
   info
</tag>
line
line

I want to edit the file so that it keeps only the info lines, including the tags (which are the same in both instances), and drops the other lines. After this I'll export it as XML and load it into Excel as a table.

I've tried two variations so far, with no luck:

1

import re

with open('document.txt') as test:    
    for line in test:
        target = "<tag>(.*?)</tag>"
        res = re.findall(target, str(test))
        test.write(str(res))

This seems to just return an empty list and prints [] at the end of my document.

2

with open('document.txt') as test:
    parsing = False
    for line in test:
        with open('document.txt') as test:
            if line.startswith("<tag>"):
                parsing = True
            elif line.startswith("</tag>"):
                parsing = False
            if parsing==True:
                test.write(line)

This just messes up my document and puts text and tags in odd places,

e.g. I started with

i
<tag>j</tag>
k
<tag>l</tag>
m

as a test, and ended up with

mtag>l</tag>
>
k
<tag>l</tag>
m

I'm pretty new to Python (if you couldn't tell) so apologies if there's a pretty easy fix to this.

Thanks in advance.


Solution

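  • You could read the whole file into memory first, keep only the lines from each <tag> through its matching </tag>, and write the result to a separate file rather than writing back into the file while you iterate over it: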

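    # Read every line up front so the input file is closed before writing.
    with open('document.txt', 'r') as file:
        lines = file.readlines()

    output = []
    inside_tag = False

    for line in lines:
        stripped = line.strip()
        if stripped == '<tag>':
            inside_tag = True      # opening tag: start keeping lines
            output.append(line)
        elif stripped == '</tag>':
            inside_tag = False     # closing tag: keep it, then stop
            output.append(line)
        elif inside_tag:
            output.append(line)    # a line between the two tags

    # Write the kept lines to a new file instead of overwriting the input.
    with open('output.xml', 'w') as file:
        file.writelines(output)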
    

    The output.xml will contain the following:

    <tag>
       info
       info
       info
    </tag>
    <tag>
       info
       info
       info
    </tag>
    

    If you want to remove the leading whitespace before each info line, use output.append(line.strip() + '\n') instead of output.append(line).
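
The loop above only recognises tags that sit on a line of their own, so a line such as <tag>j</tag> (as in the smaller test file from the question) would be skipped. If the tags can share a line with their content, a regex over the whole text is one possible alternative. The first attempt in the question was close, but str(test) gives the description of the file object rather than its contents, and "." does not match newlines unless re.DOTALL is set. The following is only a minimal sketch: the function names extract_tag_blocks and filter_file are made up for illustration, and it assumes the tags are never nested and that the file fits in memory.

```python
import re


def extract_tag_blocks(text, tag="tag"):
    """Return every <tag>...</tag> block found in text, tags included."""
    # re.escape guards against tag names containing regex metacharacters.
    name = re.escape(tag)
    # The non-greedy .*? stops at the first closing tag; re.DOTALL lets "."
    # match newlines so a block may span several lines.
    pattern = re.compile(rf"<{name}>.*?</{name}>", re.DOTALL)
    return pattern.findall(text)


def filter_file(src="document.txt", dst="output.xml", tag="tag"):
    """Copy only the tagged blocks from src into dst, one block per line group."""
    # Read the whole input as one string (not line by line).
    with open(src) as f:
        text = f.read()
    blocks = extract_tag_blocks(text, tag)
    # Write to a separate file so the input is never modified mid-read.
    with open(dst, "w") as f:
        f.writelines(block + "\n" for block in blocks)
```

Calling filter_file() with the default arguments reads document.txt and writes output.xml, the same file names used in the answer above. Because this approach matches on the text between the tags rather than on whole lines, anything on the same line before <tag> or after </tag> is dropped as well.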