
Python Question - How to extract text between {textblock}{/textblock} of a .txt file?


I want to extract the text between {textblock_content} and {/textblock_content}.

With the script below, only the first line of introtext.txt is extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.

f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)

f.close()
r.close()

How can I solve this problem?


Solution

  • Your code actually works fine, as long as the begin and end tags are on the same line. But I suspect that is not what you wanted: it cannot read more than one block per line, and it cannot read a block that starts on one line and ends on another.

    First, take a look at the object returned by the open function. Its read method gives you the whole text at once. Also take a look at the with statement, which makes working with files easier and safer. To rewrite your code so that it reads everything between {textblock_content} and {/textblock_content}, we could write something like this:

    def get_all_tags_content(
        text: str,
        tag_begin: str = "{textblock_content}",
        tag_end: str = "{/textblock_content}"
    ) -> list[str]:
    
        useful_text = text
        ans = []
    
        # Heavy loop, could be optimized: each split copies the rest of the text,
        # so this is O(len(text) ** 2) in the worst case. We can do better.
        while tag_begin in useful_text:
            useful_text = useful_text.split(tag_begin, 1)[1]
            if tag_end not in useful_text:
                break
            block_content, useful_text = useful_text.split(tag_end, 1)
            ans.append(block_content)
        return ans
    
    
    with open("introtext.txt", "r") as f, open("textcontent.txt", "w") as r:
        # Write one extracted block per line instead of the list's repr
        r.write("\n".join(get_all_tags_content(f.read())))
    

    This function can be written more efficiently, so that it also copes with really big files. In this implementation the remaining text is copied every time a block is found, which is unnecessary and slows the program down. (Imagine millions of lines, each containing {textblock_content}"hello world"{/textblock_content}: for every line we copy the whole remaining text just to continue.) We can use a plain for loop over the text to avoid the copying. Try to solve it yourself.
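
For comparison, here is one possible single-pass version. It is only a sketch, and it uses str.find with a start offset rather than a plain for loop, so the text is never re-copied; only the extracted blocks are sliced out. The name iter_tag_contents and the sample string are made up for illustration. The default tags are the ones from the question.

```python
def iter_tag_contents(
    text,
    tag_begin="{textblock_content}",
    tag_end="{/textblock_content}",
):
    """Yield the text between each tag_begin/tag_end pair, left to right."""
    pos = 0
    while True:
        # Look for the next opening tag, starting where the last block ended
        start = text.find(tag_begin, pos)
        if start == -1:
            return
        start += len(tag_begin)
        # Look for the closing tag after the opening one
        end = text.find(tag_end, start)
        if end == -1:
            return  # unclosed block: stop, like the version above
        yield text[start:end]
        pos = end + len(tag_end)


# Hypothetical sample: two blocks, the second one spans two lines
sample = (
    "intro {textblock_content}first{/textblock_content} middle\n"
    "{textblock_content}second\nspans lines{/textblock_content} outro"
)
blocks = list(iter_tag_contents(sample))
print(blocks)  # ['first', 'second\nspans lines']
```

Because it is a generator, the blocks can be written to the output file one at a time instead of being collected in a list first. The input file is still read into memory in full with f.read().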