python python-3.x text-analysis

Extract text between two delimiters from a text file


I'm currently writing my master's thesis about CEO narcissism. To measure it, I have to run a text analysis of earnings calls. Following the answers available in this link, I wrote some Python code that is meant to extract the Question and Answer section from an earnings call transcript. The file looks like this (it's called 'testoestratto.txt'):

..............................
Delimiter [1]
..............................
A text that I don't need
..............................
Delimiter CEO [2]
..............................
I need this text
..............................
Delimiter [3]
..............................

[...]

..............................
Delimiter CEO [n-1]
..............................
I also need this text
..............................
Delimiter [n]
..............................

I also have another text file ('lista.txt') containing all the delimiters that I extracted from the transcript:

Delimiter [1]
Delimiter CEO [2]
Delimiter [3]
[...]
Delimiter CEO [n-1]
Delimiter [n]

What I'd like to do is extract the text from 'testoestratto.txt' between Delimiter CEO [2] and Delimiter [3], ..., and between Delimiter CEO [n-1] and Delimiter [n], and write it to 'test.txt'. In other words, if a delimiter from 'lista.txt' contains the word 'CEO', I need the text from 'testoestratto.txt' that lies between that delimiter and the next delimiter from 'lista.txt' that doesn't contain 'CEO'. To do so, I wrote the following code:

with open('testoestratto.txt','r', encoding='UTF-8') as infile, open('test.txt','a', encoding='UTF-8') as outfile, open('lista.txt', 'r', encoding='UTF-8') as mylist:
   text= mylist.readlines()
   text= [frase.strip('\n') for frase in text]
   bucket=[] 
   copy = False
   for i in range(len(text)):
      for line in infile:                         
          if line.strip()==text[i] and text[i].count('CEO')!=0 and text[i].count('CEO')!= -1:                                                          
              copy=True                          
          elif line.strip()== text[i+1] and text[i+1].count('CEO')==0 or text[i+1].count('CEO')==-1:
              for strings in bucket:
                  outfile.write(strings + '\n')
          elif copy:
              bucket.append(line.strip())

However, the 'test.txt' file is empty. Could you help me?

P.S.: I'm a beginner in Python, so I apologize if the code is messy.


Solution

  • There are a few things that you need to change in your code.

    Firstly, the key here is to move the file position back to the start of the file after each full pass over it. A file object is exhausted once you have iterated over it, so after the first iteration of the outer loop the nested for loop has nothing left to read and never sees the file again. You can rewind it using infile.seek(0).

    Secondly, you need to reset your "copy" flag to False once you are done writing to the file. This ensures that you don't write text that you don't need. You also need to empty your bucket at that point to avoid writing the same lines multiple times in your output.

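    Thirdly, several of the string checks in your if and elif conditions are not necessary. str.count never returns -1, so the comparisons with -1 can be dropped, and checking that the flag is set before looking for the closing delimiter is enough.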
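    To make the first point concrete, here is a minimal sketch of how a file object behaves when it is iterated twice. It uses an in-memory io.StringIO as a stand-in for the real transcript (an assumption made only so the example is self-contained); a file opened with open() behaves the same way in this respect.

```python
import io

# In-memory stand-in for 'testoestratto.txt' (illustrative only).
infile = io.StringIO("line one\nline two\n")

first_pass = [line.strip() for line in infile]   # reads both lines
second_pass = [line.strip() for line in infile]  # file is exhausted: nothing is read

infile.seek(0)                                   # rewind to the beginning
third_pass = [line.strip() for line in infile]   # reads both lines again

print(first_pass, second_pass, third_pass)
```

    The second pass is empty, which is the reason the original loop only ever compared the file against the first delimiter.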

    I have made the changes in the code below:

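    with open('testoestratto.txt', 'r', encoding='UTF-8') as infile, \
         open('test.txt', 'a', encoding='UTF-8') as outfile, \
         open('lista.txt', 'r', encoding='UTF-8') as mylist:
        text = mylist.readlines()
        text = [frase.strip('\n') for frase in text]
        # Stop one short of the end: the last delimiter has no successor,
        # so text[i+1] would raise an IndexError there.
        for i in range(len(text) - 1):
            # Start every pass over the file with a clean state.
            bucket = []
            copy = False
            for line in infile:
                if line.strip('\n') == text[i] and text[i].count('CEO') > 0:
                    # Opening delimiter found: start collecting lines.
                    copy = True
                elif copy and line.strip('\n') == text[i+1]:
                    # Closing delimiter found: write what was collected, then reset.
                    for strings in bucket:
                        outfile.write(strings + '\n')
                    copy = False
                    bucket = list()
                elif copy:
                    bucket.append(line.strip())
            # Rewind so the next delimiter is searched from the top of the file.
            infile.seek(0)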
    

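    With that being said, you can also optimize your code. It rereads the whole transcript once for every delimiter, so for d delimiters and m lines in the transcript it does roughly O(d * m) work, even though a single pass over the file would be enough.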
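    One possible optimization, given here as a sketch rather than a drop-in replacement, is to read the transcript only once and keep the delimiters in a set, so that every line is checked with one lookup. The function name extract_ceo_sections and the sample data are made up for illustration. The sketch assumes that every delimiter sits on a line of its own and that a 'CEO' delimiter directly following another 'CEO' delimiter simply keeps the copying going.

```python
def extract_ceo_sections(lines, delimiters):
    """Collect the lines between a delimiter containing 'CEO' and the
    next delimiter that does not contain 'CEO'."""
    # Ignore blank entries so that empty lines are never treated as delimiters.
    delimiter_set = {d.strip() for d in delimiters if d.strip()}
    extracted = []
    copy = False
    for raw in lines:
        line = raw.strip()
        if line in delimiter_set:
            # A delimiter switches copying on or off; it is never copied itself.
            copy = 'CEO' in line
        elif copy:
            extracted.append(line)
    return extracted


# Small in-memory example mirroring the layout shown in the question.
sep = "." * 30
transcript = [
    sep, "Delimiter [1]", sep,
    "A text that I don't need",
    sep, "Delimiter CEO [2]", sep,
    "I need this text",
    sep, "Delimiter [3]", sep,
]
delimiters = ["Delimiter [1]", "Delimiter CEO [2]", "Delimiter [3]"]

result = extract_ceo_sections(transcript, delimiters)
print(result)
```

    Like the loop-based version, this keeps the dotted separator lines that fall inside a copied section. With real files, the open file objects for 'testoestratto.txt' and 'lista.txt' can be passed in directly, since the function only iterates over its arguments, and the returned lines can then be written to 'test.txt'.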