python text nltk movie

Extract the dialogue from a movie script to count the words spoken by each character


I'm working on a project about the significance of women in movies. For that, I'm analyzing movie scripts to get the ratio of words spoken by the main male character to words spoken by the main female character.

I'm having trouble separating the spoken words from the character NAMES and the stage directions.

I thought about using regex, but I'm not familiar with it.

For example:

Mia works, photos of Hollywood icons on the wall behind her, as --

                        CUSTOMER #1
           This doesn't taste like almond milk.

                        MIA
           Don't worry, it is. I know sometimes it --

                        CUSTOMER #1
           Can I see the carton?

 Mia hands it over. The Customer looks.

                        CUSTOMER #1 (CONT'D)
           I'll have a black coffee.

I have no idea how to handle the blank line after the spoken text. Any ideas on how to reduce the complete movie script to a dialogue-only script, so that I can count the words and work with the data?

from nltk.tokenize import word_tokenize

f = open("/...//La_la_land_script.txt", "r")
script = f.read()

This loads the movie script into Python.

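def deletebraces(text):
    # remove every "(...)" span, e.g. "(CONT'D)" after a character name
    klammerauf = text.find('(')

    while klammerauf != -1:
        # search for the closing parenthesis only after the opening one,
        # so a stray ')' earlier in the text cannot cause an endless loop
        klammerzu = text.find(')', klammerauf)
        if klammerzu == -1:
            break
        text = text[:klammerauf] + text[klammerzu+1:]
        klammerauf = text.find('(')
    return text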

This function deletes everything in parentheses, such as (CONT'D).

def removing(tokens):
    # drop punctuation tokens in place; rebuilding the list avoids the
    # skipped items that occur when removing during iteration
    punctuation = {'?', '!', '.', ',', '...'}
    tokens[:] = [t for t in tokens if t not in punctuation]
    return tokens

This function removes the punctuation tokens.

def countingwords(tokens):
    # the word count is simply the number of remaining tokens
    return len(tokens)

This function counts the words.

script = deletebraces(script)

def wordsspoken(script, name):

    a = 0
    e = 0
    all = -len(name)-1

    if script.find(name)==-1:
        print("This character does not speak")

This checks whether a character with that name appears in the script.

    else:
        while(a != -1 and e != -1):

            a = script.find(name+'\n            ') + len(name)
            print(a)
            temp = script[a:]
            t = temp.split("\n")

            text = t[1]

            print(text)
            textlist = word_tokenize(text)

            removing(textlist)                

            more = countingwords(textlist)

            all = all + more

            script = script[a+e:]
            a = script.find(name +'\n           ')
            temp = script[a:]
            e = temp.find(' \n')

Here I try to extract the dialogue, but it doesn't work at all.

    print(name + " says " + str(all) + " words.")

f.close()


name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)

Solution

  • As @AdrianMcCarthy noted, the whitespace in your file has all the information you need to parse out the spoken lines. Here's one way to approach the task in Python:

    # script.txt contains the sample text you posted
    with open('script.txt', 'r', encoding='utf8') as f:
    
      # read the file content
      text = f.read()
    
      # store all the clean text that's accumulated
      spoken_text = ''
    
      # split the file into a list of strings, with each line a member in the list
      for line in text.split('\n'):
    
        # split the line into a list of words in the line
        words = line.split()
    
        # if there are no words, do nothing
        if not words:
          continue
    
        # if this line is a person identifier, do nothing
        if len(words[0]) > 1 and all([i.isupper() for i in words[0]]):
          continue
    
        # if there's a good amount of whitespace to the left, this is a spoken line
        if len(line) - len(line.lstrip()) > 4:
          spoken_text += line.strip() + ' '
    
    print(spoken_text)
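
The snippet above collects all spoken text in one string. To get per-character word counts, which is what the question ultimately asks for, the same indentation idea can be extended roughly as sketched below. The function name count_words_by_speaker and the two indentation thresholds are assumptions made for illustration (picked from the sample, where names sit at about 24 spaces and dialogue at about 11). A real script may need different values, and parenthetical acting notes inside a speech would be counted as dialogue here.

```python
import re
from collections import Counter

# Assumed thresholds, guessed from the sample in the question:
# speaker names are indented more deeply than dialogue lines.
NAME_INDENT = 20
DIALOGUE_INDENT = 5


def count_words_by_speaker(script_text):
    """Return a Counter mapping each speaker name to the words they speak."""
    counts = Counter()
    speaker = None
    for line in script_text.split('\n'):
        stripped = line.strip()
        if not stripped:
            # a blank line ends the current speech
            speaker = None
            continue
        indent = len(line) - len(line.lstrip())
        if indent >= NAME_INDENT:
            # deeply indented line: a speaker name; drop "(CONT'D)" and similar
            speaker = re.sub(r'\s*\([^)]*\)', '', stripped).strip()
        elif indent >= DIALOGUE_INDENT and speaker is not None:
            # moderately indented line after a name: dialogue;
            # count only tokens containing a letter or digit, so "--" is ignored
            counts[speaker] += sum(
                1 for word in stripped.split()
                if any(ch.isalnum() for ch in word)
            )
    return counts
```

For the sample in the question this should give 16 words for CUSTOMER #1 and 8 for MIA. The ratio between two characters is then one count divided by the other.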