I'm working on a project about the significance of women in movies, so I'm analyzing movie scripts to get the ratio of words spoken by the main male character to words spoken by the main female character.
I'm having trouble separating the spoken words from the character NAMES and the stage directions.
I thought about using regex, but I'm not familiar with it.
For example:
Mia works, photos of Hollywood icons on the wall behind her, as --
CUSTOMER #1
This doesn't taste like almond milk.
MIA
Don't worry, it is. I know sometimes it --
CUSTOMER #1
Can I see the carton?
Mia hands it over. The Customer looks.
CUSTOMER #1 (CONT'D)
I'll have a black coffee.
I don't know how to handle the blank line that follows each piece of spoken text. Any ideas on how to reduce the complete movie script to a dialogue-only script, so that I can count the words and work with the data?
from nltk.tokenize import word_tokenize
f = open("/...//La_la_land_script.txt", "r")
script = f.read()
This loads the movie script into Python.
def deletebraces (str):
    klammerauf = str.find('(')
    klammerzu = str.find(')')
    while (klammerauf != -1 and klammerzu != -1):
        if (klammerauf<klammerzu):
            str = str[:klammerauf] + str[klammerzu+1:]
        klammerauf = str.find('(')
        klammerzu = str.find(')')
    return str
This function deletes all parentheses together with the text inside them.
def removing(list):
    for i in list:
        if i == '?':
            list.remove('?')
        if i == '!':
            list.remove('!')
        if i == '.':
            list.remove('.')
        if i == ',':
            list.remove(',')
        if i == '...':
            list.remove('...')
    return list
This function removes the punctuation tokens.
def countingwords(list):
    woerter = 0
    for i in list:
        woerter = woerter + 1
    return woerter;
This function counts the words.
script = deletebraces(script)
def wordsspoken(script, name):
    a = 0
    e = 0
    all = -len(name)-1
    if script.find(name)==-1:
        print("This character does not speak")
This checks whether a character with that name appears in the script.
    else:
        while(a != -1 and e != -1):
            a = script.find(name+'\n ') + len(name)
            print(a)
            temp = script[a:]
            t = temp.split("\n")
            text = t[1]
            print(text)
            textlist = word_tokenize(text)
            removing(textlist)
            more = countingwords(textlist)
            all = all + more
            script = script[a+e:]
            a = script.find(name +'\n ')
            temp = script[a:]
            e = temp.find(' \n')
This is where I try to extract the spoken lines, but it doesn't work at all.
    print(name + " says " + str(all) + " words.")
f.close()
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
As @AdrianMcCarthy noted, the whitespace in your file has all the information you need to parse out the spoken lines. Here's one way to approach the task in Python:
import codecs

# script.txt contains the sample text you posted
with codecs.open('script.txt', 'r', 'utf8') as f:

    # read the file content
    f = f.read()

    # store all the clean text that's accumulated
    spoken_text = ''

    # split the file into a list of strings, with each line a member in the list
    for line in f.split('\n'):

        # split the line into a list of words in the line
        words = line.split()

        # if there are no words, do nothing
        if not words:
            continue

        # if this line is a person identifier, do nothing
        if len(words[0]) > 1 and all([i.isupper() for i in words[0]]):
            continue

        # if there's a good amount of whitespace to the left, this is a spoken line
        if len(line) - len(line.lstrip()) > 4:
            spoken_text += line.strip() + ' '

print(spoken_text)
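The snippet above collects all dialogue into one string. Since the question asks for a per-character word count, here is a rough sketch that extends the same whitespace idea to track who is speaking. The function name count_words_by_speaker, the min_indent parameter, and the indentation widths in the sample string are assumptions made for illustration; the real script file may be indented differently, so the threshold would need adjusting. It also assumes that any indented, all-uppercase line is a character cue, which would misclassify a line of dialogue that is written entirely in capitals.

```python
import re
from collections import Counter

def count_words_by_speaker(script_text, min_indent=5):
    """Return a Counter mapping each speaker cue to the number of words spoken.

    Assumption: stage directions have little or no indentation, while
    character cues and dialogue are indented by at least min_indent spaces.
    """
    counts = Counter()
    speaker = None  # character whose dialogue is currently being read
    for line in script_text.split('\n'):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information
        indent = len(line) - len(line.lstrip())
        if indent < min_indent:
            speaker = None  # a stage direction ends the current speech
            continue
        # drop parentheticals such as (CONT'D)
        cleaned = re.sub(r'\([^)]*\)', '', stripped).strip()
        if not cleaned:
            continue
        # an indented all-uppercase line is treated as a character cue
        if cleaned == cleaned.upper() and any(c.isalpha() for c in cleaned):
            speaker = cleaned
            continue
        if speaker is not None:
            # count runs of word characters and apostrophes; "--" is ignored
            counts[speaker] += len(re.findall(r"[\w']+", cleaned))
    return counts

# Sample from the question, with hypothetical indentation added back in
sample = (
    "Mia works, photos of Hollywood icons on the wall behind her, as --\n"
    "\n"
    "                    CUSTOMER #1\n"
    "          This doesn't taste like almond milk.\n"
    "\n"
    "                    MIA\n"
    "          Don't worry, it is. I know sometimes it --\n"
    "\n"
    "                    CUSTOMER #1\n"
    "          Can I see the carton?\n"
    "\n"
    "Mia hands it over. The Customer looks.\n"
    "\n"
    "                    CUSTOMER #1 (CONT'D)\n"
    "          I'll have a black coffee.\n"
)

counts = count_words_by_speaker(sample)
for name, total in counts.items():
    print(name, total)

# ratio of words spoken by two chosen characters
ratio = counts["MIA"] / counts["CUSTOMER #1"]
print(ratio)
```

Because the counts are keyed by the cue with parentheticals removed, "CUSTOMER #1" and "CUSTOMER #1 (CONT'D)" are added to the same total.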