
Discovering Poetic Form with NLTK and CMU Dict


Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools

I'm a linguist who has recently picked up Python, and I'm working on a project that aims to analyze poems automatically, including detecting the form of a poem. For example, if it found a 10-syllable line with the stress pattern 0101010101, it would declare the line to be iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.

I'm using the following code, part of a larger script, but I have a number of problems, which are listed below the program. corpus in the script is simply the raw text input of the poem:

import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit

...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        # Despite the name, this returns the word's stress pattern as a string
        # of digits (one per syllable), or 0 if the word is not in cmudict.
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print(a, nsyl(a), end=' ')
            sum1 = sum1 + len(str(nsyl(a)))

    print("\nTotal syllables:", sum1)

I guess that the output that I want would be like this:

1101111101

0101111001

1101010111

The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:


Solution

  • Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:

    1. Split corpus into lines
    2. For each line: find the syllable length and stress pattern.
    3. Classify stress patterns.

    You'll find that the first step is a single function call in Python:

    corpus.split("\n")
    

    and can remain in the main function, but the second step would be better placed in its own function, and the third step would itself need to be split up and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you, in place of some workshop requirement.
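The three steps above could be laid out as separate functions, roughly like this. It is only a sketch under some assumptions: PRON is a tiny hand-written stand-in for cmudict.dict() (same shape, word to list of pronunciations), all function names are made up for illustration, and the classifier only knows the two forms mentioned in the question.

```python
import re

# Hypothetical stand-in for cmudict.dict(): word -> list of pronunciations,
# each a list of ARPAbet phonemes with a stress digit on every vowel.
PRON = {
    "the": [["DH", "AH0"]],
    "cat": [["K", "AE1", "T"]],
    "below": [["B", "IH0", "L", "OW1"]],
}


def word_stress(word, pron):
    """Stress digits of the first listed pronunciation, or None if unknown."""
    entries = pron.get(word.lower())
    if not entries:
        return None
    digits = "".join(ph[-1] for ph in entries[0] if ph[-1].isdigit())
    return digits.replace("2", "1")  # treat secondary stress as stressed


def line_stress(line, pron):
    """Step 2: stress pattern of one line; unknown words are marked '?'."""
    words = re.findall(r"[A-Za-z']+", line)
    return "".join(word_stress(w, pron) or "?" for w in words)


def classify(patterns):
    """Step 3: a deliberately naive classifier, exact matches only."""
    if [len(p) for p in patterns] == [5, 7, 5]:
        return "haiku"
    if patterns and all(p == "01" * 5 for p in patterns):
        return "iambic pentameter"
    return "unknown"


def analyze(corpus, pron):
    """Step 1 (split into lines) followed by steps 2 and 3."""
    lines = [ln for ln in corpus.split("\n") if ln.strip()]
    patterns = [line_stress(ln, pron) for ln in lines]
    return patterns, classify(patterns)


print(analyze("the cat below\nbelow the cat", PRON))
```

Keeping classify separate means the exact-match tests can later be swapped for something more forgiving without touching the other two steps.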

    Now to your other questions:

    Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
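One way to keep the line breaks is to tokenize each line on its own, so that the result is a list of token lists instead of one flat list. A minimal sketch, assuming a simple regular expression is an acceptable substitute for nltk.word_tokenize (which could be called per line in the same place); tokenize_lines is a made-up name:

```python
import re

# Words made of letters, optionally with one internal apostrophe ("summer's").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize_lines(corpus):
    """Return one list of lowercase word tokens per non-empty line."""
    result = []
    for line in corpus.splitlines():
        tokens = [t.lower() for t in WORD_RE.findall(line)]
        if tokens:  # blank lines (stanza breaks) are dropped in this sketch
            result.append(tokens)
    return result


poem = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate:"
for tokens in tokenize_lines(poem):
    print(tokens)
```

If stanza breaks matter for identifying the form, the empty lists could be kept instead of skipped.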

    Words not in dictionary/errors: The CMU dictionary home page says:

    Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)

    There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly. You can also ask here in a separate question if you can't figure it out. There's bound to be someone on Stack Overflow who knows the answer or can point you to the correct resource. Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway, to improve the dictionary.
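Until a correction makes it upstream, one local workaround is to patch the dictionary in memory: cmudict.dict() returns an ordinary Python dict mapping each lowercase word to a list of pronunciations, so extra entries can be merged in. A sketch, where add_pronunciations and the sample entries are illustrative assumptions and not part of NLTK or of the CMU dictionary:

```python
def add_pronunciations(pron, extra):
    """Merge extra pronunciations into a cmudict-style dict, in place.

    Both arguments map a word to a list of pronunciations, each a list of
    ARPAbet phonemes with stress digits on the vowels, e.g. ["DH", "IY1"].
    """
    for word, entries in extra.items():
        known = pron.setdefault(word.lower(), [])
        for entry in entries:
            if entry not in known:  # avoid duplicating an existing entry
                known.append(entry)
    return pron


# Hand-written entry for an archaic word; the transcription is a guess.
custom = {"o'er": [["AO1", "R"]]}

# With NLTK and its cmudict data installed, this would be:
#     from nltk.corpus import cmudict
#     d = add_pronunciations(cmudict.dict(), custom)
# Here a one-word stand-in dict is used so the sketch runs on its own.
d = add_pronunciations({"thee": [["DH", "IY1"]]}, custom)
print(d["o'er"])
```

The patch only lasts for the current run, so the custom entries would need to live in a file of your own that is loaded at startup.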

    Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for: