
List of lists of words in Python:


I have a long list of comments (say, 50) such as this one:

"this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated".

I want to create a list of lists of words, retaining sentence tokenization, using Python.

After removing stopwords, I want a result for all 50 comments in which the sentence tokens are retained and the word tokens are kept within each tokenized sentence. In the end, I would like the result to look similar to:

list(c("disappointment", "trip"), 
     c("restaurant", "received", "good", "reviews", "expectations", "high"), 
     c("service", "slow", "even", "though", "restaurant", "full"),
     c("house", "salad", "come", "us"), 
     c("although", "tasty", "reminded", "pulled"), 
     "restaurant")  

How could I do that in Python? Is R a good option in this case? I would really appreciate your help.


Solution

  • If you do not want to create a list of stop words by hand, I would recommend using the nltk library in Python. It also handles sentence splitting (as opposed to splitting on every period). A sample that parses your sentences might look like this:

    import nltk

    # Requires the 'stopwords' corpus and the 'punkt' tokenizer, which can be
    # fetched once with nltk.download('stopwords') and nltk.download('punkt').
    stop_words = set(nltk.corpus.stopwords.words('english'))
    text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"

    # Split the text into sentences rather than on every period.
    sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sentence_detector.tokenize(text.strip())

    results = []
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence)
        # Lowercase the tokens and drop punctuation.
        words = [t.lower() for t in tokens if t.isalnum()]
        # Keep only the words that are not stop words.
        not_stop_words = tuple(w for w in words if w not in stop_words)
        results.append(not_stop_words)
    print(results)
    

    However, note that this does not give the exact same output as listed in your question, but instead looks like this:

    [('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]
    
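    One difference is purely structural: the inner collections here are tuples rather than lists. If you want a plain list of lists, as in the question, it is enough to append a list instead of a tuple. A minimal variation of the loop above (reusing the stop_words, sentences and results variables already defined):

    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence)
        words = [t.lower() for t in tokens if t.isalnum()]
        # Append a list instead of a tuple so the result is a list of lists.
        results.append([w for w in words if w not in stop_words])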

    You might need to add some stop words manually in your case if the output needs to look the same.
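
    For example, you could extend the stop word set before the loop. A minimal sketch, where the added words are purely illustrative (picked by comparing the output above with the desired result, not a fixed list you must use):

    # Illustrative only: add whatever extra words you want filtered out.
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stop_words.update({'biggest', 'could', 'sizzler'})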