python-3.x csv nlp stem

How to stem each row in a CSV file?


I have a CSV file with two columns, each containing a sentence. For example, Test.csv:

Col[1]
----------------------
This trip was amazing.

Col[2]
--------------------
The cats are playing.

So I did some NLP processing:

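import codecs
import csv
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

with codecs.open('test.csv','r', encoding='utf-8', errors='ignore') as myfile: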
     data = csv.reader(myfile, delimiter=',')
     next(data)
     stops = set(stopwords.words("english"))
     stemmer = PorterStemmer()
     for row in data:
        word_tokens1 = word_tokenize(row[1].lower())
        word_tokens2 = word_tokenize(row[2].lower())
        remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]"," ",w )]
        remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]"," ",w)]
        list1 = [w for w in remo1 if not w in stops]
        list2 = [w for w in remo2 if not w in stops]
        for w in list1:
           l = stemmer.stem(w)
           print(l)
        for w in list2:
           l2 = stemmer.stem(w)
           print(l2)

My problem is that when I stem the words and print them, I get:

trip
amazi
cat 
play

It prints each word on its own line. How can I get the stemmed words back as a sentence, like this:

Col[1]:
-------------------
trip amazi

Col[2]:
------------------- 
cat play

Solution

  • Here is a modified version of your code that produces the output you want. The most important thing you had to do was change

    for w in list1:
        l = stemmer.stem(w)
        print(l)
    for w in list2:
        l2 = stemmer.stem(w)
        print(l2)
    

    to

    stemmed_first = ""
    c = 0
    for w in list1:
        if c < len(list1)-1:
            stemmed_first += stemmer.stem(w) + " "
        else:
            stemmed_first += stemmer.stem(w)
        c += 1
    

    and doing the same for list2. I also made a few other small changes throughout your code:

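    import csv
    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))

    # Python 3: open in text mode; newline='' is what the csv module expects
    with open('test.csv', 'r', newline='', encoding='utf-8') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')

        for row in spamreader:
            if len(row) >= 2:  # skip rows that do not have two columns
                # lowercase first so capitalized stop words ("This", "The") match the list
                word_tokens1 = nltk.tokenize.word_tokenize(row[0].lower())
                word_tokens2 = nltk.tokenize.word_tokenize(row[1].lower())
                # keep only tokens made entirely of letters (drops punctuation)
                remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
                remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
                list1 = [w for w in remo1 if not w in stops]
                list2 = [w for w in remo2 if not w in stops]

                # build the first sentence: stems separated by single spaces
                stemmed_first = ""
                c = 0

                for w in list1:
                    if c < len(list1)-1:
                        stemmed_first += stemmer.stem(w) + " "
                    else:
                        stemmed_first += stemmer.stem(w)
                    c += 1

                # same for the second column
                stemmed_second = ""
                c = 0

                for w in list2:
                    if c < len(list2)-1:
                        stemmed_second += stemmer.stem(w) + " "
                    else:
                        stemmed_second += stemmer.stem(w)
                    c += 1

                print(stemmed_first)
                print(stemmed_second)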
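
As a possible simplification of the counter loops above, `" ".join(...)` can build each sentence in one step, and a small helper avoids repeating the logic per column. This is only a sketch under some assumptions: `stem_sentence`, `toy_stem`, and `toy_stops` are hypothetical names introduced here, `re.findall` stands in for `word_tokenize` plus the letters-only filter (it will split some words differently from NLTK), and the toy stemmer and stop list exist only so the example runs without NLTK.

```python
import csv
import io
import re


def stem_sentence(text, stem, stops):
    """Lowercase, keep alphabetic words, drop stop words, stem, and re-join."""
    # Assumption: re.findall replaces word_tokenize plus the letters-only filter.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # " ".join puts a single space between items, so no counter is needed.
    return " ".join(stem(w) for w in words if w not in stops)


def toy_stem(word):
    """Hypothetical stand-in for PorterStemmer().stem: strips 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tiny hand-picked stop list; with NLTK, use set(stopwords.words("english")).
toy_stops = {"this", "was", "the", "are"}

# In-memory stand-in for test.csv (one row, two columns).
sample = io.StringIO("This trip was amazing.,The cats are playing.\n")

for row in csv.reader(sample):
    # One stemmed sentence per column, however many columns the row has.
    stemmed = [stem_sentence(cell, toy_stem, toy_stops) for cell in row]
    print(stemmed)
```

With NLTK installed, the same helper should work by passing `PorterStemmer().stem` as `stem` and `set(stopwords.words("english"))` as `stops`; the exact stems will then be whatever the Porter stemmer produces rather than the toy output here.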