python list nlp turkish

Problems using snowballstemmer for a list of Turkish words in Python


I'm trying to use a library called snowballstemmer in Python, but it seems that it's not working as expected. What could the reason be? Please see my code below.

My data set:

df=[['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],['konuda', 'yardımcı', 'oluyorlar', 
   'islemlerimde']]

I applied the snowballstemmer package and imported TurkishStemmer:

  from snowballstemmer import TurkishStemmer
  turkStem=TurkishStemmer()
  data_words_nostops=[turkStem.stemWord(word) for word in df]
  data_words_nostops

  [['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
   ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']]

Unfortunately, it didn't work: the list comes back unchanged. But when I apply it to a single word, it works as expected:

 turkStem.stemWord("islemlerimde")
 'islem'

What could be the problem? Any help will be appreciated.

Thank you.


Solution

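  • Did you mean to have a list of strings instead of a list of lists containing strings? In your comprehension, `for word in df` yields each inner list, so stemWord is called on a whole list rather than on a single word.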

    I was able to get the stems for each word when I reformatted your code this way:

    from snowballstemmer import TurkishStemmer
    
    df = [
        'musteri',
        'hizmetlerine',
        'cabuk',
        'baglaniyorum',
        'konuda',
        'yardımcı',
        'oluyorlar',
        'islemlerimde'
    ]
    turkStem = TurkishStemmer()
    data_words_nostops = [turkStem.stemWord(word) for word in df]
    print(data_words_nostops)
    

    If you have a list of lists of strings (let's say it's what you've defined as df) and you want to flatten it into a single list of words, you can do something like this:

    df = [
        ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
        ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
    ]
    flattened_df = [item for sublist in df for item in sublist]
    
    # Output:
    # ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum', 'konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
    

    Credit for the above goes to this StackOverflow post.
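
As an alternative sketch of the same flattening step, the standard library's itertools.chain.from_iterable can be used instead of the nested comprehension. The name flattened_df is simply reused from the snippet above; nothing here depends on snowballstemmer.

```python
from itertools import chain

df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]

# chain.from_iterable walks each inner list in order and yields its items,
# so wrapping it in list() produces one flat list of words.
flattened_df = list(chain.from_iterable(df))
print(flattened_df)
```

The flat list can then be passed through the same `[turkStem.stemWord(word) for word in flattened_df]` comprehension shown earlier.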

    Alternatively, you could just correct the looping to address the problem with your original layout:

    from snowballstemmer import TurkishStemmer

    df = [
        ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
        ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
    ]
    turkStem = TurkishStemmer()
    all_stem_lists = []
    
    for word_group in df:
        output_stems = []
        for word in word_group:
            stem = turkStem.stemWord(word)
            output_stems.append(stem)
        all_stem_lists.append(output_stems)
    
    print(all_stem_lists)
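
The explicit loops above can also be written as a nested list comprehension that keeps the list-of-lists shape. The following is a minimal sketch, not part of the original answer: stem_groups and toy_stem are hypothetical names, and toy_stem is only a stand-in (it is not a real Turkish stemmer) so that the example runs without snowballstemmer installed.

```python
from typing import Callable, List


def stem_groups(groups: List[List[str]], stem: Callable[[str], str]) -> List[List[str]]:
    """Apply `stem` to every word while preserving the nested structure."""
    # Outer comprehension walks the groups, inner one walks the words,
    # so `stem` always receives a single string, never a list.
    return [[stem(word) for word in group] for group in groups]


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a trailing 'lar'/'ler' only."""
    return word[:-3] if word.endswith(("lar", "ler")) else word


df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]

all_stem_lists = stem_groups(df, toy_stem)
print(all_stem_lists)
```

With the real library, the assumption is that you would pass the bound method instead, e.g. `stem_groups(df, turkStem.stemWord)`, which makes the same per-word calls as the loop version.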