Tags: python, nltk, stop-words, tfidfvectorizer

Remove Stopwords in French AND English in TfidfVectorizer


I am trying to remove both French and English stopwords in TfidfVectorizer. So far, I have only managed to remove English stopwords; when I pass the French language to stop_words, I get an error saying it is not a built-in stop list.

In fact, I get the following error message:

ValueError: not a built-in stop list: french

I have a text document containing 700 lines of mixed French and English text.

I am clustering these 700 lines using Python. However, a problem arises: my clusters come out full of French stopwords, and this is hurting their quality.

My question is the following:

Is there any way to add French stopwords or manually update the built-in English stopword list so that I can get rid of these unnecessary words?

Here is the TfidfVectorizer call where I set my stopwords:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

Removing these French stopwords would let me build clusters that are representative of the words that actually recur in my document.

In case there is any doubt about the relevance of this question: I asked a related question last week, but it is not a duplicate, since that one did not involve TfidfVectorizer.

Any help would be greatly appreciated. Thank you.


Solution

  • You can use the stopword lists shipped with NLTK or spaCy, two very popular NLP libraries for Python. Since achultz has already added a snippet for the stop-words library (a sketch of that approach is included at the end of this answer), I will show how to go about it with NLTK and spaCy.

    NLTK:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Concatenate the English and French stopword lists
    final_stopwords_list = stopwords.words('english') + stopwords.words('french')
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                       max_features=200000,
                                       min_df=0.2,
                                       stop_words=final_stopwords_list,
                                       use_idf=True,
                                       tokenizer=tokenize_and_stem,
                                       ngram_range=(1, 3))
    

    NLTK will give you 334 stopwords in total.
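
    Note that NLTK's stopword corpus ships separately from the library itself; if stopwords.words() raises a LookupError, it needs a one-time download, as in this minimal sketch:

    import nltk

    # One-time download of the stopword corpus (stored under ~/nltk_data by default)
    nltk.download('stopwords')
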

    spaCy:

    from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
    from spacy.lang.en.stop_words import STOP_WORDS as en_stop
    from sklearn.feature_extraction.text import TfidfVectorizer

    # STOP_WORDS are sets, so convert them to lists before concatenating
    final_stopwords_list = list(fr_stop) + list(en_stop)
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                       max_features=200000,
                                       min_df=0.2,
                                       stop_words=final_stopwords_list,
                                       use_idf=True,
                                       tokenizer=tokenize_and_stem,
                                       ngram_range=(1, 3))
    

    spaCy gives you 890 stopwords in total.
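
    For completeness, here is how the stop-words package mentioned above fits the same pattern. This is a minimal sketch assuming the package has been installed (pip install stop-words); the question's tokenize_and_stem tokenizer is omitted to keep the snippet self-contained:

    from stop_words import get_stop_words
    from sklearn.feature_extraction.text import TfidfVectorizer

    # get_stop_words returns a plain list for each language code
    final_stopwords_list = get_stop_words('en') + get_stop_words('fr')
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                       max_features=200000,
                                       min_df=0.2,
                                       stop_words=final_stopwords_list,
                                       use_idf=True,
                                       ngram_range=(1, 3))
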