Tags: python, scikit-learn, countvectorizer

Python: ValueError on CountVectorizer. The truth value of a Series is ambiguous


I have this dataset and I'm trying to make a Bag of Words out of it using sklearn's CountVectorizer, but it throws this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I fix this?

Here's my code:

Token = df['Token']
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(Token)

count_vector.fit(Token)
count_vector.get_feature_names()
doc_array = count_vector.transform(Token).toarray()
doc_array

And this is the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-166-56827f6e68fa> in <module>()
----> 1 count_vector.fit(Token)
      2 count_vector.get_feature_names()
      3 doc_array = count_vector.transform(Token).toarray()
      4 doc_array

C:\Users\ACER\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1022         self
   1023         """
-> 1024         self.fit_transform(raw_documents)
   1025         return self
   1026 

C:\Users\ACER\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1056 
   1057         vocabulary, X = self._count_vocab(raw_documents,
-> 1058                                           self.fixed_vocabulary_)
   1059 
   1060         if self.binary:

C:\Users\ACER\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    968         for doc in raw_documents:
    969             feature_counter = {}
--> 970             for feature in analyze(doc):
    971                 try:
    972                     feature_idx = vocabulary[feature]

C:\Users\ACER\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    350                                                tokenize)
    351             return lambda doc: self._word_ngrams(
--> 352                 tokenize(preprocess(self.decode(doc))), stop_words)
    353 
    354         else:

C:\Users\ACER\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in decode(self, doc)
    130             The string to decode
    131         """
--> 132         if self.input == 'filename':
    133             with open(doc, 'rb') as fh:
    134                 doc = fh.read()

C:\Users\ACER\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py in __nonzero__(self)
   1476         raise ValueError("The truth value of a {0} is ambiguous. "
   1477                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478                          .format(self.__class__.__name__))
   1479 
   1480     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I've also tried using the clean dataset before tokenizing it, but it returns the same error.

EDIT: Here's the df output:

                                               Token    Sentimen
0   ['pilihanny', 'sekolah', 'online', 'offlineken...   0
1   ['mentang', 'sekolah', 'online', '']    0
2   ['terimakasih', 'ya', 'umat', 'kristiani', 'be...   0
3   ['sekolah', 'online', 'tambah', 'rantai', 'cov...   0
4   ['kaya', 'kakak', 'kakak', 'ni', 'jd', 'tumbal...   0
5   ['klian', 'sekolah', 'online', '', 'daring', '...   0
6   ['ga', 'raja', 'udh', 'taun', 'ak', 'sekolah',...   0
7   ['wkwkwk', 'biar', 'hemat', 'nder', 'sekolah',...   0
8   ['tau', 'tuh', 'jarang', 'gerak', 'sekolah', '...   0
9   ['saking', 'lam', 'sekolah', 'online', 'sampe'...   0
10  ['sebenernya', 'dalem', 'giat', 'tuh', 'kayak'...   1
11  ['pagi', 'telpon', 'ibuk', 'camer', 'wkwkw', '...   1
12  ['ow', 'ya', 'sekolah', 'tekan', 'banget', 'an...   1
13  ['pd', 'gak', 'takut', 'ya', 'anak', 'tular', ...   1
14  ['efek', 'anak', 'addict', 'teknologi', '', 'o...   1
15  ['pikir', 'pikir', 'mending', 'sekolah', 'onli...   1
16  ['udah', 'nyaman', 'sekolah', 'online', 'gimna...   1
17  ['ajar', '', 'saran', 'ku', 'kalo', 'udh', 'la...   1
18  ['produktif', 'sekolah', 'online', '', 'i', 'c...   1
19  ['online', 'aja', 'yg', 'beli', 'saham', 'bei'...   1
20  ['solusi', '', 'biar', 'anak', 'anak', 'sekola...   2
...

Edit 2: Added my clean dataset. I've tried passing the 'Preprocess' column to CountVectorizer instead, but it still returns the same error.

    tweet   Sentimen    Preprocess
0   Sampai sekarang pilihanny hanya sekolah online...   0   sampai sekarang pilihanny hanya sekolah online...
1   @shizunecarla mentang-mentang sekolah online :)     0   mentang-mentang sekolah online :)
2   @jehianps Terimakasih Ya Umat Kristiani berkat...   0   terimakasih ya umat kristiani berkat kalian s...
3   sekolah masih online, supaya tidak menambah ra...   0   sekolah masih online, supaya tidak menambah ra...
4   @cirokuchan kayanya seluruh kakak kakak ni jd ...   0   kayanya seluruh kakak kakak ni jd tumbal pas ...
5   klian yang sekolah online / daring waktu pel p...   0   klian yang sekolah online / daring waktu pel p...
6   @jaejenay ga keraja udh setaun ak sekolah onli...   0   ga keraja udh setaun ak sekolah online😭 km do...
7   @subtanyarl Wkwkwk biar lebih hemat lah nder, ...   0   wkwkwk biar lebih hemat lah nder, apalagi sek...
8   @gbiyel Ketauan tuh jarang bergerak, pasti sek...   0   ketauan tuh jarang bergerak, pasti sekolah on...
9   @innerchild_ug saking lamnya sekolah online sa...   0   saking lamnya sekolah online sampe lupa tangg...
10  @schfess Sebenernya didalem kegiatannya tuh ka...   1   sebenernya didalem kegiatannya tuh kayak nege...
11  Terus pagi2 ditelpon sama ibuk camer wkwkw. Mi...   1   terus pagi ditelpon sama ibuk camer wkwkw. min...
12  Ow ya, kalau dilihat2 sekolah disini menekanka...   1   ow ya, kalau dilihat sekolah disini menekankan...
13  @thiyut Kok pd gak takut ya anaknya ketularan ...   1   kok pd gak takut ya anaknya ketularan 😩.\n\ne...
14  @asti_c mungkin efeknya anak lebih addict sama...   1   mungkin efeknya anak lebih addict sama teknol...
15  dipikir pikir mending sekolah online drpd offl...   1   dipikir pikir mending sekolah online drpd offl...
16  Udah nyaman sekolah online gimna dong? T-tapi ...   1   udah nyaman sekolah online gimna dong? t-tapi ...
17  @pixshii buat belajar 😭 saran ku kalo udh lang...  1   buat belajar 😭 saran ku kalo udh langganan le...
18  Malah merasa lebih produktif sejak sekolah onl...   1   malah merasa lebih produktif sejak sekolah onl...
19  @adit_wr @BTannadi @Felicia_Putri online ajala...   1   online ajalah, mana ada lagi yg beli saham k...
20  Makanya harus ada solusi. Membiarkan anak ana...    2   makanya harus ada solusi. membiarkan anak ana...
21  @cccc0123cccc Dari pada main ga jelas. Anak an...   2   dari pada main ga jelas. anak anak di pemukim...
22  Gila nambah2in beban bgt ngurusin sekolah onli...   2   gila nambahin beban bgt ngurusin sekolah onlin...
23  Tapi sepanjang yg terlihat, anak2 sekolah onli...   2   tapi sepanjang yg terlihat, anak sekolah onlin...
24  Belasan tahun kita sering nganggep kalau "seko...   2   belasan tahun kita sering nganggep kalau "seko...
25  dipikiran guru sekolah online gini termasuk li...   2   dipikiran guru sekolah online gini termasuk li...
26  mentang mentang sekolah online, tangga merah t...   2   mentang mentang sekolah online, tangga merah t...
27  @schfess Prinsipku dari kaya gini, terlena sam...   2   prinsipku dari kaya gini, terlena sampe sma. ...
28  Italia sdh 8 bln tatap muka.\nTp khusus SD krn...   2   italia sdh bln tatap muka.\ntp khusus sd krn ...
29  2 hari lagi nak bukak sekolah, setahun cuti ni...   2   hari lagi nak bukak sekolah, setahun cuti ni ...

Edit 4: Never mind, @Luke's solution is correct. My problem was that after the preprocessing phase I saved the 'Token' column to a new csv, and when I loaded it back, the values changed from [mentang, sekolah, online, ''] to "['mentang', 'sekolah', 'online', '']" — that is, each list became a single string — and that's what was causing the solution not to work.
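For anyone who hits the same CSV round-trip problem: pandas writes a list column as its string representation, so after reloading you get strings, not lists. Assuming the reloaded column holds valid Python list literals, `ast.literal_eval` from the standard library can restore the lists before vectorizing (the one-row df below is a hypothetical sample, not the question's data):

```python
import ast
import pandas as pd

# Hypothetical sample: after a CSV round-trip, the list column
# comes back as a column of strings
df = pd.DataFrame({'Token': ["['mentang', 'sekolah', 'online', '']"]})

# Parse each string back into an actual Python list
df['Token'] = df['Token'].apply(ast.literal_eval)

print(df['Token'].iloc[0])  # ['mentang', 'sekolah', 'online', '']
```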


Solution

  • You'll need to use a custom analyzer when each row contains a list, since the default analyzers all call the preprocessor and tokenizer (which expect strings), but a custom analyzer skips that step.

    import pandas as pd
    
    # Create Sample DataFrame
    data = {'Token': [ ['mentang', 'sekolah', 'online', '']], 'Sentiment': [0]}
    df = pd.DataFrame(data)
    
    Token = df['Token']
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Custom Analyzer
    count_vector = CountVectorizer(analyzer=lambda x: x)
    
    count_vector.fit(Token)
    count_vector.get_feature_names() # Results in: ['', 'mentang', 'online', 'sekolah']
    
    doc_array = count_vector.transform(Token).toarray() # Results in: array([[1, 1, 1, 1]])
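As a side note, `get_feature_names()` has been removed in recent scikit-learn releases in favour of `get_feature_names_out()`. Below is a version-tolerant sketch of the same fix that also wraps the counts in a labelled DataFrame; the one-row `df` is a hypothetical sample in the shape of the question's data:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-row sample: each row already holds a list of tokens
df = pd.DataFrame({'Token': [['mentang', 'sekolah', 'online', '']]})

# Custom analyzer: pass each token list through unchanged
count_vector = CountVectorizer(analyzer=lambda x: x)
counts = count_vector.fit_transform(df['Token'])

# Handle both old and new scikit-learn APIs
feature_names = (count_vector.get_feature_names_out()
                 if hasattr(count_vector, 'get_feature_names_out')
                 else count_vector.get_feature_names())

# Wrap the counts so each column is a vocabulary term
bow = pd.DataFrame(counts.toarray(), columns=feature_names)
print(bow)
```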