Tags: python, pandas, dataframe, jupyter-notebook, pre-trained-model

large dataset on Jupyter notebook


I am trying to extract sentiment from a very large dataset of more than 606,912 instances in a Jupyter notebook, but it takes several days and gets interrupted. This is my code:

import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

# Load the pretrained Arabic sentiment model once.
sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

full_text = dataset['clean_text'].tolist()
sentiments = []
for e in range(len(full_text)):
    print("Iterate through list:", full_text[e])
    s = sa.predict(full_text[e])  # one prediction per row
    sentiments.append(s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = sentiments

Can someone help me solve this issue or speed up the operations?


Solution

  • It is not very efficient to process one big source dataset in a single Python instance. My recommendations are (rough sketches of both approaches follow below):

    Version 1. - use your own parallelization

    Version 2. - use an existing solution for parallelization
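
For Version 1, here is a minimal sketch using Python's built-in multiprocessing module. It assumes the same `dataset` DataFrame and CAMeL Tools model as in the question; the number of processes and the chunk size are illustrative values only. Each worker process loads its own copy of the model in an initializer, since the model object should not be shared across processes.

import multiprocessing as mp

from camel_tools.sentiment import SentimentAnalyzer

_sa = None  # per-process model instance

def _init_worker():
    # Runs once in each worker process: load a private copy of the model.
    global _sa
    _sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

def _score_chunk(texts):
    # Score one chunk of texts with this worker's model.
    return [_sa.predict(t) for t in texts]

def parallel_sentiment(texts, n_procs=4, chunk_size=1000):
    # Split the full list into chunks and distribute them over the pool.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with mp.Pool(processes=n_procs, initializer=_init_worker) as pool:
        results = pool.map(_score_chunk, chunks)
    # Flatten the per-chunk results back into one flat list.
    return [s for chunk in results for s in chunk]

if __name__ == "__main__":
    full_text = dataset['clean_text'].tolist()
    dataset['sentiments'] = parallel_sentiment(full_text)

When running this from a Jupyter notebook, the worker functions pickle more reliably if they live in a separate .py module that the notebook imports.

For Version 2, one existing option (chosen here only as an example; Dask or joblib would serve the same purpose) is pandarallel, which runs a pandas apply across several worker processes with almost no code changes. Note that the model object is copied to every worker, so memory use grows with the number of workers.

from camel_tools.sentiment import SentimentAnalyzer
from pandarallel import pandarallel

# One worker per CPU core by default; shows a per-worker progress bar.
pandarallel.initialize(progress_bar=True)

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

# parallel_apply splits the Series across the workers and scores each text.
dataset['sentiments'] = dataset['clean_text'].parallel_apply(sa.predict)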