I am trying to extract sentiment for a very large dataset of more than 606,912 instances in a Jupyter notebook, but it takes several days and gets interrupted. This is my code:
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
full_text = dataset['clean_text'].tolist()
sentiments = []  # collect one prediction per text
for e in range(len(full_text)):
    print("Iterate through list:", full_text[e])
    s = sa.predict(full_text[e])
    sentiments.append(s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = sentiments
Can someone help me solve this issue or speed up the operation?
It is not efficient to process one big source dataset in a single Python instance. My recommendations are:

Version 1. - roll your own parallelization, e.g. with the multiprocessing standard library (see the first sketch below)
Version 2. - use an existing parallelization solution, e.g. joblib, pandarallel, or Dask (see the second sketch below)
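Here is a minimal sketch of Version 1, assuming a CPU machine. Each worker process loads its own copy of the model once (in a pool initializer) and then scores whole chunks of texts; camel_tools' SentimentAnalyzer.predict accepts a list of sentences and batches internally. The helper names and the worker/chunk counts are illustrative, not part of your code or the library.

# Version 1 sketch: hand-rolled parallelization with multiprocessing.
# Note: with the spawn start method (Windows/macOS), define these
# functions in a .py module rather than directly in the notebook.
from multiprocessing import Pool
from camel_tools.sentiment import SentimentAnalyzer

_sa = None  # one model instance per worker process

def _init_worker():
    # Load the model once per process instead of once per text.
    global _sa
    _sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

def _predict_chunk(chunk):
    # predict() takes a list of sentences and scores it in batches.
    return _sa.predict(chunk)

def parallel_sentiment(texts, n_workers=4, chunk_size=1000):
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with Pool(n_workers, initializer=_init_worker) as pool:
        results = pool.map(_predict_chunk, chunks)
    # Flatten the per-chunk label lists back into one list.
    return [label for chunk in results for label in chunk]

# Usage: dataset['sentiments'] = parallel_sentiment(dataset['clean_text'].tolist())

Independent of parallelization: if I recall the camel_tools API correctly, predict() already batches a list of sentences internally, so on a GPU machine a single sa.predict(full_text) call may by itself be much faster than the per-row loop.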
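And a minimal sketch of Version 2, using joblib as the existing solution (pandarallel and Dask follow the same idea). The model is loaded lazily on a worker's first call and cached in a module-level global, so each process pays the loading cost only once; the function names and parameters are again just an assumption for illustration.

# Version 2 sketch: parallelization via joblib.
from joblib import Parallel, delayed
from camel_tools.sentiment import SentimentAnalyzer

_sa = None  # cached per worker process

def _score_chunk(chunk):
    global _sa
    if _sa is None:
        # First call in this worker: load the model once, then reuse it.
        _sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
    return _sa.predict(chunk)

def joblib_sentiment(texts, n_jobs=4, chunk_size=1000):
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    results = Parallel(n_jobs=n_jobs)(delayed(_score_chunk)(c) for c in chunks)
    return [label for chunk in results for label in chunk]

# Usage: dataset['sentiments'] = joblib_sentiment(dataset['clean_text'].tolist())

Either way, start with a small slice of the dataset (say 10,000 rows) to measure throughput per worker before committing to a multi-day run.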