I'm trying to build an ETL (extract, transform, load) pipeline in Python. I have an Amazon review dataset, but when I use the DataFrame.apply() method to apply a function that uses a regex, I get this error:
expected string or bytes-like object, got 'float'
The code I used is the following:
import pandas as pd
import pathlib
#from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
#Create the pattern for regex ETL process
pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")
def iterator_func(x):
    match = pattern.search(x[1])
    return "".join(i for i in match.groups() if i not in stop_words)
try:
    #Open the database, create a connection and upload the data to a database after the ETL process.
    with open(pathlib.Path("database\\test.csv"), encoding="utf-8") as f:
        csv_table = pd.read_csv(f, header=None)
    #Remove incorrect values from the first index, stop words and punctuation characters using regex and nltk
    csv_table[1] = csv_table.apply(iterator_func)
    csv_table[2] = csv_table[2].apply(iterator_func)
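A likely cause, sketched here under the assumption that some review cells in the CSV are empty: pandas reads an empty cell as NaN, which is a float, and the re functions only accept str or bytes. The two-row sample and the simplified letters-only pattern below are made up for illustration and are not taken from the real dataset.

```python
import io
import re

import pandas as pd

# Hypothetical two-row sample: the second row has an empty title cell.
sample_csv = io.StringIO("2,Great book,Loved every page\n1,,Did not like it\n")
sample = pd.read_csv(sample_csv, header=None)

# Simplified stand-in for the pattern used in the question.
pattern = re.compile(r"[A-Za-z]+")

# pandas stores the empty cell as NaN, which is a float.
missing_value = sample.loc[1, 1]
is_float = isinstance(missing_value, float)

# Passing that float to the regex raises a TypeError like the one reported.
try:
    pattern.search(missing_value)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(is_float, error_message)
```

Note also that DataFrame.apply() works column by column by default (axis=0), so csv_table.apply(iterator_func) passes whole columns to the function, not single rows or cells.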
Here you can download and check the dataset: Amazon reviews on Kaggle
I also tried iterating over each row manually, and it works, but I noticed that it will have serious performance issues:
for x in csv_table.index:
    #Drop rows whose label is neither 1 nor 2, then skip to the next row
    if str(csv_table.loc[x, 0]) not in ("1", "2"):
        csv_table.drop(x, inplace=True, errors="ignore")
        continue
    #TODO: Create a regex function to avoid numbers, punctuation and stop words.
    temp_phrase = "".join(i for i in pattern.findall(csv_table.loc[x, 1]) if i not in stop_words)
    temp_phrase_two = "".join(i for i in pattern.findall(csv_table.loc[x, 2]) if i not in stop_words)
    csv_table.loc[x, 1] = temp_phrase
    csv_table.loc[x, 2] = temp_phrase_two
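The row-by-row loop above could probably be replaced by a boolean filter on the label column plus Series.apply() on each text column, which avoids the repeated .loc and drop() calls. This is only a sketch: STOP_WORDS is a tiny stand-in for the nltk list, WORD_PATTERN is a word-level pattern swapped in for the original one, and clean_text and clean_reviews are hypothetical helper names.

```python
import re

import pandas as pd

# Assumption: a tiny stand-in for nltk's English stop-word list.
STOP_WORDS = {"the", "a", "is", "and", "it", "this"}
# Assumption: a word-level pattern (letters only), not the original pattern.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def clean_text(value):
    """Return the words of value without stop words, digits or punctuation."""
    if not isinstance(value, str):  # NaN or any other non-text cell
        return ""
    words = WORD_PATTERN.findall(value)
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)


def clean_reviews(table):
    """Keep rows labelled 1 or 2 and clean the title and text columns."""
    labels = table[0].astype(str)
    kept = table[labels.isin(["1", "2"])].copy()
    for column in (1, 2):
        kept[column] = kept[column].apply(clean_text)
    return kept


# Made-up rows: label, title, review text.
reviews = pd.DataFrame({
    0: [2, 1, 7],
    1: ["The best book!", None, "ignored"],
    2: ["It is great, 10/10.", "This is bad and slow", "ignored"],
})
print(clean_reviews(reviews))
```

Series.apply() still calls the Python function once per cell, so the expected gain comes mainly from dropping the per-row indexing and the in-place drop, not from the regex itself.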
I then tried converting the column to the string type, and that works fine:
csv_table[1] = csv_table[1].astype("str")
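One caveat with this cast: depending on the pandas version, missing cells may end up as the literal text "nan" or stay missing. A possible variant, sketched on a made-up two-value column, fills the missing cells with an empty string before casting so that every value is a real str.

```python
import pandas as pd

# Hypothetical title column with one missing cell.
titles = pd.Series(["Great book", None])

# Replace missing cells with an empty string first, then cast to str,
# so every value is safe to pass to the re functions.
safe_titles = titles.fillna("").astype(str)

print(safe_titles.tolist())
```

Applied to the question's table, the same idea would read csv_table[1] = csv_table[1].fillna("").astype(str).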