I am trying to read a CSV file from my S3 bucket. I would like to do some manipulation, convert the result to a dynamic dataframe, and finally write it back to S3.
This is what I have tried so far:
Pure Python:
import csv

Val1 = ""
Val2 = ""
cols = []
width = []
with open('s3://demo-ETL/read/data.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        print(row)
        if Val1 == "" and Val2 == "":
            Val1 = row[0]
            Val2 = row[0]
            cols.append(row[1])
            width.append(int(row[4]))
        else:
            pass  # continues...
Here I get an error saying it cannot find the file at all; as far as I can tell, the built-in open() only works with local filesystem paths and does not understand s3:// URLs.
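One workaround that keeps the plain-csv approach is to download the object to a local file first and open that instead. A minimal sketch, assuming a writable local path (/tmp/data.csv here is just a placeholder):

import csv

import boto3

# open() only understands local paths, so fetch the object to disk first.
s3 = boto3.client('s3')
s3.download_file('demo-ETL', 'read/data.csv', '/tmp/data.csv')

# Now the standard csv module works as usual.
with open('/tmp/data.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        print(row)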
Boto3:
import boto3

# Val1, Val2, cols and width are initialized as in the first snippet.
s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
contents = data['Body'].read()
print(contents)
for row in contents:
    if Val1 == "" and Val2 == "":
        Val1 = row[0]
        Val2 = row[0]
        cols.append(row[1])
        width.append(int(row[4]))
    else:
        pass  # continues...
Here it says the index is out of range, which seemed strange because the CSV file has four comma-separated values per row. Looking at the output of print(contents), each character ends up as its own element: get_object returns the body as a raw byte string, so iterating over contents steps through it one character at a time rather than one parsed row at a time.
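In other words, the byte string has to be decoded to text and then fed through the csv module. A minimal sketch of that fix, assuming the file is UTF-8 encoded:

import csv
import io

import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')

# Decode the raw bytes into text before parsing.
contents = data['Body'].read().decode('utf-8')

# csv.reader over an in-memory text buffer yields one list per row,
# so row[0], row[1], ... index the comma-separated fields as intended.
for row in csv.reader(io.StringIO(contents)):
    print(row)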
Is there a better way to read the CSV from S3?
I ended up solving this by reading it into a pandas DataFrame. I first created an object with boto3, read the whole object into a DataFrame with pd.read_csv, and then converted that into a list.
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('demo-ETL')
obj = bucket.Object(key='read/data.csv')

# pd.read_csv accepts the streaming body directly and handles the parsing.
dataFrame = pd.read_csv(obj.get()['Body'])
l = dataFrame.values.tolist()
for i in l:
    print(i)
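For the write-back half of the original goal, the same boto3 resource can upload the (possibly modified) DataFrame. A minimal sketch, where write/data_out.csv is a hypothetical destination key:

import io

import boto3
import pandas as pd

# Serialize the DataFrame to an in-memory CSV buffer...
csv_buffer = io.StringIO()
dataFrame.to_csv(csv_buffer, index=False)

# ...and upload it; 'write/data_out.csv' is just a placeholder key.
s3 = boto3.resource('s3')
s3.Object('demo-ETL', 'write/data_out.csv').put(Body=csv_buffer.getvalue())

If the target really is an AWS Glue DynamicFrame rather than plain CSV, the usual route inside a Glue job is to convert to a Spark DataFrame and call DynamicFrame.fromDF(spark_df, glueContext, name), but that only applies when running with a GlueContext.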