amazon-web-services amazon-s3 boto3 aws-glue

What is the best way to read CSV and text files from S3 in AWS Glue without having to read them as a dynamic dataframe?


I am trying to read a csv file that is in my S3 bucket. I would like to do some manipulations and then finally convert to a dynamic dataframe and write it back to S3.

This is what I have tried so far:

Pure Python:

    import csv

    Val1 = ""
    Val2 = ""
    cols = []
    width = []
    # open() only understands local paths, so the s3:// URL is looked up on local disk
    with open('s3://demo-ETL/read/data.csv') as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        for row in readCSV:
            print(row)
            if Val1 == "" and Val2 == "":
                Val1 = row[0]
                Val2 = row[0]
                cols.append(row[1])
                width.append(int(row[4]))
            else:
                ...  # continues

Here I get an error that says it cannot find the file in the directory at all.

Boto3:

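    import boto3

    # Val1, Val2, cols and width are initialised as in the previous snippet
    s3 = boto3.client('s3')
    data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
    contents = data['Body'].read()  # the whole object as one byte string
    print(contents)
    for row in contents:  # iterates element by element, not row by row
        if Val1 == "" and Val2 == "":
            Val1 = row[0]
            Val2 = row[0]
            cols.append(row[1])
            width.append(int(row[4]))
        else:
            ...  # continues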

Here it says the index is out of range, which is strange because I have 4 comma-separated values in the CSV file. When I look at the output of print(contents), I see that it's putting each character in a list instead of putting each comma-separated value in a list.
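
The symptom can be reproduced locally without S3. A minimal sketch, using made-up sample bytes in place of the real object body (the column names and values are assumptions, not the actual file):

```python
# Made-up stand-in for data['Body'].read(): the raw bytes of a tiny CSV.
contents = b"id,name,kind,flag,width\n1,foo,a,b,10\n"

# In Python 3, iterating over bytes yields one integer per byte.
first_items = list(contents[:3])

# Iterating over the decoded text yields one character at a time,
# never a whole comma-separated row.
text = contents.decode("utf-8")
first_chars = list(text[:3])

# Each "row" is then a single character, so any index past 0 is out of range.
row = first_chars[0]
try:
    row[1]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__

print(first_items, first_chars, error_name)
```

So the loop never sees a row; the body has to be decoded and split into lines and fields before indexing into it.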

Is there a better way to read the csv from s3?


Solution

  • I ended up solving this by reading the file as a pandas DataFrame. I first created an object reference with boto3, then read the whole object into a pandas DataFrame, which I then converted into a list of rows.

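           import boto3
           import pandas as pd

           s3 = boto3.resource('s3')
           bucket = s3.Bucket('demo-ETL')
           obj = bucket.Object(key='read/data.csv')
           # obj.get()['Body'] is a file-like stream that read_csv can consume
           dataFrame = pd.read_csv(obj.get()['Body'])
           l = dataFrame.values.tolist()  # one list per CSV row
           for i in l:
               print(i)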
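
If pandas is not available in the job, roughly the same result can be had with boto3 and the standard library alone. This is a minimal sketch, not the method used above: it assumes the file is UTF-8 encoded and small enough to fit in memory, and `parse_csv_bytes` and `read_csv_from_s3` are hypothetical helper names, not part of boto3.

```python
import csv
import io


def parse_csv_bytes(raw, encoding="utf-8"):
    """Decode raw CSV bytes and return a list of rows (each a list of strings)."""
    text = raw.decode(encoding)
    # csv.reader expects an iterable of text lines; StringIO provides one.
    return list(csv.reader(io.StringIO(text), delimiter=","))


def read_csv_from_s3(bucket, key):
    """Fetch an S3 object with boto3 and parse its body as CSV."""
    import boto3  # imported here so parse_csv_bytes works without boto3

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return parse_csv_bytes(body)


# Local check with made-up data; no S3 access is needed for this part.
sample = b"id,name\n1,foo\n2,bar\n"
rows = parse_csv_bytes(sample)
for row in rows:
    print(row)
```

Inside the job this would be called as `read_csv_from_s3('demo-ETL', 'read/data.csv')`. Unlike the pandas version, every field comes back as a string and the header row is included, so numeric columns still need an explicit `int(...)` conversion.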