python, pandas, amazon-s3, optimization, amazon-s3-select

Code optimization: read CSV from S3 and ingest back to S3 bucket


from collections import defaultdict
from io import StringIO

import pandas as pd

ddict = defaultdict(set)

file_str = query_csv_s3(s3, BUCKET_NAME, filename, sql_exp, use_header)

# read CSV to dataframe
df = pd.read_csv(StringIO(file_str))
fdf = df.drop_duplicates(subset='cleverTapId', keep='first')
fdf.dropna(inplace=True)
col_one_list = fdf['identity'].tolist()
col_two_list = fdf['cleverTapId'].tolist()
for k, v in zip(col_one_list, col_two_list):
    ddict[k].add(v)
for imkey in ddict:
    im_length = len(str(imkey))
    if im_length == 9:
        if len(ddict[imkey]) == 1:
            for value in ddict[imkey]:
                tdict = {imkey: value}
            write_to_csv(FILE_NAME, tdict)
        else:
            ctlist = list(ddict[imkey])
            snp_dict = {imkey: '|'.join(ctlist)}
            write_to_csv(SNAP_FILE_NAME, snp_dict)

    elif im_length > 0:
        if len(ddict[imkey]) == 1:
            for value in ddict[imkey]:
                fdict = {imkey: value}
            write_to_csv(FRAUD_FILE_NAME, fdict)
        else:
            pass
            # mult_ct = list(ddict[imkey])
            # mydict = {imkey: ','.join(mult_ct)}
            # write_to_csv(MY_FILENAME, mydict)
    else:
        pass

Here is write_to_csv:

import csv
import os

def write_to_csv(filename, mdict):
    file_exists = os.path.isfile(filename)
    with open(filename,'a',newline='') as csvfile:
        headers = ['IM No', 'CT ID']
        writer = csv.DictWriter(
            csvfile,
            delimiter=',',
            lineterminator='\n',
            fieldnames=headers
        )
        if not file_exists:
            writer.writeheader()
        for key in mdict:
            writer.writerow({'IM No': key, 'CT ID': mdict[key]})

I'm reading a CSV file containing 2 columns using S3 Select.

I'm generating one-to-one (1 IM : 1 CT ID), one-to-many, and many-to-many output files and uploading them back to an S3 bucket.

How can I optimize this further? It's currently taking 18 hours to process a 530 MB file read from S3 and upload the results back.


Solution

  • This is essentially a guess, because I can't run your code. The way you write data to your CSV files is extremely inefficient.

    I/O operations to SSDs or disks are among the more expensive operations in computing. Right now you open a file for every single row you want to append, write the row, and close the file again. That means for a 530 MB input file you're probably doing millions of these expensive operations.

    If you check the Performance tab in Task Manager, you'll probably see very high disk usage.

    It's much more efficient to buffer a few of these (or all if RAM is big enough) in memory and flush them to disk at the end.

    Roughly like this:

    FRAUD_FILE_CONTENTS = []

    # Computation stuff

    FRAUD_FILE_CONTENTS.append({"my": "dict"})

    # After the loop

    with open(FRAUD_FILE_NAME, "w", newline="") as csvfile:
        # Write all buffered rows to the CSV in one go
        ...
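
    Applied to your code, a minimal sketch of that idea (assuming the rows fit in memory and reusing the same 'IM No' / 'CT ID' headers from your write_to_csv; the helper name write_rows_to_csv is just illustrative) could look like this:

    import csv

    HEADERS = ['IM No', 'CT ID']

    def write_rows_to_csv(filename, rows):
        # Open the file once and write every buffered row in a single pass.
        with open(filename, 'w', newline='') as csvfile:
            writer = csv.DictWriter(
                csvfile,
                delimiter=',',
                lineterminator='\n',
                fieldnames=HEADERS
            )
            writer.writeheader()
            writer.writerows(rows)

    file_rows, snap_rows, fraud_rows = [], [], []

    # Inside your loop, append a dict instead of calling write_to_csv, e.g.:
    #     file_rows.append({'IM No': imkey, 'CT ID': value})
    #     snap_rows.append({'IM No': imkey, 'CT ID': '|'.join(ctlist)})
    #     fraud_rows.append({'IM No': imkey, 'CT ID': value})

    # After the loop, flush each buffer to disk exactly once.
    write_rows_to_csv(FILE_NAME, file_rows)
    write_rows_to_csv(SNAP_FILE_NAME, snap_rows)
    write_rows_to_csv(FRAUD_FILE_NAME, fraud_rows)

    This way each output file is opened and written exactly once per run instead of once per row, which removes most of the per-row open/close overhead.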