python oracle-database performance export bulk-operations

How to efficiently extract a large dataset from Oracle to a file?


I have an Oracle server from which I need to extract data into files using Python. These files are used by downstream systems as inputs.

A few technical details: Oracle and Python are running on different servers. The database is hosted by the client, while all scripts run on an AWS RHEL EC2 server. (The EC2 instance details were shown in a screenshot.)

My approach: To do this I chose Python's pyodbc library to connect to the remote Oracle database and extract the data with a SQL query. Below is an extract of the code that fetches the data for a given SQL query.

import csv
import time

def fetch_data_to_file(self, curr, query, dataset):
    try:
        self.logger.info('Executing query: {}'.format(query))
        start = time.time()
        curr.execute(query)
        query_time = time.time() - start
        start = time.time()
        rowcount = 0
        with open(dataset, 'a+', newline='') as f:
            writer = csv.writer(f, delimiter='|')
            # Header row taken from the cursor metadata.
            writer.writerow([i[0] for i in curr.description])
            self.logger.info('Writing file: {}'.format(dataset))
            while True:
                rows = curr.fetchmany(self.batch_limit)
                if not rows:
                    break
                # Write each fetched batch before pulling the next one.
                writer.writerows(rows)
                rowcount += len(rows)
                self.logger.info('Wrote {} rows'.format(len(rows)))

        elapsed = time.time() - start
        self.timer.info('{} Query Execution Time: {} seconds'.format(dataset, query_time))
        self.timer.info('{} File Writing Time: {} seconds, at {} rows/second'.format(dataset, elapsed, int(rowcount / elapsed)))

        self.logger.info("File written.")
    except Exception as e:
        self.error.info("Error in fetching data. Error: {}".format(e))
        raise SystemExit(1)

The dataset I am extracting is close to 8 GB uncompressed (roughly 35 million rows returned), and the code takes ~1.5 hours to download the file to my EC2 server. I tested multiple values of batch_limit and found 1-2 million rows to be the optimal batch size, but I am unsure whether there is a more systematic way to figure out what my batch size should be (one possible measurement loop is sketched below).
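For reference, a minimal benchmarking sketch, assuming an open pyodbc cursor curr; the query and candidate sizes are placeholders, and you would run it against a smaller, representative query rather than the full 35-million-row extract.

import time

def benchmark_batch_sizes(curr, query, sizes=(100_000, 500_000, 1_000_000, 2_000_000)):
    # Re-run the query for each candidate size and measure fetch throughput.
    for size in sizes:
        curr.execute(query)
        start = time.time()
        rowcount = 0
        while True:
            rows = curr.fetchmany(size)
            if not rows:
                break
            rowcount += len(rows)
        elapsed = time.time() - start
        print('batch={}: {} rows in {:.1f}s ({} rows/s)'.format(
            size, rowcount, elapsed, int(rowcount / elapsed)))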

What else I have looked into: I searched online for ways to write large datasets to files using Python, and many suggested using Pandas, but I could not figure it out (a sketch of that route is below). Also, it is important that I preserve the data and its datatypes while extracting them to files.
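For reference, a minimal sketch of the chunked-Pandas route, assuming a pyodbc connection conn; the names here are placeholders, and chunksize keeps memory bounded for an 8 GB result set.

import pandas as pd

def fetch_with_pandas(conn, query, dataset, chunksize=1_000_000):
    first = True
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        # Write the header only with the first chunk, then append.
        chunk.to_csv(dataset, sep='|', index=False,
                     header=first, mode='w' if first else 'a')
        first = False

Note that a delimited file is text either way, so datatype preservation ultimately depends on how the downstream systems parse each column.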

My ask here is: is there anything I can do to make this code more efficient? Is Python the best-suited language for this? (Please note, I need to be able to automate the jobs in whatever language I choose. Licensed libraries are difficult to adopt at this point due to internal pricing challenges at my firm.)

Also, not sure if this helps, but here is a snapshot of my memory usage while the code was downloading data (htop). [htop screenshots omitted]


Solution

  • One possibility would be to download the free "SQLcl" utility from Oracle's website, basically a Java-based SQL*Plus that does much more. What you can do with SQLcl is drop it on the client machine and use it to extract the data, letting it take care of the delimiter for you; here I'm setting the delimiter to the pipe symbol. This may be more efficient than doing it via Python, and you'd still be able to script it and call it from Python, or wherever (see the sketch after the example below).

    $ sqlcl username/password@'<hostname>:<port>/ORACLE_SID'
    > Set sqlformat delimited |
    > Spool <some file to hold the data>
    > Select * from <some table>;
    > Spool off
    

    And the above can easily be dropped into a shell script.

    #!/bin/bash
    
    sqlcl username/password@'<hostname>:<port>/ORACLE_SID' <<EOF
    Set sqlformat delimited |
    Spool <some file to hold the data>
    Select * from <some table>;
    Spool off
    EOF
    

    Example

    sqlcl> Select * from emp;
    "EMPNO"|"ENAME"|"JOB"|"MGR"|"HIREDATE"|"SAL"|"COMM"|"DEPTNO"
    7839|"KING"|"PRESIDENT"||17-NOV-81|5000||10
    7698|"BLAKE"|"MANAGER"|7839|01-MAY-81|2850||30
    7782|"CLARK"|"MANAGER"|7839|09-JUN-81|2450||10
    7566|"JONES"|"MANAGER"|7839|02-APR-81|2975||20
    7788|"SCOTT"|"ANALYST"|7566|09-DEC-82|3000||20
    7902|"FORD"|"ANALYST"|7566|03-DEC-81|3000||20
    7369|"SMITH"|"CLERK"|7902|17-DEC-80|800||20
    7499|"ALLEN"|"SALESMAN"|7698|20-FEB-81|1600|300|30
    7521|"WARD"|"SALESMAN"|7698|22-FEB-81|1250|500|30
    7654|"MARTIN"|"SALESMAN"|7698|28-SEP-81|1250|1400|30
    7844|"TURNER"|"SALESMAN"|7698|08-SEP-81|1500|0|30
    7876|"ADAMS"|"CLERK"|7788|12-JAN-83|1100||20
    7900|"JAMES"|"CLERK"|7698|03-DEC-81|950||30
    7934|"MILLER"|"CLERK"|7782|23-JAN-82|1300||10
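
    And since the jobs need to be automated, the shell script above can still be driven from Python. A minimal sketch, assuming the script was saved as extract.sh (a hypothetical path):

    import subprocess

    # 'extract.sh' is a placeholder for wherever the SQLcl script was saved.
    result = subprocess.run(['./extract.sh'], capture_output=True, text=True)
    if result.returncode != 0:
        raise SystemExit('Extract failed: {}'.format(result.stderr))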