I have a oracle server from where i need to extract data using python into files. These files are used by downstream systems as inputs.
Few technical details: Oracle and Python are running on different server. The database is client hosted while all scripts are running on an AWS RHEL EC2 server. Details of the EC2 instance are highlighted in this screenshot. .
My Approach To do this i choose Python's Pyodbc library to connect to the Remote Oracle client and extract data using the SQL query. Below is an extract of the code that fetches the data based on SQL Query provided.
def fetch_data_to_file(self,curr,query,dataset):
try:
self.logger.info('Executing query: {}'.format(query))
start = time.time()
curr.execute(query)
query_time = time.time()-start
start = time.time()
rowcount=0
with open(dataset,'a+') as f:
writer = csv.writer(f,delimiter='|')
writer.writerow([i[0] for i in curr.description])
self.logger.info('Writing file: {}'.format(dataset))
while True:
rows = curr.fetchmany(self.batch_limit)
self.logger.info('Writing {} rows'.format(self.batch_limit))
rowcount+=len(rows)
if not rows:
break
self.timer.info('{} Query Execution Time: {} seconds'.format(dataset,query_time))
self.timer.info('{} File Writing Time: {} seconds. at {} rows/second'.format(dataset,time.time()-start,int(rowcount / (time.time()-start))))
self.logger.info("File written.")
except Exception as e:
self.error.info("Error in fetching data.Error: {}".format(e))
raise SystemExit(1)
The dataset that I am extracting is close to 8GB uncompressed size (close to 35Million rows returned). And it takes the code ~1.5 hours to download the file on my EC2 server. I tested with multiple variations of batch_limits and found 1Million - 2Million to be the optimal size to batch download data however I am unsure if there is something else i can do more efficiently to figure out what my batch size should be.
What else have i looked into I was looking online to figure out ways to write large datasets to files using python, and many suggested using Pandas. I tried to figure that out but failed to do so. Also, it is important that i preserve the data and its datatypes while extracting them to files.
My ask here is: is there anything i can do better to make this code more efficient ? Is Python the best suited language for this? (Please note, i need to be able to automate the jobs whatever language i choose. Going for licensed libraries is a bit difficult at this point due to internal pricing challenges at my firm).
Also, not sure if this helps, but here is a snapshot of my memory usage while the code was downloading data (htop)
One possibility would be to download the free "SQLcl" utility, basically a Java based SQL-Plus but does much more. Download here. What you can do with SQLcl is drop it on the client machine and use it to extract the data, while also taking care of the delimiter for you. Here I'm setting the delimiter to the pipe symbol. This may be more efficient than trying to do it via Python, and you'd still be able to script it and call it from Python, or wherever.
$ sqlcl username/password@'<hostname>:<port>/ORACLE_SID'
> Set sqlformat delimited |
> Spool <some file to hold the data>
> Select * from <some table>;
> Spool off
And the above can easily be dropped into a shell script.
#!/bin/bash
sqlcl username/password@'<hostname>:<port>/ORACLE_SID' <<EOF
Set sqlformat delimited |
Spool <some file to hold the data>
Select * from <some table>;
Spool off
EOF
Example
sqlcl> Select * from emp;
"EMPNO"|"ENAME"|"JOB"|"MGR"|"HIREDATE"|"SAL"|"COMM"|"DEPTNO"
7839|"KING"|"PRESIDENT"||17-NOV-81|5000||10
7698|"BLAKE"|"MANAGER"|7839|01-MAY-81|2850||30
7782|"CLARK"|"MANAGER"|7839|09-JUN-81|2450||10
7566|"JONES"|"MANAGER"|7839|02-APR-81|2975||20
7788|"SCOTT"|"ANALYST"|7566|09-DEC-82|3000||20
7902|"FORD"|"ANALYST"|7566|03-DEC-81|3000||20
7369|"SMITH"|"CLERK"|7902|17-DEC-80|800||20
7499|"ALLEN"|"SALESMAN"|7698|20-FEB-81|1600|300|30
7521|"WARD"|"SALESMAN"|7698|22-FEB-81|1250|500|30
7654|"MARTIN"|"SALESMAN"|7698|28-SEP-81|1250|1400|30
7844|"TURNER"|"SALESMAN"|7698|08-SEP-81|1500|0|30
7876|"ADAMS"|"CLERK"|7788|12-JAN-83|1100||20
7900|"JAMES"|"CLERK"|7698|03-DEC-81|950||30
7934|"MILLER"|"CLERK"|7782|23-JAN-82|1300||10