I have a file with a sql statement that I am reading into python using pyodbc
. The sql statement is just a select statement like so:
select distinct (columns) from table1
However the data I am calling is 30 million rows.
I can do this for smaller tables, and put the information in a dataframe.
Is there anyway to batch the select statement to pull only an X number of rows and append into a dataframe and keep doing this till the end of the 30million records?
code so far:
import os.path
import pandas as pd
import tinys3
import psycopg2
import pyodbc
from datetime import datetime
import uuid
import glob
from os import listdir
from os.path import isfile, join
import time
startTime = datetime.now()
#reading in data for db
server = 'xxxx'
database = 'xxx'
username = 'xxx'
password = 'xxxx'
driver= '{ODBC Driver 17 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=xxx;DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
path = "path/to/folder"
for infile in glob.glob( os.path.join(path, '*.sql') ):
with open(infile, 'r') as myfile:
sql = myfile.read()
print(sql)
myfile.close()
cursor.execute(sql)
row = cursor.fetchall()
columns = [column[0] for column in cursor.description]
columns = [element.lower() for element in columns]
df = pd.DataFrame([tuple(t) for t in row])
df.columns = columns
You can use the fetchmany function:
cursor.fetchmany([size=cursor.arraysize]) --> list
Returns a list of remaining rows, containing no more than size rows, used to process results in chunks. The list will be empty when there are no more rows.
The default for cursor.arraysize is 1 which is no different than calling fetchone().
A ProgrammingError exception is raised if no SQL has been executed or if it did not return a result set (e.g. was not a SELECT statement).
This will allow you to pull data in chunks.
Example of use:
while True:
three_rows = cursor.fetchmany(3)
# every loop cycle, 3 rows are selected
if not three_rows:
break
print(three_rows)
You could also use the fetchone function, to process the data row by row.
fetchone
cursor.fetchone() --> Row or None
Returns the next row or None when no more data is available.