python, python-3.x, pandas, pyodbc

How to call data in batches from a select statement and append it into a dataframe?


I have a file with a SQL statement that I am reading into Python and executing with pyodbc. The SQL statement is just a select statement, like so:

select distinct (columns) from table1

However, the data I am pulling is 30 million rows.

I can do this for smaller tables, and put the information in a dataframe.

Is there any way to batch the select statement so that it pulls only X rows at a time, appends them to a dataframe, and keeps doing this until all 30 million records have been processed?

Code so far:

import os.path
import pandas as pd
import tinys3
import psycopg2
import pyodbc
from datetime import datetime
import uuid
import glob
from os import listdir
from os.path import isfile, join
import time

startTime = datetime.now()

#reading in data for db
server = 'xxxx' 
database = 'xxx' 
username = 'xxx' 
password = 'xxxx' 
driver= '{ODBC Driver 17 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=xxx;DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
path = "path/to/folder"




for infile in glob.glob(os.path.join(path, '*.sql')):
    # read the SQL statement from the file; the with block closes it automatically
    with open(infile, 'r') as myfile:
        sql = myfile.read()
    print(sql)

    cursor.execute(sql)

    # fetchall() pulls the entire result set into memory at once
    rows = cursor.fetchall()
    columns = [column[0] for column in cursor.description]
    columns = [element.lower() for element in columns]

    df = pd.DataFrame([tuple(t) for t in rows])
    df.columns = columns

Solution

  • You can use the fetchmany function:

    cursor.fetchmany([size=cursor.arraysize]) --> list

    Returns a list of remaining rows, containing no more than size rows, used to process results in chunks. The list will be empty when there are no more rows.

    The default for cursor.arraysize is 1, which is no different from calling fetchone().

    A ProgrammingError exception is raised if no SQL has been executed or if it did not return a result set (e.g. was not a SELECT statement).

    This will allow you to pull data in chunks.

    Example of use:

    while True:
        three_rows = cursor.fetchmany(3)
        # each iteration fetches the next 3 rows
        if not three_rows:
            break
        print(three_rows)
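
    Since the goal is to end up with a dataframe, you could build one DataFrame per chunk and concatenate them at the end. A minimal sketch, reusing the cursor and column handling from the question (the chunk size of 100,000 is an arbitrary assumption, tune it to your memory budget):

    chunk_size = 100000  # assumed value; adjust to what fits in memory
    columns = [column[0].lower() for column in cursor.description]

    chunks = []
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        # convert each pyodbc Row to a tuple before building the frame
        chunks.append(pd.DataFrame([tuple(r) for r in rows], columns=columns))

    # fall back to an empty frame if the query returned no rows
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame(columns=columns)

    Note that the final concatenated frame still holds all 30 million rows in memory; the chunking only avoids materializing the full pyodbc result list in one call.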
    

    You could also use the fetchone function to process the data row by row.

    fetchone

    cursor.fetchone() --> Row or None

    Returns the next row or None when no more data is available.
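
    For completeness, a minimal row-by-row sketch with fetchone (the row count here is just a placeholder for whatever per-row processing you actually need):

    count = 0
    while True:
        row = cursor.fetchone()
        if row is None:
            break
        count += 1  # replace with your per-row processing
    print(count)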