I have a very large SAS file that won't fit in my server's memory. I simply need to convert it to a Parquet file. To do so, I am reading it in chunks using the chunksize option of the read_sas method in pandas. It is mostly working and doing its job, except that it fails with the following error partway through.
This particular SAS file has 79422642 rows of data, and it is not clear why it fails in the middle.
import pandas as pd

filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 340, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 795, in _chunk_to_dataframe
    rslt[name] = pd.Series(self._string_chunk[js, :], index=ix)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 461, in __init__
    com.require_length_match(data, index)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (2000000) does not match length of index (1179974)
I just tested the same logic on a smaller SAS file with fewer rows, using a smaller chunk size, as follows. It works fine without any errors, and it also handles the last chunk, which is smaller than the chunk size parameter:
filename = 'mysmallersasfile.sas7bdat'
SAS_CHUNK_SIZE = 1000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)
(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
(983, 5)
Perhaps try this code, which writes each chunk to a single Parquet file incrementally with pyarrow.parquet.ParquetWriter, so the full dataset never has to fit in memory at once:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'mysasfile.sas7bdat'
output_filename = 'output.parquet'
SAS_CHUNK_SIZE = 2000000

writer = None  # ParquetWriter is created lazily from the first chunk's schema

sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)

for i, sasDf in enumerate(sas_chunks):
    print(f"Processing chunk {i+1} with shape {sasDf.shape}")
    table = pa.Table.from_pandas(sasDf)  # convert the pandas DataFrame to an Arrow Table
    if writer is None:
        # create the Parquet file using the schema of the first chunk
        writer = pq.ParquetWriter(output_filename, table.schema)
    writer.write_table(table)  # append the Arrow Table to the Parquet file

if writer:
    writer.close()
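One thing to watch for with this pattern: ParquetWriter expects every chunk to match the schema it was opened with, and pandas may infer a slightly different dtype for a later chunk (for example, a column that happens to be entirely missing in one chunk). Below is a minimal sketch of a more defensive variant, assuming the first chunk's dtypes are representative of the whole file; it reuses the same filenames as above and casts each later chunk's table to the first chunk's schema before writing:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'mysasfile.sas7bdat'        # same input as above (assumed)
output_filename = 'output.parquet'
SAS_CHUNK_SIZE = 2000000

writer = None
schema = None  # schema taken from the first chunk; later chunks are cast to it

sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)
try:
    for i, sasDf in enumerate(sas_chunks):
        print(f"Processing chunk {i+1} with shape {sasDf.shape}")
        table = pa.Table.from_pandas(sasDf)
        if writer is None:
            schema = table.schema
            writer = pq.ParquetWriter(output_filename, schema)
        elif table.schema != schema:
            # cast to the first chunk's schema so write_table does not reject the chunk
            table = table.cast(schema)
        writer.write_table(table)
finally:
    if writer:
        writer.close()  # ensures the Parquet footer is written even if a chunk fails

The try/finally is there because a half-written Parquet file without its footer is unreadable; closing the writer in all cases keeps whatever chunks were successfully written usable.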