I am trying to shuffle a csv file that is too large to load into my device's RAM at once. My plan is to use csv.DictReader to go through the entire file and write each line randomly into one of a set of smaller files using csv.DictWriter, then shuffle each of the smaller files, and then concatenate them again. However, I cannot get past the first step, because the smaller files cannot be opened with pandas; reading them gives the following error:

pandas.errors.EmptyDataError: No columns to parse from file

To perform the first step, I made a list of csv.DictWriter objects, one for each small file I want to split the main file into, and randomly assigned each line of the main file to one of them. Here is a dummy example that illustrates the idea of the first step on a synthetically generated csv file; it produces the same error:
import pandas as pd
import random
import csv

# First, create a dummy file containing the integers 0-19 over 2 columns.
data = pd.DataFrame({'col1': list(range(10)), 'col2': list(range(10, 20))})
data.to_csv('test_file.csv', index=False)

n_chunks = 2  # For this example I only split the dummy file into two smaller files

# Next, make a list of DictWriter objects, one for each smaller file
file_names = [f"test_batch_{batch_no}.csv" for batch_no in range(n_chunks)]
chunks = [csv.DictWriter(open(file_name, 'w'), ["col1", "col2"]) for file_name in file_names]

# Write headers for each smaller file
for chunk in chunks:
    chunk.writeheader()

# Now, randomly assign each line in test_file.csv to one of the smaller files.
with open("test_file.csv", newline='') as data:
    reader = csv.DictReader(data)
    for line in reader:
        i = random.randint(0, n_chunks - 1)
        chunks[i].writerow(line)

for file_name in file_names:
    # The next line gives the error.
    chunk = pd.read_csv(file_name)
What is strange is that the smaller files are composed entirely of lines copied directly from the main file, which in this example was itself written by pandas and can be loaded as a pandas dataframe without any problems. Moreover, when I inspect both of the smaller files ('test_batch_0.csv' and 'test_batch_1.csv') in e.g. Notepad, they look like regular csv files to me, yet pd.read_csv still cannot read them.
I tried varying the newline parameters of the 'open' calls, since previous posts suggested the 'No columns to parse from file' error is sometimes caused by incorrect newlines, but to no avail.
For completeness, here is the full traceback:
Traceback (most recent call last):
  File "[...]/main.py", line 29, in <module>
    chunk=pd.read_csv(file_name)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Many thanks in advance!
The original files were never closed, so the buffered writes had not been flushed to disk at the point where you tried to read the chunk files back; pd.read_csv saw two genuinely empty files. That is also why they look fine in Notepad afterwards: CPython flushes and closes the files when the script exits, which happens only after read_csv has already failed. You can see the buffering directly with a small experiment ('demo.txt' is just a scratch file for illustration):
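import os

f = open('demo.txt', 'w')  # 'demo.txt' is just a scratch file for this illustration
f.write('col1,col2\n0,10\n')
# The bytes are still sitting in Python's write buffer, not on disk:
print(os.path.getsize('demo.txt'))  # 0
f.close()  # closing the file flushes the buffer
print(os.path.getsize('demo.txt'))  # now non-zero

Here's the fixed code with minor changes: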
import pandas as pd
import random
import csv

data = pd.DataFrame({'col1': list(range(10)), 'col2': list(range(10, 20))})
data.to_csv('test_file.csv', index=False)

n_chunks = 2
file_names = [f'test_batch_{batch_no}.csv' for batch_no in range(n_chunks)]

# Open the chunk files, keeping references so they can be closed later.
# newline='' is what the csv module recommends for files it writes to.
files = [open(file_name, 'w', newline='') for file_name in file_names]
chunks = [csv.DictWriter(file, ['col1', 'col2']) for file in files]
for chunk in chunks:
    chunk.writeheader()

with open('test_file.csv', newline='') as data:
    reader = csv.DictReader(data)
    for line in reader:
        random.choice(chunks).writerow(line)

# Close the files; this flushes every buffered row to disk.
for file in files:
    file.close()

for file_name in file_names:
    df = pd.read_csv(file_name)
    print(df)
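Rather than closing the files by hand, you can also tie all of them to a single with block using contextlib.ExitStack from the standard library, which closes (and therefore flushes) every file automatically, even if the split raises an exception partway through. A sketch of the same splitting step written that way:

import contextlib
import csv
import random

import pandas as pd

n_chunks = 2
file_names = [f'test_batch_{batch_no}.csv' for batch_no in range(n_chunks)]

with contextlib.ExitStack() as stack:
    # enter_context registers each file with the stack; all of them are
    # closed automatically when the with block exits.
    files = [stack.enter_context(open(file_name, 'w', newline='')) for file_name in file_names]
    chunks = [csv.DictWriter(file, ['col1', 'col2']) for file in files]
    for chunk in chunks:
        chunk.writeheader()
    with open('test_file.csv', newline='') as data:
        reader = csv.DictReader(data)
        for line in reader:
            random.choice(chunks).writerow(line)

for file_name in file_names:
    print(pd.read_csv(file_name))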
Sample output (the split is random, so the exact rows vary from run to run; either version above prints something like this):
   col1  col2
0     0    10
1     1    11
2     2    12
3     6    16
   col1  col2
0     3    13
1     4    14
2     5    15
3     7    17
4     8    18
5     9    19
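With the split working, the remaining two steps of the original plan follow easily, because each chunk file is small enough to load on its own: shuffle each chunk in memory, then append it to a single output file. A minimal sketch of those steps, assuming the chunk files produced above (the output name 'test_file_shuffled.csv' is just a placeholder):

import pandas as pd

n_chunks = 2
file_names = [f'test_batch_{batch_no}.csv' for batch_no in range(n_chunks)]

with open('test_file_shuffled.csv', 'w', newline='') as out:
    for i, file_name in enumerate(file_names):
        chunk = pd.read_csv(file_name)  # each chunk fits in RAM by construction
        chunk = chunk.sample(frac=1)    # shuffle the rows of this chunk
        # Append to the output, writing the header only for the first chunk.
        chunk.to_csv(out, index=False, header=(i == 0))

Because every row of the original file lands in a uniformly random chunk and each chunk is then shuffled uniformly, the concatenated output is a uniform random shuffle of the whole file.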