I have a chunking and/or indexing issue in my Python code where I am trying to convert a text file into an .xlsx file. The problem is that .xlsx files have a hard per-sheet limit on the number of rows:
Traceback (most recent call last):
  File "/Users/rbarrett/Git/Cleanup/yourPeople3/convert_txt_to_xls.py", line 46, in <module>
    df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/core/generic.py", line 2417, in to_excel
    formatter.write(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/formats/excel.py", line 952, in write
    writer._write_cells(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/excel/_openpyxl.py", line 487, in _write_cells
    xcell = wks.cell(
            ^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 244, in cell
    cell = self._get_cell(row, column)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 257, in _get_cell
    raise ValueError(f"Row numbers must be between 1 and 1048576. Row number supplied was {row}")
ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577
As we can see, the error is:
ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577
and it looks like something is wrong with my slicing in the following script:
convert_txt_to_xls.py
#!/usr/bin/env python3
import pandas as pd
import argparse

# Set up argument parsing
parser = argparse.ArgumentParser(description="Convert a TXT file to CSV or XLSX format.")
parser.add_argument("input_txt_file", help="Path to the input TXT file")
parser.add_argument("output_file", help="Path to the output file (either .csv or .xlsx)")
parser.add_argument("--type", choices=['csv', 'xlsx'], required=True, help="Output file type: 'csv' or 'xlsx'")
parser.add_argument("--multiple-sheets", action='store_true', help="Split data across multiple sheets if type is 'xlsx'")
parser.add_argument("--delimiter", default=' ', help="Delimiter used in the input TXT file")

# Parse the arguments
args = parser.parse_args()

# Read the .txt file into a pandas DataFrame
df = pd.read_csv(args.input_txt_file, delimiter=args.delimiter, engine='python')

# Print DataFrame shape for inspection
print(f"DataFrame shape: {df.shape}")
print(df.head())

# Handle output based on the specified type
if args.type == 'csv':
    # Write the DataFrame to a CSV file
    df.to_csv(args.output_file, index=False)
    print(f"Conversion complete: {args.output_file}")
elif args.type == 'xlsx':
    if args.multiple_sheets:
        # Define Excel's maximum row limit
        max_rows = 1048576
        # Create a Pandas Excel writer using openpyxl
        with pd.ExcelWriter(args.output_file, engine='openpyxl') as writer:
            sheet_number = 1
            for i in range(0, len(df), max_rows):
                # Extract the chunk of data to be written
                df_chunk = df.iloc[i:i + max_rows].copy()
                # Reset the index to ensure each sheet starts at row 1
                df_chunk.reset_index(drop=True, inplace=True)
                # Write the chunk to the corresponding sheet
                df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
                sheet_number += 1
        print(f"Conversion complete with multiple sheets: {args.output_file}")
    else:
        # Write the entire DataFrame to a single Excel sheet
        if len(df) > max_rows:
            raise ValueError("Data exceeds Excel's row limit. Use --multiple-sheets to split data across sheets.")
        df.to_excel(args.output_file, index=False, engine='openpyxl')
        print(f"Conversion complete: {args.output_file}")
Say, for example, I have a text file with a lot of rows that looks like this:
46eb61ab1c0i90e909090w.................2 blob 88924339 logs/swf.log.1
5fb..........................c53da3f0cf1 blob 79474600 logs/swf.log.1
0f373270ad....................e3441da6bd blob 75058654 logs/swf.log.1
7f2..................5e510548fe2f35f9358 blob 74196729 hub/growth/growth/files/NewHireOnboarding.pptx
d7........................1e7e1cb8c0631f blob 70885244 logs/sqllog
But this file has a lot of rows, around 4,730,559 lines. What I am trying to do is chunk the data so that when I hit the per-sheet row limit, the script starts another sheet and keeps paginating across sheets. What's wrong with the Python script?
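For a rough sense of scale, 4,730,559 rows against the 1,048,576-row limit works out to ceil(4730559 / 1048576) = 5 chunks, so I would expect the writer loop to produce Sheet1 through Sheet5.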
If you want to run the script:
python3 convert_txt_to_xls.py blobs_with_sizes.txt blobs_with_sizes.xlsx --type=xlsx --multiple-sheets --delimiter=' '
I am passing --delimiter=' ' for the space between the columns.
You have an off-by-one overflow: to_excel writes a header row at the top of each sheet, so a chunk of exactly 1,048,576 data rows needs 1,048,577 worksheet rows, which is exactly the row number in your traceback.
Replace
max_rows = 1048576
with
max_rows = 1048575
so every sheet has room for the header row plus its data.
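For reference, a minimal sketch of the same writer loop with the corrected limit could look like the following; the EXCEL_MAX_ROWS name and the enumerate-based sheet counter are just illustrative choices on my part, not part of your script or of pandas:
EXCEL_MAX_ROWS = 1048576          # hard per-sheet row limit in .xlsx
max_rows = EXCEL_MAX_ROWS - 1     # reserve one worksheet row for the header that to_excel writes

with pd.ExcelWriter(args.output_file, engine='openpyxl') as writer:
    for sheet_number, start in enumerate(range(0, len(df), max_rows), start=1):
        # At most 1,048,575 data rows per chunk, so header + data fits inside one sheet
        df_chunk = df.iloc[start:start + max_rows]
        df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
The only functional change from your loop is the -1; the copy() and reset_index() calls are not needed here because index=False means the DataFrame index is never written to the worksheet anyway.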