pythonpandasxlsxtxtfile-conversion

Python .txt to .xlsx Conversion Chunking and Indexing Issue


I have a chunking and/or indexing issue with my python code where I am trying to convert a text script into an xlsx file. The problem is that xlsx files have a hard limit for the number of rows you can have:

Traceback (most recent call last):
  File "/Users/rbarrett/Git/Cleanup/yourPeople3/convert_txt_to_xls.py", line 46, in <module>
    df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/core/generic.py", line 2417, in to_excel
    formatter.write(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/formats/excel.py", line 952, in write
    writer._write_cells(
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/pandas/io/excel/_openpyxl.py", line 487, in _write_cells
    xcell = wks.cell(
            ^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 244, in cell
    cell = self._get_cell(row, column)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rbarrett/Library/Python/3.11/lib/python/site-packages/openpyxl/worksheet/worksheet.py", line 257, in _get_cell
    raise ValueError(f"Row numbers must be between 1 and 1048576. Row number supplied was {row}")
ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577

As we can see the ValueError: Row numbers must be between 1 and 1048576. Row number supplied was 1048577 is my error and it looks like something is wrong with my slicing in the following script:

#!/usr/bin/env python3

import pandas as pd
import argparse

# Set up argument parsing
parser = argparse.ArgumentParser(description="Convert a TXT file to CSV or XLSX format.")
parser.add_argument("input_txt_file", help="Path to the input TXT file")
parser.add_argument("output_file", help="Path to the output file (either .csv or .xlsx)")
parser.add_argument("--type", choices=['csv', 'xlsx'], required=True, help="Output file type: 'csv' or 'xlsx'")
parser.add_argument("--multiple-sheets", action='store_true', help="Split data across multiple sheets if type is 'xlsx'")
parser.add_argument("--delimiter", default=' ', help="Delimiter used in the input TXT file")

# Parse the arguments
args = parser.parse_args()

# Read the .txt file into a pandas DataFrame
df = pd.read_csv(args.input_txt_file, delimiter=args.delimiter, engine='python')

# Print DataFrame shape for inspection
print(f"DataFrame shape: {df.shape}")
print(df.head())

# Handle output based on the specified type
if args.type == 'csv':
    # Write the DataFrame to a CSV file
    df.to_csv(args.output_file, index=False)
    print(f"Conversion complete: {args.output_file}")

elif args.type == 'xlsx':
    if args.multiple_sheets:
        # Define Excel's maximum row limit
        max_rows = 1048576

        # Create a Pandas Excel writer using openpyxl
        with pd.ExcelWriter(args.output_file, engine='openpyxl') as writer:
            sheet_number = 1
            for i in range(0, len(df), max_rows):
                # Extract the chunk of data to be written
                df_chunk = df.iloc[i:i + max_rows].copy()

                # Reset the index to ensure each sheet starts at row 1
                df_chunk.reset_index(drop=True, inplace=True)

                # Write the chunk to the corresponding sheet
                df_chunk.to_excel(writer, sheet_name=f'Sheet{sheet_number}', index=False)
                sheet_number += 1

        print(f"Conversion complete with multiple sheets: {args.output_file}")
    else:
        # Write the entire DataFrame to a single Excel sheet
        if len(df) > max_rows:
            raise ValueError("Data exceeds Excel's row limit. Use --multiple-sheets to split data across sheets.")
        df.to_excel(args.output_file, index=False, engine='openpyxl')
        print(f"Conversion complete: {args.output_file}")

Say for example, I have text file with a lot of rows that looks like this:

46eb61ab1c0i90e909090w.................2 blob 88924339 logs/swf.log.1
5fb..........................c53da3f0cf1 blob 79474600 logs/swf.log.1
0f373270ad....................e3441da6bd blob 75058654 logs/swf.log.1
7f2..................5e510548fe2f35f9358 blob 74196729 hub/growth/growth/files/NewHireOnboarding.pptx
d7........................1e7e1cb8c0631f blob 70885244 logs/sqllog

but I have a lot of rows in this file, say like 4730559 lines, what happens is I am trying to create another sheet and chunk the pieces such that if I reach the limit, that I can start paginating across sheets. What's wrong with the python script section?

If you want to run the script:

python3 convert_txt_to_xls.py blobs_with_sizes.txt blobs_with_sizes.xlsx --type=xlsx --multiple-sheets --delimiter=' '

I am using the delimiter for ' ' for space between the columns.


Solution

  • You have an overflow of 1

    Replace

    max_rows = 1048576
    

    with

    max_rows = 1048575