pythonexcelpandasxlrd

python: converting corrupt xls file


I have downloaded few sales dataset from a SAP application. SAP has automatically converted the data to .XLS file. Whenever I open it using Pandas library I am getting the following error:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'

When I opened the .XLS file using MSEXCEL it is shows a popup saying that the file is corrupt or unsupported extension do you want to continue when I clicked 'Yes' its showing the correct data. When I saved the file again as .xls using msexcel I am able to use it using Pandas.

So, I tried renaming the file using os.rename() but it dint work. I tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but then also it dint work.

The solution is to open MSEXCEL and save the file again as .xls manually, is there any way to automate this. Kindly help.


Solution

  • Finally I converted the corrupt .xls to a correct .xls file. The following is the code:

    # Changing the data types of all strings in the module at once
    from __future__ import unicode_literals
    # Used to save the file as excel workbook
    # Need to install this library
    from xlwt import Workbook
    # Used to open to corrupt excel file
    import io
    
    filename = r'SALEJAN17.xls'
    # Opening the file using 'utf-16' encoding
    file1 = io.open(filename, "r", encoding="utf-16")
    data = file1.readlines()
    
    # Creating a workbook object
    xldoc = Workbook()
    # Adding a sheet to the workbook object
    sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
    # Iterating and saving the data to sheet
    for i, row in enumerate(data):
        # Two things are done here
        # Removeing the '\n' which comes while reading the file using io.open
        # Getting the values after splitting using '\t'
        for j, val in enumerate(row.replace('\n', '').split('\t')):
            sheet.write(i, j, val)
    
    # Saving the file as an excel file
    xldoc.save('myexcel.xls')
    
    import pandas as pd
    df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
    

    No errors.