pythonexcelopenpyxl

openpyxl workbook.save() corrupts simple xlsx workbook with external link


I have a simple Excel xlsx file with one sheet and one external link which becomes corrupted if I open it and save it with openpyxl.

This is a question similar to others (here and here and elsewhere). However, my example does not contain complicated excel features like charts / images / macros. And my example is not solved by different versions of openpyxl. I have tested these versions of openpyxl:

I have a simple Excel file with one sheet. It has one formula which is a link to another workbook. The formula is:

='C:\Temp\temp\[Book2.xlsx]Sheet1'!$A$3

Now I run the simple program:

def main():
    f1="C:/Temp/temp/Book1.xlsx"
    print(openpyxl.__version__)
    wb=openpyxl.load_workbook(filename=f1)
    wb.save(f1)
    wb.close()

After I run this the file becomes corrupted:

enter image description here

If I allow Excel to attempt to recover the file, the external link appears as: =[RecoveredExternalLink1]Sheet1!$A$3

I can update this link to the original file and no other corruption appears to have occurred.

This problem seems to have happened recently. Similar code I have been using for over a year has started to produce this issue in the last couple of months. My version of Excel (64 bit) is:

enter image description here


Solution

  • This is a known bug of openpyxl

    Root cause:

    In brief, when you track down in the XML structure of the Excel (change the file extension from .xlsx to .zip and unzip), the folder xl\externalLinks\_rels contains a file with external link reference, as follows (e.g. external link to a local file):

    (real case sample I got)

    Original, healthy Excel file XML structure:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="rId2"
                      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLinkPath"
                      Target="file:///C:\temp\sample.xlsx"
                      TargetMode="External"/>
        <Relationship Id="rId1"
                      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLinkPath"
                      Target="/temp/sample.xlsx"
                      TargetMode="External"/>
    </Relationships>
    

    After saving the Excel workbook by openpyxl, we may get the following structure which gives us a corrupted Excel file:

    Corrupted Excel file XML structure:

    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLinkPath"
                      Target="file:///C:\temp\sample.xlsx"
                      TargetMode="External"
                      Id="rId2"/>
    </Relationships>
    

    Note that this XML file produced by openpyxl has only ONE (the first) Id tag of rId2 (instead of rId1). An interesting experiment you can do is to change this Id tag to rId1. The corrupted Excel file will then be rescued and opened without error. Probably the Excel app expects the XML structure has an Id="rId1" and treat it as a corrupted file if it fails to locate this Id.

    How to solve the problem?

    Official fix is still not available from openpyxl The issue has been raised to the maintainers of openpyxl dated back to 2 years ago in April 2023. After a series of discussions and contributions by several users, one user has devised a very nice solution, in July 2024, with revised program codes. The revised codes have received many positive feedback (including mine).

    The solution is to enhance openpyxl to support writing multiple entries of the Relationship tags in the underlying XML structure of the Excel file. The program revisions are relatively succinct. We need only to revise a few lines in each of the 3 program source files.

    Quoting from the author's post (credit should go to the author nicoamaro), the amendments:

    The author even provided a diff file (based on the latest version 3.1.5) here. You can refer to the diff file for the required code changes to fix the problem. If you are using version 3.1.5 (the latest version as at today), you can simply download the revised codes from his/her commit attempt here. I found my installation of version 3.1.5 by conda doesn't have the test_external.py file. I just amended the other 2 program files and it still can fix the problem.

    If you are interested to see more details of the previous discussions and other users' contributions leading to the solution, you can refer to the official issue report here.

    One last thing: I have tried to alert the maintainer of openpyxl of the available fix solution. Hope they can take the solution into coming releases so that we don't need to manually change the programs ourselves. More importantly, to help more people to avoid the problem and save their time and efforts trying to fix it.