python-docx bulletedlist numbered-list isspace

Why is isspace() returning false for strings from the docx python library that are empty?


My objective is to extract strings from numbered/bulleted lists in multiple Microsoft Word documents and combine them into a single one-line string in which the strings are numbered in order: 1.string1 2.string2 3.string3 etc. I refer to these one-line strings as procedures, each consisting of 'steps' 1., 2., 3., etc.

This format is required because the procedure strings are put into a database, the database is used to create Excel spreadsheet outputs, and a formatting macro run on those spreadsheets only works properly when the procedure strings are in this format.

The numbered/bulleted lists in MS Word are all similar in format, but some use numbers, some use bullets, and some have extra blank lines before the first point or after the last point.

The following text shows three different examples of how the Word documents are formatted:

Paragraph Keyword 1: arbitrary text
1. Step 1
2. Step 2
3. Step 3
Paragraph Keyword 2: arbitrary text

Paragraph Keyword 3: arbitrary text
• Step 1
• Step 2
• Step 3

Paragraph Keyword 4: arbitrary text

Paragraph Keyword 5: arbitrary text

  1. Step 1
  2. Step 2
  3. Step 3

Paragraph Keyword 6: arbitrary text

(For some reason the first two lists didn't get indented in the formatting of this post, but in my Word document all the indentation is the same.)

When the numbered/bulleted list is formatted without extra blank lines, my code works fine, e.g. between "Paragraph Keyword 1:" and "Paragraph Keyword 2:".

I was trying to use isspace() to detect the extra blank lines, which aren't part of the list, so that I can leave them out of my procedure strings.

Here is my code:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
def extractStrings(file):
    doc = file
    for i in range(len(doc.paragraphs)):
        str1 = doc.paragraphs[i].text
        if "Paragraph Keyword 1:" in str1:
            start1=i
        if "Paragraph Keyword 2:" in str1:
            finish1=i
        if "Paragraph Keyword 3:" in str1:
            start2=i
        if "Paragraph Keyword 4:" in str1:
            finish2=i
        if "Paragraph Keyword 5:" in str1:
            start3=i
        if "Paragraph Keyword 6:" in str1:
            finish3=i
    print("----------------------------")
    procedure1 = ""
    y=1
    for x in range(start1 + 1, finish1):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure1 = (procedure1 + " " + str(y) + "." + temp)
            else:
                procedure1 = (procedure1 + str(y) + "." + temp)
            y=y+1
            print(procedure1)
    print("----------------------------")
    procedure2 = ""
    y=1
    for x in range(start2 + 1, finish2):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure2 = (procedure2 + " " + str(y) + "." + temp)
            else:
                procedure2 = (procedure2 + str(y) + "." + temp)
            y=y+1
            print(procedure2)
    print("----------------------------")
    procedure3 = ""
    y=1
    for x in range(start3 + 1, finish3):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure3 = (procedure3 + " " + str(y) + "." + temp)
            else:
                procedure3 = (procedure3 + str(y) + "." + temp)
            y=y+1
            print(procedure3)
    print("----------------------------")
    del doc
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

import docx
doc1 = docx.Document("docx_isspace_experiment_042420.docx")
extractStrings(doc1)
del doc1

Unfortunately I have no way of putting the output into this post, but the problem is that whenever there is a blank line in the Word document, isspace() returns False and a number "x." is assigned to the empty string, so I end up with something like: 1. 2.Step 1 3.Step 2 4.Step 3 5. 6. (that's the last iteration of print(procedure3) from the code).

In other words, isspace() returns False even when my Python console output shows that the string is just a blank line.

Am I using isspace() incorrectly? Is there something in the string I am not detecting that is causing isspace() to return false? Is there a better way to accomplish this?


Solution

  • Use the test:

    # --- for s a str value, like paragraph.text ---
    if s.strip() == "":
        print("s is a blank line")
    

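    str.isspace() returns True only if the string contains at least one character and every character in it is whitespace. An empty str contains no characters at all, so isspace() returns False for it. The text of a blank paragraph is an empty string, which is why the check in the question never filters it out. s.strip() == "" is True both for an empty string and for a string made up only of whitespace.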
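
    As a minimal sketch of the difference, the following works on plain strings such as the values of paragraph.text. The helper names is_blank and build_procedure are hypothetical (they are not part of python-docx), and the sketch assumes the steps should be numbered in order and joined with a single space, as described in the question.

```python
def is_blank(s):
    """Return True for an empty string or one containing only whitespace."""
    return s.strip() == ""


def build_procedure(texts):
    """Join the non-blank strings as '1.first 2.second ...'.

    Hypothetical helper: `texts` is any iterable of str, for example the
    paragraph texts found between two keyword paragraphs.
    """
    steps = [t for t in texts if not is_blank(t)]
    return " ".join(f"{n}.{t}" for n, t in enumerate(steps, start=1))


# An empty string has no characters, so isspace() is False for it.
print("".isspace())     # False
print("   ".isspace())  # True
print(is_blank(""))     # True

# Paragraph texts as they might come from a list surrounded by blank lines.
texts = ["", "Step 1", "Step 2", "Step 3", "", ""]
print(build_procedure(texts))  # 1.Step 1 2.Step 2 3.Step 3
```

    Because the helper takes a list of strings rather than a document, the same call could replace each of the three near-identical loops in the question, e.g. by passing the paragraph texts between one start index and its matching finish index.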