python regex nlp whitespace removing-whitespace

How to remove spaces in a "single" word? ("bo ok" to "book")


I am reading badly formatted text that often has unwanted spaces inside a single word, for example "int ernational trade is not good for economies". Is there an efficient tool that can cope with this? (There are a couple of other answers, like the one here, but they do not work on a whole sentence.)

Edit: I agree with the point about this being impossible in general. One option is to keep every possible candidate. In my case the edited text will be matched against another database that holds the original (clean) text, so any wrong removal of spaces simply gets discarded.
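
The idea in the edit, keeping every possible joining and letting the match against the clean text throw away the wrong ones, could be sketched roughly as follows. This is only an illustration: `candidate_joins` and `clean_texts` are hypothetical names, and the clean database is stood in for by a plain Python set. Since n tokens produce 2^(n-1) candidates, it is only practical on short spans of text.

```python
from itertools import product

def candidate_joins(text):
    """Yield every string obtained by keeping or deleting each space in text."""
    words = text.split()
    if not words:
        return
    # One choice per gap between adjacent tokens: '' joins them, ' ' keeps the space.
    for seps in product(("", " "), repeat=len(words) - 1):
        pieces = [words[0]]
        for sep, word in zip(seps, words[1:]):
            pieces.append(sep + word)
        yield "".join(pieces)

# Hypothetical stand-in for the database of clean text.
clean_texts = {"international trade", "book"}

matches = [c for c in candidate_joins("int ernational tr ade") if c in clean_texts]
print(matches)  # ['international trade']
```

Candidates that match nothing in the clean set are dropped, which is the "wrong removals get discarded" behaviour described above.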


Solution

  • You could use the PyEnchant package to check tokens against an English dictionary. I will assume that a token which is not a valid word on its own, but forms one when appended to the preceding token, belongs to that token. The following code uses this to find words that are split by a single space:

    import enchant
    
    text = "int ernational trade is not good for economies"
    fixed_text = []
    
    d = enchant.Dict("en_US")
    
    for word in text.split():
        # Merge only if this token is not a word but completes the previous token.
        if fixed_text and not d.check(word) and d.check(compound_word := fixed_text[-1] + word):
            fixed_text[-1] = compound_word
        else:
            fixed_text.append(word)
    
    print(' '.join(fixed_text))
    

    This splits the text on whitespace and appends the tokens to fixed_text one at a time. When the current token is not in the dictionary, but appending it to the previously added token produces a valid word, the two are joined into one.

    This should help clean up many of the invalid words, but as the comments mentioned, it is sometimes impossible to tell whether two words belong together without performing some sort of lexical analysis.

    As suggested by Pranav Hosangadi, here is a modified (and slightly more involved) version that can remove multiple spaces within a word by walking back over previously added tokens that are not in the dictionary and joining them. However, because many short strings are valid English words on their own, a fragment that happens to be a word (for example "in" or "on") is never merged, so many spaced-out words are still not concatenated correctly.

    import enchant
    
    text = "inte rnatio nal trade is not good for ec onom ies"
    fixed_text = []
    
    d = enchant.Dict("en_US")
    
    for word in text.split():
        merged = False
        if fixed_text and not d.check(word):
            compound_word = word
            # Walk backwards over earlier tokens, prepending fragments one by one.
            for j, pending_word in enumerate(fixed_text[::-1], 1):
                if d.check(pending_word):
                    break  # a valid word ends the run of fragments; never delete it
                compound_word = pending_word + compound_word
                if d.check(compound_word):
                    # Replace the last j fragments with the joined word.
                    del fixed_text[-j:]
                    fixed_text.append(compound_word)
                    merged = True
                    break
        if not merged:
            fixed_text.append(word)
    
    print(' '.join(fixed_text))
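
To try the backward-merging idea without installing a dictionary, it can be wrapped in a function that takes the word check as a parameter. This is a minimal sketch, not a tested tool: `merge_fragments`, `is_word` and the toy `vocab` set are hypothetical names introduced here, and the scan is assumed to stop at the first valid word it meets, so that valid words are never swallowed into a merge.

```python
def merge_fragments(text, is_word):
    """Join runs of adjacent non-word fragments that together form a word."""
    fixed = []
    for word in text.split():
        merged = False
        if not is_word(word):
            compound = word
            # Walk backwards over earlier tokens, prepending fragments.
            for j, pending in enumerate(fixed[::-1], 1):
                if is_word(pending):
                    break  # a valid word ends the run of fragments
                compound = pending + compound
                if is_word(compound):
                    del fixed[-j:]
                    fixed.append(compound)
                    merged = True
                    break
        if not merged:
            fixed.append(word)
    return " ".join(fixed)

# Toy vocabulary standing in for a real dictionary (assumption for this sketch).
vocab = {"in", "international", "trade", "is", "not", "good", "for", "economies"}

print(merge_fragments("inte rnatio nal trade is not good for ec onom ies", vocab.__contains__))
# international trade is not good for economies
print(merge_fragments("in ternational trade", vocab.__contains__))
# in ternational trade
```

The second call shows the limitation described above: "in" is a valid word by itself, so it is never merged with "ternational". With PyEnchant installed, the `check` method of an `enchant.Dict("en_US")` object can be passed as `is_word` instead of the toy set.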