python parsing shlex

Python: Parsing data containing both types of quotation as well as special characters


Hi all, I am working on a project where I need to parse some data containing both " and ' quotation marks as well as special characters. The data is confidential and therefore cannot be posted here, but the text below replicates the issue.

"""
Brian: "I am not the messiah" Arthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

The end goal is to get the text in the form:

['Brian:', '"I am not the messiah"', 'Arthur:', '"I say you are Lord and I should know I\'ve followed a few"']

That is, all newline and tab characters are removed, and the text is split on newlines (the data is read from a file, so .readlines() takes care of that) and on spaces, except for spaces inside double (") quotation marks.

The code

import shlex as sh
line_info = sh.split(line.removesuffix("\n").replace("\t", " "))

comes close but fails to retain the quotation marks. (I don't need the quotation marks themselves, but I do need an indication that the text was quoted for further processing.)

Edit:

The original question showed every quoted phrase on a separate line from the non-quoted ones. Unfortunately, this is not the case in the file.


Solution

  • I think the problem is that shlex.split() strips the quotation marks in its default POSIX mode. There is an easy fix: pass the extra argument posix=False. In non-POSIX mode the quotation marks are kept as part of each token, for example:

    import io
    import shlex
    
    text = """
    Brian: "I am not the messiah" Arthur:\n\t "I say you are Lord and I should know I've followed a few"
    """
    
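    result = []
    # io.StringIO stands in for the real file object
    for line in io.StringIO(text).readlines():
        # posix=False keeps the surrounding quotation marks on each quoted token
        # (str.removesuffix requires Python 3.9+)
        line_info = shlex.split(line.removesuffix("\n").replace("\t", " "), posix=False)
        result.extend(line_info)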
    
    expectation = [
        'Brian:',
        '"I am not the messiah"',
        'Arthur:',
        '"I say you are Lord and I should know I\'ve followed a few"'
    ]
    
    assert result == expectation
    

    Here I wrap the string in io.StringIO to fake a file object, so that readlines() can be applied and the example stays close to your code.
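
Since the question asks only for an indication that a token was quoted, not for the quotation marks themselves, the non-POSIX tokens can be post-processed. The following is a minimal sketch, not part of shlex: the helper name split_with_quote_flag is hypothetical, and it assumes that only double quotes mark a quoted phrase, as in the example data.

```python
import shlex


def split_with_quote_flag(line):
    """Hypothetical helper: return (text, was_quoted) pairs for one line."""
    # strip() drops the leading tab and trailing newline; inner tabs become spaces
    tokens = shlex.split(line.strip().replace("\t", " "), posix=False)
    pairs = []
    for token in tokens:
        # Assumption: a token counts as quoted if it is wrapped in double quotes
        was_quoted = len(token) >= 2 and token[0] == token[-1] == '"'
        pairs.append((token[1:-1] if was_quoted else token, was_quoted))
    return pairs


line = 'Brian: "I am not the messiah" Arthur:'
print(shlex.split(line))               # default posix=True: quotes are dropped
print(shlex.split(line, posix=False))  # non-POSIX mode: quotes are kept
print(split_with_quote_flag(line))
# [('Brian:', False), ('I am not the messiah', True), ('Arthur:', False)]
```

One caveat, as far as I can tell from how shlex handles quotes: in non-POSIX mode an apostrophe in the middle of a word (as in I've) is harmless, but an unquoted word that starts with an apostrophe would open a single-quoted token, so such data may need extra handling.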