pythonregexdata-cleaning

Pairing elements around a separator using Python


I am relatively inexperienced at Python, and I've hit a wall using it to clean some text data into a usable format.

Essentially: I have paired names and values separated by a variable number of periods. This feature of the text is thankfully regular, but the surrounding format has much variability - There can be multiple (name, value) pairs on a single line, there can be additional useless text on any given line (and this "useless text" can include any characters, not just alphabetic), there can be entire lines with no useful data, etc.

An example of what the data looks like follows

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

Just to make the string easier for you to see on StackOverflow, this is what "the data" looks like visually, splitting on the newlines

apples, red .... 0.15 apples, green ... 0.99
bananas (bunch).......... 0.111
fruit salad, small........1.35 [unwanted stuff #1.11 here]
unwanted line here
fruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here

Another lucky feature of the string is that "unwanted text" will always follow the values and be at the end of the line. I do not need to worry about unwanted text being next to the (name).

At the end of the day, I want to get

apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66

or something similar that can be loaded into R, excel, etc.

I have tried using split and regular expressions by splitting on the variable number of periods, but I'm struggling to write an expression that gives me what I want. For example, I tried

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

text = re.split(r"\.{3,}|\n", string)
print(text)

which splits on either a newline or 3+ periods and gives

['apples, red ', ' 0.15 apples, green ', ' 0.99', 'bananas (bunch)', ' 0.111', 'fruit salad, small', '1.35 [unwanted stuff #1.11 here]', 'unwanted line here', 'fruit salad, large ', ' 1.77 strawberry ', ' 0.66 unwanted 00-11info here'] 

which is close, but the problems with this solution are:

(1) Each element in the list is not a correct (name, value) pair, as the split occurs between the (name) and (value) elements. E.g., the 0.15 should be associated with "apples, red", but instead it shares the list element with the subsequent "apples, green".

(2) There is some additional unnecessary text hanging about after some of the values. I could probably brute force some additional post-processing, but I feel like there should be a more elegant solution given the regularity of the string. I.e., there should be some regex out there that can look for "alphabetic characters" followed by "3 or more periods" followed by "number", with any additional text following the "number" being tossed out as useless.

Any help would be much appreciated. Thank you!


Solution

  • Try matching on:

    ([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+.\d+)
    

    See: regex101

    Then you can just join the strings:

    import re
    
    string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'
    
    pattern=r"([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+.\d+)"
    
    matches=re.findall(pattern,string)
    
    "\n".join(" | ".join(m.strip() for m in pair) for pair in matches)
    

    Explanation

    Between the item and the value you find

    Then match the amount