[SOLVED] Parsing Korean text into a list using regex

I have some data stored in a pandas DataFrame, and one of the columns contains text strings in Korean. I would like to convert each of these strings, for example:

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'

Into a comma-separated string like this:

parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'

So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word, or several words separated by commas) and to replace them with all the words, both those before and those inside the parentheses, separated by commas (for later processing). If a word is followed by parentheses containing numbers (here, 7/22), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. I would also like to preserve the order of the words as they appear in the original string.

I can extract text in parentheses by using regex as follows:

import re
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)

which yields this:

[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]

But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.

Solution

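You can use re.findall with a pattern that matches a run of characters other than commas and parentheses, optionally followed by a parenthesized group that contains at least one digit. Each match keeps the whitespace that followed the preceding comma, so strip the items afterwards: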

corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
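To get from the list of matches to the string asked for in the question, one option is to strip each match and join the results. The following is a minimal sketch rather than a definitive implementation: the helper name split_findings and the constant ITEM_PATTERN are made up for illustration, and it assumes that a parenthesized group counts as "numeric" whenever it contains at least one digit.

```python
import re

# Same pattern as above: a run of characters that are not commas or
# parentheses, optionally followed by a parenthesized group containing a digit.
ITEM_PATTERN = re.compile(r'[^,()]+(?:\([^)]*\d[^)]*\))?')


def split_findings(text):
    """Hypothetical helper: return the items of `text` in their original order."""
    # Matches keep the space that followed the previous comma, so strip them.
    items = (match.strip() for match in ITEM_PATTERN.findall(text))
    # Drop anything that was whitespace only.
    return [item for item in items if item]


my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'

parsed_items = split_findings(my_string)
parsed_text = ', '.join(parsed_items)
print(parsed_text)
```

For a whole column, the same helper could be applied row by row, for example with df["notes"].apply(split_findings), where df and the column name "notes" are placeholders for your own DataFrame.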