python regex pandas format-conversion

Extracting a number from a Korean text string, conditional on its content, and converting it to float in pandas


I'm a bit stuck on the following problem: I have a pandas data frame where one of the columns is a string of text in Korean that looks like this:

import pandas as pd

data = {'id': [1, 2, 3, 4, 5], 'age': ['3.5년령(추정)', '3개월령', '5일령(추정)', '3일령', '1.5개월령(추정)']}
df = pd.DataFrame(data)

Depending on what the string contains, I need to calculate the age in days. The text in parentheses, (추정), may or may not appear in the string; it means "estimated". The text just before the parentheses can be 년령 (years), 개월령 (months) or 일령 (days). Finally, the number before the text can be an integer or a float with one or two decimals. I need to extract the number and convert it to an age in days (rounded to 0 decimal places), like this:

result = {'id': [1,2,3,4,5],'age': [1278, 90, 5, 3, 45]}
df1 = pd.DataFrame(result)

I've tried to extract the numeric part of the string using regex as shown below but it doesn't cover all the cases and doesn't seem to work well either.

df['age'].str.replace(r'\([추정]\)$', '')

I would appreciate any suggestions. Thank you.


Solution

  • Use:

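    # days per unit: 년령 = years, 개월령 = months, 일령 = days
    d = {'년령': 365, '개월령': 30, '일령': 1}
    # a float such as 3.5 or .5, otherwise an integer
    # https://stackoverflow.com/a/4703409/2901002
    pat = r'(\d*\.\d+|\d+)'
    # with regex=True and numeric values, each cell that contains a key
    # is replaced as a whole by that key's number of days
    b = df['age'].replace(d, regex=True)
    # extract the number in front of the unit as a float
    a = df['age'].str.extract(pat, expand=False).astype(float)
    # multiply together
    df['age'] = b * a
    print(df)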
       id     age
    0   1  1277.5
    1   2    90.0
    2   3     5.0
    3   4     3.0
    4   5    45.0
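
The output above is still a float (1277.5), whereas the question asks for whole days. Below is a minimal sketch of one way to get there. It is an alternative, not the method above: it captures the number and the unit in a single `str.extract` call with named groups, maps the unit to days, then rounds. The names `days_per_unit`, `parts` and `age_days` are illustrative only. The 365 and 30 day factors are taken from the dictionary above.

```python
import pandas as pd

data = {'id': [1, 2, 3, 4, 5],
        'age': ['3.5년령(추정)', '3개월령', '5일령(추정)', '3일령', '1.5개월령(추정)']}
df = pd.DataFrame(data)

# Same conversion factors as in the answer above (assumed: 365 days/year, 30 days/month)
days_per_unit = {'년령': 365, '개월령': 30, '일령': 1}

# Named groups give a DataFrame with columns 'num' and 'unit';
# the optional "(추정)" suffix is simply not part of the match
parts = df['age'].str.extract(r'(?P<num>\d*\.\d+|\d+)(?P<unit>년령|개월령|일령)')

# number * days-per-unit, then round to whole days and store as int
age_days = parts['num'].astype(float) * parts['unit'].map(days_per_unit)
df['age_days'] = age_days.round().astype(int)

print(df[['id', 'age_days']])
```

Note that `round()` here rounds exact halves to the nearest even number, which turns 1277.5 into 1278 as in the expected result. If a row has a unit that is not in the pattern, the extraction yields NaN and the final `astype(int)` will raise, so such rows would need to be handled first.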