python regex string split

Split a string in Python on a floating-point number; if no floating-point number is found, split it on an integer


I have a list of strings and I want to split each string on a floating point number. If there is no floating point number in the string, I want to split it on a number. It should only split once and return everything before and after it separated by commas.

Input string:

['Naproxen  500  Active ingredient  Ph Eur',
 'Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur',
 'Povidone K90  11.0   Binder 56 Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous  2.62  Glidant  Ph Eur',
 'Magnesium stearate  1.38  Lubricant  Ph Eur']

Expected output:

['Naproxen',  '500',  'Active ingredient  Ph Eur',
 'Croscarmellose sodium',  '22.0 mg',  'Disintegrant  Ph Eur',
 'Povidone K90',  '11.0',  'Binder  Ph Eur',
 'Water, purified',
 'Silica, colloidal anhydrous',  '2.62',  'Glidant  Ph Eur',
 'Magnesium stearate',  '1.38',  'Lubricant  Ph Eur']

Solution

  • Try this re.split option:

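    import re

    inp = 'Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur'
    # Split once on the first whitespace-delimited number; the capture group keeps it in the result
    parts = re.split(r'\s+(\d+(?:\.\d+)?)\s+', inp, maxsplit=1)
    print(parts)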
    

    This prints:

    ['Croscarmellose sodium', '22.0', 'mg Disintegrant  Ph Eur']
    

    The idea is to split on this regex pattern:

    \s+(\d+(?:\.\d+)?)\s+
    

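    This matches a number, with an optional decimal part, surrounded by whitespace. The number is wrapped in a capturing group because re.split includes captured text in the result list; without the parentheses, the number would be discarded along with the surrounding whitespace. The maxsplit argument (the third parameter of re.split) is set to 1, which tells Python to split only once. Note that this pattern splits on the first whitespace-delimited number it finds, whether or not it has a decimal part, and that a string containing no such number comes back as a single-element list.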
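
The question also asks that a floating-point number take priority over a plain integer, which the single pattern above does not guarantee when an integer appears earlier in the string. One possible way to get that behaviour is to try a float-only pattern first and fall back to an integer pattern. This is a minimal sketch, not a definitive implementation: the names `FLOAT_SPLIT`, `INT_SPLIT` and `split_on_number` are hypothetical, and it assumes a float is written as digits, a dot, and digits, delimited by whitespace on both sides.

```python
import re

# Hypothetical names; both patterns assume the number is delimited by whitespace.
FLOAT_SPLIT = re.compile(r'\s+(\d+\.\d+)\s+')  # digits, a dot, digits
INT_SPLIT = re.compile(r'\s+(\d+)\s+')         # digits only


def split_on_number(text):
    """Split once on the first float; if there is none, on the first integer."""
    for pattern in (FLOAT_SPLIT, INT_SPLIT):
        parts = pattern.split(text, maxsplit=1)
        if len(parts) > 1:  # the pattern matched, so a split happened
            return [p.strip() for p in parts]
    return [text]  # no number found: return the string unchanged


rows = [
    'Naproxen  500  Active ingredient  Ph Eur',
    'Povidone K90  11.0   Binder 56 Ph Eur',
    'Water, purifieda',
]
for row in rows:
    print(split_on_number(row))
```

Under these assumptions, the second row is split on `11.0` rather than on `56`, and the row without any number is returned as a one-element list. Units such as `mg` stay in the trailing part, as in the answer above.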