[SOLVED] Pandas replace multiple substring patterns via dictionary

Suppose we want to replace multiple substrings via pd.Series.replace or pd.DataFrame.replace by passing a dictionary to the to_replace argument (with regex=True, so that the keys are treated as patterns):

What happens if multiple patterns (the dictionary keys) match in the string?
Are applicable replacements performed at once or consecutively?
If the latter, in which order are the replacements performed (e.g. the order the pattern matches occur in the string)?
What happens if multiple patterns match substrings at the same position in the string (which can happen with regexes)?
What happens if substrings in the replacement values match the patterns themselves?

Example:

Replace

'nan' --> 'miss'
'nan.*\b' --> 'nanword'
'na' --> 'no'
'miss' --> 'mrs'
'bana' --> 'eric'

in the string 'Nana likes bananas and ananas'.

Solution

Let's try a short example:

import pandas as pd

s = pd.Series(['abcde', 'bcde', 'xyz'])

s.replace(to_replace={'ab': 'xy', 'bc': 'BC', 'cd': 'CD', 'xy': 'XY'}, regex=True)

0    xyCDe
1     BCde
2      XYz
dtype: object

What happens if multiple patterns (the dictionary keys) match in the string? The keys are evaluated in dictionary order. Where matches overlap, only the pattern that comes first in the dictionary is replaced.
Are applicable replacements performed at once or consecutively? From the perspective of the user, the replacements are performed simultaneously (i.e. there is no circular replacement). In the example above, the xy that replaces ab is not subsequently replaced by XY.
In which order are the replacements performed (e.g. the order the pattern matches occur in the string)? The order in the dictionary matters.

# let's swap the first two keys
s.replace(to_replace={'bc': 'BC', 'ab': 'xy', 'cd': 'CD', 'xy': 'XY'}, regex=True)

0    aBCde
1     BCde
2      XYz
dtype: object
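
If priority by position in the string is wanted rather than priority by dictionary order, one option outside of replace is to combine the keys into a single alternation and substitute in one pass with the standard re module. The following is a minimal sketch, assuming literal (non-regex) keys; replace_in_one_pass is a hypothetical helper name, not a pandas function.

```python
import re

def replace_in_one_pass(text, mapping):
    """Hypothetical helper: replace literal substrings in a single left-to-right scan."""
    # Escape the keys so they are matched literally, and join them into one alternation.
    # At each position the alternatives are tried in dictionary order.
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    # Replaced text is never rescanned, so replacements cannot feed into each other.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'bc': 'BC', 'ab': 'xy', 'cd': 'CD', 'xy': 'XY'}
print(replace_in_one_pass('abcde', mapping))  # xyCDe
```

Here ab wins over bc even though bc comes first in the dictionary, because ab starts earlier in the string. This differs from the aBCde that replace returns for the same dictionary. On a Series, the helper could be applied element-wise, for example with s.map.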

What happens if multiple patterns match substrings at the same position in the string (which can happen with regexes)? As shown above, the first match wins, where "first" refers to the position of the key in the dictionary, not the position of the match in the string (ab vs bc in abcde). Below are other examples.

# overlapping regex, with lookarounds
pd.Series(['abcde']).replace(to_replace={'a(?=b)': 'A', '(?<=b)c': 'C'}, regex=True)
0    AbCde
dtype: object

# overlapping regex in which the first pattern breaks the second one
pd.Series(['abcde']).replace(to_replace={'ab': 'A', '(?<=b)c': 'C'}, regex=True)
0    Acde
dtype: object

# overlapping pattern in which the replacement preserves the second pattern
pd.Series(['abcde']).replace(to_replace={'ab': 'Ab', '(?<=b)c': 'C'}, regex=True)
0    AbCde
dtype: object

# overlapping pattern in which the replacement creates the second pattern
pd.Series(['abcde']).replace(to_replace={'ab': 'Ax', '(?<=x)c': 'C'}, regex=True)
0    Axcde
dtype: object

What happens if substrings in the replacement values match the patterns themselves? In the examples above, nothing: there is no circular replacement.
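
All of the outputs in this answer are consistent with a simple two-step model: first decide which patterns match the original string, then apply only those patterns one after another in dictionary order. The sketch below expresses that model with the standard re module. It is an assumption inferred from the examples, not pandas' actual implementation, and model_replace is a hypothetical name.

```python
import re

def model_replace(text, mapping):
    """Hypothetical model of the observed behaviour (not pandas' real code)."""
    # Step 1: keep only the patterns that match the ORIGINAL string.
    applicable = [(pat, repl) for pat, repl in mapping.items() if re.search(pat, text)]
    # Step 2: apply them consecutively, in dictionary order.
    for pat, repl in applicable:
        text = re.sub(pat, repl, text)
    return text

# Examples from this answer, with the outputs pandas printed for them.
print(model_replace('abcde', {'ab': 'xy', 'bc': 'BC', 'cd': 'CD', 'xy': 'XY'}))  # xyCDe
print(model_replace('abcde', {'bc': 'BC', 'ab': 'xy', 'cd': 'CD', 'xy': 'XY'}))  # aBCde
print(model_replace('abcde', {'ab': 'A', '(?<=b)c': 'C'}))   # Acde
print(model_replace('abcde', {'ab': 'Ax', '(?<=x)c': 'C'}))  # Axcde
```

In this model, xy is skipped in the first call because it does not occur in the original abcde, while bc is tried but no longer finds anything once ab has become xy.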