Let's say we have a dataframe df representing the activities of some people, as follows:
index | Mary | Tristan | Louise | Arnaud | Justin | Stacy |
---|---|---|---|---|---|---|
0 | Engineer | Software Engineer | Rock Singer | Rap Singer | Lumberjack | Biomedical Engineer |
1 | Guitarist | Aerospace Engineer | Author | Firefighter | | |
2 | Business Man |
I would like to check whether each activity is, or might be, software engineering. With s = 'Software Engineer', we would obtain:
index | Mary | Tristan | Louise | Arnaud | Justin | Stacy |
---|---|---|---|---|---|---|
0 | True | True | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | False | False | False | False |
This means that I want to test, for every cell in df, whether it is a substring of s. The following already works, but it looks dirty:
s = 'Software Engineer'
df.apply(lambda col: col.apply(lambda x: str(x) in s))
What bothers me is the double apply. Is there a better solution?
To check whether each cell in your dataframe is a substring of s, there is no need for NumPy; you can use applymap:
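df.applymap(lambda cell: isinstance(cell, str) and bool(cell) and cell in s)
Note: isinstance(cell, str) excludes NaN cells, and bool(cell) excludes empty strings; both are marked as False. bool(cell) alone is not enough, because NaN is a truthy float and `NaN in s` raises a TypeError. From pandas 2.1 on, applymap is deprecated in favor of DataFrame.map, which takes the same function.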
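As a rough, self-contained sketch of the element-wise check, here is a small frame modeled on the first two rows of the question's table. The assumption that empty cells are NaN, the helper name is_substring_of_s, and the fallback between DataFrame.map and applymap are illustrative choices, not part of the original question:

```python
import numpy as np
import pandas as pd

s = "Software Engineer"

# Sample data modeled on the question's table; NaN stands in for empty cells (an assumption).
df = pd.DataFrame({
    "Mary": ["Engineer", "Guitarist"],
    "Tristan": ["Software Engineer", "Aerospace Engineer"],
    "Justin": ["Lumberjack", np.nan],
})

def is_substring_of_s(cell):
    # Hypothetical helper: only non-empty strings can match; NaN is a float and is rejected.
    return isinstance(cell, str) and bool(cell) and cell in s

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = getattr(df, "map", None) or df.applymap
result = elementwise(is_substring_of_s)
print(result)
```

Using a named function instead of a lambda keeps the NaN handling in one place and makes it easy to test on single values.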
Also, if you want the other way around, i.e. to check whether s is a substring of each cell, you can use vectorized string functions to further optimize your code:
df.apply(lambda column: column.str.contains(s))
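One thing to keep in mind with the reverse check: by default str.contains treats s as a regular expression and returns a missing value for NaN cells. A minimal sketch, assuming you want a literal match and False for empty cells (the sample frame and the value of s below are made up for illustration):

```python
import numpy as np
import pandas as pd

s = "Engineer"

# Illustrative data; NaN stands in for an empty cell.
df = pd.DataFrame({
    "Mary": ["Engineer", "Guitarist"],
    "Stacy": ["Biomedical Engineer", np.nan],
})

# regex=False treats s as literal text; na=False reports missing cells as False.
contains_s = df.apply(lambda column: column.str.contains(s, regex=False, na=False))
print(contains_s)
```

Passing regex=False matters as soon as s contains characters such as "." or "(", which would otherwise be interpreted as regex syntax.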