python pandas serpapi

How to filter rows in Pandas using a partial string in a column


I've used the SerpAPI to pull down some data about jobs in a sector I want to return to.

There is a lot of junk about training and I'd like to remove the results based on the displayed_link column.

position    title   link    displayed_link  date    snippet snippet_highlighted_words   sitelinks   about_this_result   about_page_link about_page_serpapi_link cached_page_link    related_questions   rich_snippet    related_pages_link  thumbnail   duration    key_moments
0   1   What Does a Data Analyst Do? Your 2022 Career ...   https://www.coursera.org/articles/what-does-a-...   https://www.coursera.org › Coursera Articles ›...   Nov 14, 2022    A data analyst is a person whose job is to gat...   [data analyst]  {'inline': [{'title': 'Business analyst', 'lin...   {'source': {'description': 'Coursera Inc. is a...   https://www.google.com/search?q=About+https://...   https://serpapi.com/search.json?engine=google_...   https://webcache.googleusercontent.com/search?...   NaN NaN NaN NaN NaN NaN
1   2   What Does a Data Analyst Do? Exploring the Day...   https://www.rasmussen.edu/degrees/technology/b...   https://www.rasmussen.edu › degrees › technolo...   Sep 19, 2022    Generally speaking, a data analyst will retrie...   [data analyst, Data analysts]   {'inline': [{'title': 'Where Do Data Analysts ...   {'source': {'description': 'Rasmussen Universi...   https://www.google.com/search?q=About+https://...   https://serpapi.com/search.json?engine=google_...   https://webcache.googleusercontent.com/search?...   NaN NaN NaN NaN NaN NaN
2   3   Become a Data Analyst Learning Path - LinkedIn  https://www.linkedin.com/learning/paths/become...   https://www.linkedin.com › learning › become-a...   NaN Data analysts examine information using data a...   [Data analysts, data analysis]  NaN {'source': {'description': 'LinkedIn is an Ame...   https://www.google.com/search?q=About+https://...   https://serpapi.com/search.json?engine=google_...   NaN NaN NaN NaN NaN NaN NaN
3   4   What Does a Data Analyst Do? - SNHU https://www.snhu.edu/about-us/newsroom/stem/wh...   https://www.snhu.edu › about-us › newsroom › stem   

I tried manually creating a list of the sites I want to exclude:

promotions = ["coursera"
,"rasmussen"
,"snhu"
,"mastersindatascience"
,"northeastern"
,"mygreatlearning"
,"payscale.com"
,"careerfoundry"
,"microsoft.com"
,"codecademy"
,"edx.org"
,"ahima.org"
,"›certification-exams›chda'"]

Tried this:

df['displayed_link'].map(lambda x: "T" if x in promotions else "F")

All it does is return F. I'm guessing that's because it needs an exact string match.

df['displayed_link'].map(lambda x: "T" if promotions in x else "F")

I tried it the other way around, but that raised an error, since a list cannot be tested for membership in a string.

What is the most efficient way to filter rows on a column using a list of manually curated substrings?


Solution

  • Use Series.str.contains and join the list with | to build a regex OR:

    import numpy as np

    # '|'.join(promotions) builds one regex that matches any of the substrings
    df['test1'] = np.where(df['displayed_link'].str.contains('|'.join(promotions)), 'T', 'F')
    df['test2'] = (df['displayed_link'].str.contains('|'.join(promotions))
                                       .map({True: 'T', False: 'F'}))
    

    If necessary, use word boundaries (\b...\b) so that only whole words match:

    pat = '|'.join(rf"\b{x}\b" for x in promotions)
    df['test3'] = np.where(df['displayed_link'].str.contains(pat), 'T', 'F')
    df['test4'] = df['displayed_link'].str.contains(pat).map({True: 'T', False: 'F'})
    print(df)
                                      displayed_link test1 test2 test3 test4
    0  https://www.coursera.org/articles/what-does-a     T     T     T     T
    1  https://www.rasmussen.edu/degrees/technology/     T     T     T     T
    2       https://www.linkedin.com/learning/paths/     F     F     F     F
    3  https://www.snhu1.edu/about-us/newsroom/stem/     T     T     F     F
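
To actually drop the matching rows rather than label them, the same boolean mask can be negated and used to index the DataFrame. The following is a minimal sketch, assuming a small made-up DataFrame and a shortened promotions list. The use of re.escape and na=False is an addition not in the answer above: it is there so that the dot in payscale.com is matched literally and missing links count as non-matches.

```python
import re

import pandas as pd

# Hypothetical sample data, modelled on the displayed_link values in the question.
df = pd.DataFrame({
    "displayed_link": [
        "https://www.coursera.org › Coursera Articles",
        "https://www.rasmussen.edu › degrees › technology",
        "https://www.linkedin.com › learning › become-a-data-analyst",
        None,  # missing value, to show the effect of na=False
    ]
})

# Shortened version of the exclusion list from the question.
promotions = ["coursera", "rasmussen", "payscale.com"]

# Escape regex metacharacters so "." matches a literal dot, then OR the terms together.
pat = "|".join(re.escape(p) for p in promotions)

# na=False treats missing links as non-matches, so the mask is strictly boolean.
is_promo = df["displayed_link"].str.contains(pat, na=False)

# Keep only the rows that do NOT contain any excluded substring.
filtered = df[~is_promo]
print(filtered)
```

Whether rows with a missing displayed_link should be kept or dropped is a judgment call; passing na=True instead would exclude them along with the promotional sites.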