regexgoogle-sheets

How do I remove a word that preceeds a character as well as the character itself


I am using IMPORTXML to grab some data and then regex to pull some text out. The part i want to extract is a name.

It always follows "Full Name:" (minus the quotes). After the name, there is always a word followed by another colon. This word could be Pronounciation, Linguistics, Type, or anything for that matter.

The closest I can get is this formula:

=IFERROR(REGEXEXTRACT(REGEXREPLACE(JOIN(" ", IMPORTXML(B13, "//div[@id='meta']")), "\s+", " "), "Full Name:\s*([A-Za-z]+(?:[-'\s][A-Za-z]+)*)"), "")

Unfortunately, that still leaves the last word in.

So if you have for example:

Full Name: Abraham Lincoln Nickname: Honest Abe.

I just want it to pull out Abraham Lincoln. That formula as I have it outputs:

Abraham Lincoln Nickname

Other examples:

Respective outputs:

For a live example: https://www.baseball-reference.com/players/l/lopezal01.shtml

The formula should output: Alfonso Ramon Lopez

https://www.baseball-reference.com/players/r/ruthba01.shtml

The formula should output: George Herman Ruth

What else do i need to do keeping in mind that i want to make sure names like O'Brien and Smith-Rogers aren't impacted?


Solution

  • This approach uses a slight variation of given xpath-query, which then extracts the Full Name

    =let(Σ,tocol(importxml(A2,"//*[@id='meta']/*")),
     +sort(regexextract(Σ,"Full Name: (.+)")))
    

    enter image description here