I am using IMPORTXML to grab some data and then regex to pull some text out. The part i want to extract is a name.
It always follows "Full Name:" (minus the quotes). After the name, there is always a word followed by another colon. This word could be Pronounciation, Linguistics, Type, or anything for that matter.
The closest I can get is this formula:
=IFERROR(REGEXEXTRACT(REGEXREPLACE(JOIN(" ", IMPORTXML(B13, "//div[@id='meta']")), "\s+", " "), "Full Name:\s*([A-Za-z]+(?:[-'\s][A-Za-z]+)*)"), "")
Unfortunately, that still leaves the last word in.
So if you have for example:
Full Name: Abraham Lincoln Nickname: Honest Abe.
I just want it to pull out Abraham Lincoln. That formula as I have it outputs:
Abraham Lincoln Nickname
Other examples:
Respective outputs:
For a live example: https://www.baseball-reference.com/players/l/lopezal01.shtml
The formula should output: Alfonso Ramon Lopez
https://www.baseball-reference.com/players/r/ruthba01.shtml
The formula should output: George Herman Ruth
What else do i need to do keeping in mind that i want to make sure names like O'Brien and Smith-Rogers aren't impacted?
This approach uses a slight variation of given xpath-query
, which then extracts the Full Name
=let(Σ,tocol(importxml(A2,"//*[@id='meta']/*")),
+sort(regexextract(Σ,"Full Name: (.+)")))