I've been trying with no success to importxml using google sheets to scrape the Advanced Receiving table data from the url https://www.pro-football-reference.com/boxscores/201912290car.htm.
I've tried the XPath copied directly from the inspect chrome page of: //*[@id="div_receiving_advanced"]
where I always get the "Imported content is empty" error message.
I'm stumped because it works with the Passing, Rushing, & Receiving table data using the XPath of: //*[@id="div_player_offense"]
When I use the XPath of: //*[@id="all_receiving_advanced"]
, I get the following results.
However, I'd like to parse the data from the 2nd column so it looks like this.
Any help would be greatly appreciated.
Since some players don't have value for specific columns (for eg : "Rec/Br"), transforming directly the data returned by IMPORTXML
will produce a scrambled table.
2 solutions :
A) Use IMPORTFROMWEB
addon (number of requests are limited in the free plan) with JS rendering activated and a base selector option to keep the data structure. XPath expressions needed for data :
/th/a
/td[@data-stat="team"]
/td[@data-stat="targets"]
/td[@data-stat="rec"]
/td[@data-stat="rec_yds"]
/td[@data-stat="rec_first_down"]
/td[@data-stat="rec_air_yds"]
/td[@data-stat="rec_air_yds_per_rec"]
/td[@data-stat="rec_yac"]
/td[@data-stat="rec_yac_per_rec"]
/td[@data-stat="rec_broken_tackles"]
/td[@data-stat="rec_broken_tackles_per_rec"]
/td[@data-stat="rec_drops"]
/td[@data-stat="rec_drop_pct"]
for the headers :
//div[@id="div_receiving_advanced"]//th[contains(@class,"poptip")]
for the base selector :
//div[@id="div_defense_advanced"]//tr[@data-row][not(@class)]
Formula used in C6 :
IMPORTFROMWEB(B1;B2:O2;B3:C4)
Output :
Side note : IMPORTFROMWEB
often output loading errors.
B) Use IMPORTDATA
and formulas to generate the table. First we load the data of interest with a filter (QUERY
). Then we fix the blank cells problem with SUBSTITUTE
. After that we extract the data with REGEXEXTRACT
. Finally we apply a last filter and SPLIT
the data to populate the cells.
Formula :
=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'rec_broken_tackles_per_rec'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3}).+targets.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14"));"select * WHERE NOT Col1 contains '<'");";"))
Output :
In both cases, blank cells are replaced with 0
.
My working workbook is here.
EDIT :
For "Advanced Defense Table" with IMPORTDATA
:
=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'def_tgt_yds_per_att'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3})<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?bli.+?>(.+?)<.+?qb_.+?>(.+?)<.+?qb_.+?>(.+?)<.+?sac.+?>(.+?)<.+?pre.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14;$15;$16;$17;$18;$19;$20;$21;$22"));"select * WHERE NOT Col1 contains '<'");";"))
Output :