xpathweb-scrapinggoogle-sheetsgoogle-sheets-formula

ImportXML XPath issue using Google Sheets on a simple web scraping query


I've been trying with no success to importxml using google sheets to scrape the Advanced Receiving table data from the url https://www.pro-football-reference.com/boxscores/201912290car.htm.

I've tried the XPath copied directly from the inspect chrome page of: //*[@id="div_receiving_advanced"] where I always get the "Imported content is empty" error message.

I'm stumped because it works with the Passing, Rushing, & Receiving table data using the XPath of: //*[@id="div_player_offense"]

When I use the XPath of: //*[@id="all_receiving_advanced"], I get the following results.

unparsed results

However, I'd like to parse the data from the 2nd column so it looks like this.

parsed results

Any help would be greatly appreciated.


Solution

  • Since some players don't have value for specific columns (for eg : "Rec/Br"), transforming directly the data returned by IMPORTXML will produce a scrambled table.

    2 solutions :

    A) Use IMPORTFROMWEB addon (number of requests are limited in the free plan) with JS rendering activated and a base selector option to keep the data structure. XPath expressions needed for data :

    /th/a
    /td[@data-stat="team"]
    /td[@data-stat="targets"]
    /td[@data-stat="rec"]
    /td[@data-stat="rec_yds"]
    /td[@data-stat="rec_first_down"]
    /td[@data-stat="rec_air_yds"]
    /td[@data-stat="rec_air_yds_per_rec"]
    /td[@data-stat="rec_yac"]   
    /td[@data-stat="rec_yac_per_rec"]   
    /td[@data-stat="rec_broken_tackles"]    
    /td[@data-stat="rec_broken_tackles_per_rec"]    
    /td[@data-stat="rec_drops"] 
    /td[@data-stat="rec_drop_pct"]
    

    for the headers :

    //div[@id="div_receiving_advanced"]//th[contains(@class,"poptip")]
    

    for the base selector :

    //div[@id="div_defense_advanced"]//tr[@data-row][not(@class)]
    

    Formula used in C6 :

    IMPORTFROMWEB(B1;B2:O2;B3:C4)
    

    Output :

    IFWrev2f

    Side note : IMPORTFROMWEB often output loading errors.

    B) Use IMPORTDATA and formulas to generate the table. First we load the data of interest with a filter (QUERY). Then we fix the blank cells problem with SUBSTITUTE. After that we extract the data with REGEXEXTRACT. Finally we apply a last filter and SPLIT the data to populate the cells.

    Formula :

    =ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'rec_broken_tackles_per_rec'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3}).+targets.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14"));"select * WHERE NOT Col1 contains '<'");";"))
    

    Output :

    IDATA

    In both cases, blank cells are replaced with 0.

    Blank

    My working workbook is here.

    EDIT :

    For "Advanced Defense Table" with IMPORTDATA :

    =ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'def_tgt_yds_per_att'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3})<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?bli.+?>(.+?)<.+?qb_.+?>(.+?)<.+?qb_.+?>(.+?)<.+?sac.+?>(.+?)<.+?pre.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14;$15;$16;$17;$18;$19;$20;$21;$22"));"select * WHERE NOT Col1 contains '<'");";"))
    

    Output :

    ADT