xpathgoogle-sheets-formula

Using IMPORTXML on Google Sheets to extract image URLs


I've seen this asked many times here, but this is very different. The other requests for help using IMPORTXML, the op was asking about pulling standard HTML tags like a/href/img/src/ul/li, etc. What I'm looking to extract is embedded in the style tag, and it isn't extracting as expected.

On this page https://www.northtexasgivingday.org/organization/Salvationarmynorthtexas (and a series of similar pages) I am trying to extract the URL of the logo of various nonprofits.

I highlight the logo in Chrome dev tools and copy the Xpath, or the full Xpath - which does not return the expected logo url using IMPORTXML.

The xapth is //\*\[@id="org-logo"\]/span and the full Xpath is /html/body/app-root/mat-sidenav-container/mat-sidenav-content/div\[2\]/app-organization/div/app-org-page/div/div/div\[1\]/app-banner/div/div\[3\]/span

The image is stored in this element:
<span _ngcontent-ng-c332306571="" role="img" applazyload="" class="image set-background ng-star-inserted" aria-label="The Salvation Army North Texaslogo" style="opacity: 0; animation: 0.7s ease-out 0s 1 normal forwards running lazy-load-anim1; background-repeat: no-repeat; background-size: cover; width: 100%; height: 100%; background-position: center center; background-image: url("https://imagecdn.mightycause.com/36be848e-03bb-47c3-8616-6d51b811e38d/-/format/auto/-/progressive/yes/-/stretch/off/-/preview/&quot;);"></span>

I can see the image url in that element found in the style section under

background-image:

as: https://imagecdn.mightycause.com/36be848e-03bb-47c3-8616-6d51b811e38d/-/format/auto/-/progressive/yes/-/stretch/off/-/preview/

I've tried both Xapth options in these formulas:

IMPORTXML("https://www.northtexasgivingday.org/organization/Salvationarmynorthtexas", "//\*\[@id='org-logo'\]/span" )

or

IMPORTXML("https://www.northtexasgivingday.org/organization/Salvationarmynorthtexas", "/html/body/app-root/mat-sidenav-container/mat-sidenav-content/div\[2\]/app-organization/div/app-org-page/div/div/div\[1\]/app-banner/div/div\[3\]/span" )

These are not working. Can you guys help me figure this out?


Solution

  • When I saw the HTML of your URL, I noticed that the URL https://imagecdn.mightycause.com/36be848e-03bb-47c3-8616-6d51b811e38d/-/format/auto/-/progressive/yes/-/stretch/off/-/preview/ is created by Javascript. In this case, unfortunately, IMPORTXML cannot be directly used. But, fortunately, I found the image URL from JSON data in the HTML as https://imagecdn.mightycause.com/36be848e-03bb-47c3-8616-6d51b811e38d/. It seems that this image is the same as your expected URL. So, in this answer, I would like to propose to retrieve the URL.

    Sample script:

    In this case, it is required to use Google Apps Script. So, please open the script editor of your Spreadsheet, copy and paste the following script, and save the script.

    When you use this script, please put a custom function =SAMPLE("https://www.northtexasgivingday.org/organization/Salvationarmynorthtexas") into a cell. By this, the image URL is returned.

    function SAMPLE(url) {
      const res = UrlFetchApp.fetch(url);
      const str = res.getContentText().match(/<script id\="ng-state" type\="application\/json">(.*?)<\/script>/);
      const obj = JSON.parse(str[1]);
      const ar = Object.values(obj).flatMap(v => v?.body?.org_info?.circle_pic || []);
      return ar.length > 0 ? ar[0] : "";
    }
    

    Testing:

    When this script is tested, the following result is obtained. In this sample, =SAMPLE("https://www.northtexasgivingday.org/organization/Salvationarmynorthtexas") is put into cell "A1". And, =IMAGE(A1) is put into cell "B1".

    enter image description here

    Note:

    Reference: