I am trying to pull data from WSJ and yahoofiance using google sheet app script.
I was able to pull some data through following code with HTML of current price from following page ie <span id="quote_val">(.+?)<\/span>
....please note that it containing Id .
Now i am trying to pull target price with HTML of <span class="data_data"><sup>$</sup>50.80</span>
. This will not give me desired result . please note that it contains Class not id..
When we choose url as lets say https://www.wsj.com/market-data/quotes/PCT/research-ratings
function SAMPLE(url) {
const html = UrlFetchApp.fetch(url).getContentText();
const res = html.match(/<span id="quote_val">(.+?)<\/span>/);
if (!res) throw new Error("Value cannot be retrieved.")
return isNaN(res[1]) ? res[1] : Number(res[1]);
}
Is there a simple solution?
I would tackle this in two steps:
Use a regular expression to extract the subsection of the web page that you are interested in - specifically the "Stock Price Target" table.
Parse that <table>...</table>
string into an HTML document and then iterate over the nodes in that document to extract each relevant item from the table.
function scrapeDemo() {
var url = 'https://www.wsj.com/market-data/quotes/PCT/research-ratings';
var html = UrlFetchApp.fetch(url).getContentText();
var res = html.match(/<div class="cr_data rr_stockprice module">.+?(<table .+?<\/table>)/);
var document = XmlService.parse(res[1]);
var root = document.getRootElement();
var trNodes = root.getChild('tbody').getChildren();
trNodes.forEach((trNode) => {
var tdNodes = trNode.getChildren();
var fieldName;
var fieldValue;
tdNodes.forEach((tdNode, idx) => {
if (idx % 2 === 0) {
fieldName = tdNode.getValue().trim();
} else {
fieldValue = tdNode.getValue().trim();
console.log( fieldName + " : " + fieldValue );
}
} );
} );
}
The regular expression uses this as its starting point:
<div class="cr_data rr_stockprice module">
This is because we need a reliably unique element which is a parent of the table we want (the table itself does not contain anything which uniquely identifies it).
This gives us the table in the res[1]
captured group. Here is that HTML:
<table class="cr_dataTable">
<tbody>
<tr>
<td>
<span class="data_lbl">High</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>48.00</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Median</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>37.50</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Low</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>24.00</span>
</td>
</tr>
<tr class="highlight">
<td>
<span class="data_lbl">Average</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>36.75</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Current Price</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>24.18</span>
</td>
</tr>
</tbody>
</table>
Now we perform step 2 using XmlService.parse()
to create a mini-XML document containing our HTML.
Then we iterate over the elements of that document, by drilling down into each level's child nodes.
Each field's value is written to the console, so for this table...
...we get this data:
High : $48.00
Median : $37.50
Low : $24.00
Average : $36.75
Current Price : $24.18
In my experience, doing this type of scraping can be difficult. Any unexpected changes in web page structure, from one page to another, can cause the regular expression to fail, or cause the drill-down into the table to fail. In other words, this type of approach should work, but it may also break unexpectedly.