I've created a script in Python using pyppeteer to collect the links of different posts from a webpage and then parse the title of each post by visiting its target page, reusing those collected links. Although the content is static, I'd like to know how pyppeteer works in such cases.
I tried to supply the browser variable from the main() function to fetch() and browse_all_links() so that I can reuse the same browser over and over again.
My current approach:
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
The above script fetches some titles but throws the following error at some point during execution:
Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value
#ERROR STARTS
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved
As it's been two days since this question was posted and no one has answered yet, I will take this opportunity to address the issue with what I think might be helpful to you.
There are 15 links but you are getting only 7; this is probably because websockets is losing the connection and the page is not reachable anymore.
List comprehension
tasks = [await browse_all_links(page, url) for url in links]
What do you expect this list to be? If it runs successfully, it will be a list of None elements, so your next line of code will throw an error!
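To make this concrete, here is a minimal standalone sketch, with a dummy work coroutine standing in for browse_all_links:

```python
import asyncio

async def work(x):
    """Stand-in for browse_all_links: awaits something, returns None."""
    await asyncio.sleep(0)

async def main():
    items = [1, 2, 3]
    # Awaiting inside the comprehension runs each call one after another
    # and collects its return values -- a list of None, not of coroutines.
    tasks = [await work(x) for x in items]
    # Passing that list to gather would fail: None is not awaitable.
    # To actually run concurrently, hand gather the un-awaited coroutines:
    results = await asyncio.gather(*(work(x) for x in items))
    return tasks, results

tasks, results = asyncio.run(main())
print(tasks)    # [None, None, None]
print(results)  # [None, None, None]
```

So by the time you call asyncio.gather(*tasks), every coroutine has already run and you are gathering a list of None.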
Solution
Downgrade websockets 7.0 to websockets 6.0
Remove this line of code: await asyncio.gather(*tasks)
I am using Python 3.6, so I had to change the last line of code. You don't need to change it if you are using Python 3.7, which I think you are.
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page, link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False, autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page, url)
    tasks = [await browse_all_links(page, url) for url in links]
    #await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    #asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
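If you would rather keep the crawling concurrent instead of dropping the gather call, one possible approach (a sketch only, not tested against the websockets issue above) is to give each task its own tab via browser.newPage(), so that no two coroutines navigate the same page at once:

```python
import asyncio

async def browse_concurrently(browser, links):
    """Visit every link in its own tab so tasks don't clobber a shared page."""
    async def visit(link):
        page = await browser.newPage()  # a dedicated tab for this task
        try:
            await page.goto(link)
            await page.waitForSelector('h1 > a')
            title = await page.querySelectorEval('h1 > a', '(e => e.innerText)')
            print(title)
        finally:
            await page.close()

    # Pass un-awaited coroutines to gather so they run concurrently.
    await asyncio.gather(*(visit(link) for link in links))
```

The "Cannot find context with specified id" error in the question is consistent with several tasks sharing one page: each page.goto destroys the execution context the other tasks are still evaluating against.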
(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrape a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. "Object doesn't support this property"