python selenium-chromedriver browsermob-proxy har

Cannot find the bodies of a tweet's replies in the .har created with Selenium and browsermobproxy in Python


I'm trying to scrape the replies to a specific tweet using Selenium and browsermobproxy. Here's my setup:

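from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Proxy
bmob_server = Server("./additional_lib/browsermob-proxy-2.1.4/bin/browsermob-proxy")
bmob_server.start()
bmob_proxy = bmob_server.create_proxy()

# Driver
co = webdriver.ChromeOptions()
co.add_argument('--ignore-ssl-errors=yes')
co.add_argument('--ignore-certificate-errors')
co.add_argument('--proxy-server={host}:{port}'.format(host='localhost', port=bmob_proxy.port))
# Pass the options via `options=`; the `chrome_options=` keyword is deprecated in Selenium 4
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=co)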

After logging in to Twitter, I open the tweet and start the HAR capture:

 bmob_proxy.new_har(TARGET_TWEET, options={'captureHeaders': True, 'captureContent': True})
 driver.get(TARGET_TWEET)

Scrolling to load more comments:

 driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
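
A single scroll usually triggers only one more batch of replies. As a minimal sketch, assuming the page keeps growing while new replies load, the scroll can be repeated until the page height stops changing. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical, and the right pause length depends on the connection:

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to request and render more replies
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds
```

With the setup above it could be called as `scroll_until_stable(driver)` right after `driver.get(TARGET_TWEET)`, so that every request made while scrolling passes through the proxy before the HAR is read.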

And that's it. The JSON I obtain this way is not empty, so I don't think there are errors anywhere: it contains the profile pictures of the replying users, and even the GIFs/images embedded in the replies, but not the text of the replies themselves.

I've tried playing with the options of the new_har method and also using Firefox instead of Chrome as the driver, but nothing changed. The point is: if I open the tweet, open the dev tools (F12) and manually download the generated .har file, the replies' texts are there, so I'm pretty sure the information I'm looking for is there and I'm missing it.
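
One way to compare the two HAR files is to list which JSON responses actually carry a captured body. The sketch below relies only on the standard HAR layout (`log.entries[].response.content` with `mimeType`, `text` and an optional `encoding` of `base64`). The function names `response_bodies` and `summarize` are hypothetical, and the assumption that reply text arrives in JSON responses is a guess to verify against the dev tools export, not a confirmed fact:

```python
import base64


def response_bodies(har, mime_substring="json"):
    """Yield (url, body_text) for entries whose response MIME type matches.

    body_text is None when the entry carries no captured content.
    """
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType") or ""
        if mime_substring not in mime:
            continue
        text = content.get("text")
        # HAR stores binary or re-encoded bodies as base64
        if text is not None and content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        yield entry["request"]["url"], text


def summarize(har, mime_substring="json"):
    """Split matching URLs into those with a captured body and those without."""
    captured, missing = [], []
    for url, text in response_bodies(har, mime_substring):
        (captured if text else missing).append(url)
    return captured, missing
```

The proxy-side HAR is available as `bmob_proxy.har`, and a file saved from the dev tools can be loaded with `json.load`. Running `summarize` on both shows whether the JSON requests are absent from the proxy capture altogether or present with empty bodies.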

Suggestions?

EDIT

Here is an example of the kind of data I'm getting with the procedure described above: https://drive.google.com/file/d/1ChrbOTCUYww2lwEBuYUQRGZJGnNEND6s/view?usp=sharing


Solution

  • I found and implemented a solution using Python and Selenium; it is featured in this repo: https://github.com/ScrPzz/twitter_replies_scraper