I'm trying to scrape the replies to a specific tweet using Selenium and browsermob-proxy. Here's my setup:
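from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Proxy: start BrowserMob Proxy and create a proxy instance to record traffic
bmob_server = Server("./additional_lib/browsermob-proxy-2.1.4/bin/browsermob-proxy")
bmob_server.start()
bmob_proxy = bmob_server.create_proxy()
# Driver: route Chrome through the proxy and accept its self-signed certificate
co = webdriver.ChromeOptions()
co.add_argument('--ignore-ssl-errors=yes')
co.add_argument('--ignore-certificate-errors')
co.add_argument('--proxy-server={host}:{port}'.format(host='localhost', port=bmob_proxy.port))
# Selenium 4 takes `options=`; `chrome_options=` is the deprecated spelling
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=co)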
After logging in to Twitter, I prepare the HAR and open the tweet:
bmob_proxy.new_har(TARGET_TWEET, options={'captureHeaders': True, 'captureContent':True})
driver.get(TARGET_TWEET)
Scrolling to load more replies:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
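A single scroll usually loads only one more batch, so in practice this has to be repeated. Here is a minimal sketch of how I'd loop it. The helper name `scroll_until_stable` and the `pause` and `max_rounds` parameters are my own, and it assumes the page height grows whenever more replies are loaded, which may not always hold on Twitter:

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to fetch and render more replies
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds
```

With the setup above this would be called as `scroll_until_stable(driver)`; `max_rounds` is only a safety cap so the loop cannot run forever.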
And that's it. The JSON I obtain this way is not empty, so I assume nothing is failing outright. It contains the profile pictures of the replying users, and even the GIFs/images embedded in the replies, but not the text of the replies themselves.
I've tried playing with the options of the new_har method and also using Firefox instead of Chrome as the driver, but nothing changed.
The point is: if I open the tweet, open the dev tools (F12) and manually download the generated .har file, the replies' texts are there. So I'm fairly sure the data I'm looking for passes through the browser and that my capture is missing it.
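To compare the two HAR files, something like the following can list which entries actually carry a response body containing a given string. It is only a sketch: the function name `entries_with_text` is made up, and the search string (for example `"full_text"`) is a placeholder for whatever key or phrase shows up in the manually exported file. It relies on the standard HAR layout (`log.entries[].response.content.text`, with an optional `encoding` of `base64`):

```python
import base64


def entries_with_text(har, needle):
    """Return the URLs of HAR entries whose captured response body contains `needle`."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        body = content.get("text")
        if not body:
            continue  # no body was captured for this entry
        if content.get("encoding") == "base64":
            # HAR stores some bodies base64-encoded; decode before searching
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        if needle in body:
            hits.append(entry["request"]["url"])
    return hits
```

With the setup above, the proxy's capture should be available as `bmob_proxy.har`, and the manually exported file can be loaded with `json.load`. Running the same search on both would show whether the reply requests are missing from the proxy capture entirely or are present without a body.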
Suggestions?
Here's an example of the kind of data I'm getting with the procedure described above: https://drive.google.com/file/d/1ChrbOTCUYww2lwEBuYUQRGZJGnNEND6s/view?usp=sharing
I found and implemented a solution using Python and Selenium; it is available in this repo: https://github.com/ScrPzz/twitter_replies_scraper