Tags: python, web-scraping, jupyter-notebook, twitter

Scrape replies to a tweet using Python in a Jupyter Notebook


I've seen some questions and posts on how to scrape the tweets of a specific handle, but not on how to get all the replies to a particular tweet using Python in a Jupyter Notebook.

Example: I want to scrape all 340 replies to this public BBC tweet, "Microplastics found in fresh Antarctic snow for the first time" (https://twitter.com/BBCWorld/status/1534777385249390593), and export them to Excel.

I need the following info: the reply date, the "Replying to" target (so I only get replies to BBC, and not to other users in this thread), and the reply text.

Inspecting the page's elements, I see that the reply container's class is named css-1dbjc4n. Likewise:

  1. The Reply date's class is: css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg
  2. The Reply to's class is: css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0
  3. And the Reply text's class is: css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0

I have tried to run the code below, but the lists remain empty :(

Results so far:

    Empty DataFrame
    Columns: [Date of Tweet, Replying to, Tweet]
    Index: []

Can anyone help me, please? Many thanks! :)

Code:

import sys
sys.path.append("path to site-packages in your pc")

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r"C:chromedriver path in your pc")

dates=[] #List to store date of tweet
replies=[] #List to store reply to info
comments=[] #List to store comments
driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")

twts=[]

content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('div',href=True, attrs={'class':'css-1dbjc4n'}):
    datetweet=a.find('div', attrs={'class':'css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg'})
    replytweet=a.find('div', attrs={'class':'css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0'})
    commenttweet=a.find('div', attrs={'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
    dates.append(datetweet.text)
    replies.append(replytweet.text)
    comments.append(commenttweet.text) 

df = pd.DataFrame({'Date of Tweet':dates,'Replying to':replies,'Tweet':comments})
df.to_csv('tweets.csv', index=False, encoding='utf-8')
print(df)

Solution

  • I found two problems:

    1. The page uses JavaScript to add elements, and it may need time to add them all to the HTML, so you may need time.sleep(...) before you get driver.page_source. Or use explicit waits in Selenium to wait for specific elements before getting driver.page_source (see the sketch after this list).

    2. The HTML doesn't use <div href="...">, so your findAll('div', href=True, ...) matches nothing. You have to remove href=True.
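
    For problem 2, the corrected call is simply soup.findAll('div', attrs={'class': 'css-1dbjc4n'}). For problem 1, here is a minimal sketch of the explicit-wait approach; the 15-second timeout and the //article[@data-testid="tweet"] locator are assumptions, so adjust them for your page and connection:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    # wait until at least one tweet <article> is attached to the DOM
    # before reading the page source (instead of a fixed sleep)
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.XPATH, '//article[@data-testid="tweet"]'))
    )
    content = driver.page_source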


    EDIT:

    Below is the code I created, but it also needs to scroll the page to load more tweets, and later it may need to click "Show more replies" to get even more of them; see the scrolling sketch after the code.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException
    
    #from webdriver_manager.chrome import ChromeDriverManager
    from webdriver_manager.firefox import GeckoDriverManager
    
    import pandas as pd
    import time
    
    #driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
    
    driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")
    
    time.sleep(10)
    
    # TODO: scroll page to get more tweets
    #for _ in range(2):
    #    last = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')[-1]
    #    driver.execute_script("arguments[0].scrollIntoView(true)", last)
    #    time.sleep(3)
    
    all_tweets = driver.find_elements(By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')
    
    tweets = []
    
    print(len(all_tweets) - 1)  # number of replies found (the first <article> is the BBC tweet itself)
    for item in all_tweets[1:]: # skip first tweet because it is BBC tweet
        #print('--- item ---')
        #print(item.text)
    
        print('--- date ---')
        try:
            date = item.find_element(By.XPATH, './/time').text
        except NoSuchElementException:
            date = '[empty]'
        print(date)
        
        print('--- text ---')
        try:
            text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
        except NoSuchElementException:
            text = '[empty]'
        print(text)
    
        print('--- replying_to ---')
    
        try:
            replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
        except NoSuchElementException:
            replying_to = '[empty]'
        print(replying_to)
    
        tweets.append([date, replying_to, text])
    
    df = pd.DataFrame(tweets, columns=['Date of Tweet', 'Replying to', 'Tweet'])
    df.to_csv('tweets.csv', index=False, encoding='utf-8')
    print(df)
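
    And here is a rough sketch of the scrolling TODO above; it belongs where the TODO comment sits, before collecting all_tweets. The iteration cap, the 3-second pauses, and the "Show more replies" locator text are assumptions; Twitter also virtualizes the timeline, so on a long thread you may need to collect tweets inside the loop instead of once at the end.

    prev_count = 0
    for _ in range(30):  # safety cap (an assumption) so the loop always ends
        cells = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')
        if len(cells) == prev_count:
            # nothing new appeared after the last scroll - try the
            # "Show more replies" button once, then stop if it is absent
            try:
                more = driver.find_element(By.XPATH, '//span[text()="Show more replies"]')
                driver.execute_script("arguments[0].click()", more)
                time.sleep(3)
                continue
            except NoSuchElementException:
                break
        prev_count = len(cells)
        driver.execute_script("arguments[0].scrollIntoView(true)", cells[-1])
        time.sleep(3)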