
tweepy.Cursor returns search results unrelated to my query


As part of my Master's Degree, I need to collect data from Twitter for future machine learning models.

What's the problem?

I am trying to get tweets with a given hashtag, something really simple such as #climatechange. As I understood from other questions on Stack Overflow, I need to add the q parameter and pass the "#climatechange" string to it.

Here is the code:

import json

import tweepy

# Load the credentials from a JSON file (helper defined elsewhere).
twitter_credentials_json = load_twitter_credentials('TwitterCredentials.json')

# Create the tweepy.API object.
auth = tweepy.OAuthHandler(twitter_credentials_json['consumer_key'], twitter_credentials_json['consumer_secret'])
auth.set_access_token(twitter_credentials_json['access_token'], twitter_credentials_json['access_token_secret'])
api = tweepy.API(auth, wait_on_rate_limit=True)

data_list = []

# Iterate over the matching tweets and collect their raw JSON.
for tweet in tweepy.Cursor(api.search, q="#climatechange", since="2020-01-01", until="2020-10-01").items(100):
    data_list.append(tweet._json)

# Write everything to the file system (get_datetime_as_string is a helper defined elsewhere).
# The with block closes the file, so no explicit close() is needed.
with open(f"Tweets {get_datetime_as_string()}.json", 'w', encoding='utf8') as outfile:
    json.dump(data_list, outfile)

As you can see, I am searching Twitter for every tweet that contains the string "#climatechange" between 2020-01-01 and 2020-10-01, and I take 100 items. When I open the JSON file, I see some unrelated tweets that don't contain the text "#climatechange". I also checked the whole object that I received from tweepy, and there is no mention of the "#climatechange" string anywhere in it.

For example:

"text": "RT @BetteMidler: The #GOP cannot govern. Remember they presided over #9-11, the #IraqWar, the 2008 #GreatRecession, & when they returned t\u2026"

"text": "RT @DeWayne_Fulton: #Texas can lead the way in energy innovation--safe, clean, efficient, renewable energy.\n\n@Lizzie4Congress knows that th\u2026",

To summarize so far:

  1. I get tweets from Twitter that match specific conditions.
  2. I save them to the file system.
  3. I open the JSON file and about 10% of the tweets don't have the "#climatechange" string in them.

What have I tried to solve this issue?

  1. The first thing I did was read the official tweepy documentation for the Cursor object (http://docs.tweepy.org/en/v3.9.0/cursor_tutorial.html), but I didn't find anything useful there. I couldn't even find the q parameter, although many Stack Overflow solutions use it. The documentation seems incomplete. Am I looking in the wrong place?

  2. I searched Stack Overflow and a few other sites to see whether anyone else had this issue, but I didn't find anything relevant.

  3. I looked at tweepy.Cursor solutions on Stack Overflow and adjusted my parameters, adding some and removing others, but nothing changed.

  4. I read the tweepy.Cursor source code on GitHub, but I didn't fully understand how it works, so no success there.


As I understand it, once I specify the "q" parameter, the search should match tweets that contain the query string and return only those. Instead, it also returns unrelated tweets.

I would be happy to get some help, or a pointer to what I am missing. I am sure it's something small, and that it's the reason I don't get the right data.

Thanks.


Solution

  • It’s very likely that the Tweets that seem unrelated are truncated to 140 characters, and the text you searched for is in the “extended” part of the Tweet. If you add tweet_mode="extended" to the api.search call, it should return the complete text of those Tweets in the full_text field.

    You should also be aware that the legacy standard Twitter search API (this is what api.search is calling) only supports searching back within the past 7 days of Tweets. For a longer period of time you will need to use the Twitter premium 30-day or full-archive search APIs.
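
    A minimal sketch of that change, assuming tweepy 3.x (where the search method is still named api.search) and the authenticated api object from the question. The helper names extract_full_text, mentions_hashtag and fetch_tweets are hypothetical and only here for illustration. Note that for retweets the outer text still starts with "RT @user:" and may be cut off even in extended mode, so the sketch reads the original Tweet from retweeted_status when it is present.

```python
def extract_full_text(tweet_json):
    """Return the most complete text available in a tweet's raw JSON dict."""
    # For retweets, the untruncated text lives on the original tweet.
    original = tweet_json.get("retweeted_status")
    if original:
        return extract_full_text(original)
    # Extended mode fills "full_text"; the default mode only fills "text".
    return tweet_json.get("full_text") or tweet_json.get("text", "")


def mentions_hashtag(tweet_json, hashtag):
    """Case-insensitive substring check for a hashtag in the tweet's text."""
    return hashtag.lower() in extract_full_text(tweet_json).lower()


def fetch_tweets(api, query, limit=100):
    """Collect raw JSON for up to `limit` tweets, requesting extended mode.

    `api` is assumed to be an authenticated tweepy.API object (tweepy 3.x).
    """
    import tweepy  # imported here so the helpers above work without tweepy

    cursor = tweepy.Cursor(api.search, q=query, tweet_mode="extended")
    return [status._json for status in cursor.items(limit)]
```

    With these in place, something like [t for t in fetch_tweets(api, "#climatechange") if not mentions_hashtag(t, "#climatechange")] lists any results that still do not show the hashtag in their text.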