I am using the telethon and asyncio libraries to scrape messages on Telegram, but my Python script below won't stop running. It gets to the stage where it prints all the filtered messages correctly (the line print(message.date, message.text, message.sender_id)), but it never writes the CSV file (the line df.to_csv('filename.csv', encoding='utf-8')). It just keeps running.
import asyncio
import nest_asyncio
from telethon.sync import TelegramClient
import pandas as pd
username = "MY-USERNAME" # your Telegram account username
api_id = MY-API-ID # your Telegram account API ID
api_hash = "MY-API-HASH" # your Telegram account API hash
phone = "MY-PHONE-NO" # your Telegram account mob. no. with country code
channel_username = "clickhouse_en" # channel username
start_date_value = "2023-07-01 00:00:00" # Specify the date and time range (in UTC)
end_date_value = "2023-07-25 23:59:59" # Specify the date and time range (in UTC)
keywords_value = [] # Specify the keywords to filter, eg. keywords_value = ["data", "report"]
# Apply nest_asyncio to enable running an event loop within a running loop
nest_asyncio.apply()
async def main(start_date=None, end_date=None, keywords=None):
    # Convert date and time range to pandas timestamp format
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)
    # Convert keywords to lowercase
    keywords = [keyword.lower() for keyword in keywords]
    data = []
    async with TelegramClient(username, api_id, api_hash) as client:
        async for message in client.iter_messages("https://t.me/" + channel_username):
            if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
                # Convert message to lowercase and split the message into individual words
                words_lower_in_msg = str(message.text).lower().split()
                # 'if not keywords' means when no keywords are given, ie. keywords_value = []
                if not keywords or any(keyword == word_lower_in_msg for word_lower_in_msg in words_lower_in_msg for keyword in keywords):
                    print(message.date, message.text, message.sender_id)
                    data.append([message.date, message.text, message.sender_id])
    # creates a new dataframe
    df = pd.DataFrame(data, columns=["message.date", "message.text", "message.sender_id"])
    # creates a csv file
    df.to_csv('filename.csv', encoding='utf-8')
# Get the event loop and run the main function
loop = asyncio.get_event_loop()
loop.run_until_complete(main(start_date = start_date_value, end_date = end_date_value, keywords = keywords_value))
I am using Jupyter Notebook, which means I need to use the nest_asyncio library in the script above.
I would really appreciate it if anyone could help with the above and let me know where I went wrong. Many thanks in advance!
Take a look at this piece:
async for message in client.iter_messages("https://t.me/" + channel_username):
    if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
        # ...
async for goes through all messages asynchronously and processes the ones that satisfy the condition. But once it has handled all the matching messages, it keeps going through every older message until it reaches the very first one in the channel, which may take a while.
Assuming that the client yields messages from newest to oldest, you may want to stop iterating at the first message that is too old:
async for message in client.iter_messages("https://t.me/" + channel_username):
    if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
        # ...
    if start_date.timestamp() > pd.Timestamp(message.date).timestamp():
        break
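To see how much work the break saves, here is a minimal offline sketch. It does not touch Telegram: fake_messages is a hypothetical stand-in for client.iter_messages that yields one timestamp per day, newest first, and the channel size of 1000 messages is an arbitrary assumption.

```python
import asyncio
from datetime import datetime, timedelta, timezone


async def fake_messages(newest, count, fetched):
    """Hypothetical stand-in for client.iter_messages: one date per day, newest first."""
    for i in range(count):
        fetched.append(i)  # record every message that was actually pulled
        yield newest - timedelta(days=i)


async def collect(start_date, end_date, stop_early):
    """Return (number of messages kept, number of messages fetched)."""
    fetched, kept = [], []
    newest = datetime(2023, 8, 10, tzinfo=timezone.utc)
    async for date in fake_messages(newest, 1000, fetched):
        if start_date <= date <= end_date:
            kept.append(date)
        if stop_early and date < start_date:
            break  # every remaining message is older still
    return len(kept), len(fetched)


start = datetime(2023, 7, 1, tzinfo=timezone.utc)
end = datetime(2023, 7, 25, 23, 59, 59, tzinfo=timezone.utc)
full_scan = asyncio.run(collect(start, end, stop_early=False))
early_stop = asyncio.run(collect(start, end, stop_early=True))
print(full_scan, early_stop)
```

With these made-up numbers both runs keep the same 25 messages, but the full scan pulls all 1000 while the early stop pulls only 42. In a notebook, where an event loop is already running, await collect(...) in a cell should take the place of asyncio.run.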
I think the script should be improved even further by using the offset_date parameter of iter_messages, as shown here.
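A rough sketch of the offset_date idea, assuming start_date and end_date are timezone-aware datetimes (Telethon's message.date is in UTC). offset_date is exclusive: only messages older than it are returned, so the iteration begins at the end of the range instead of at the newest message. collect_in_range and FakeClient are hypothetical names; FakeClient only imitates the call so that the snippet runs without credentials.

```python
import asyncio
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


async def collect_in_range(client, channel, start_date, end_date):
    """Return [date, text, sender_id] rows for messages with start_date <= date < end_date."""
    rows = []
    # offset_date requests only messages older than end_date,
    # so anything newer is never downloaded.
    async for message in client.iter_messages(channel, offset_date=end_date):
        if message.date < start_date:
            break  # past the start of the range, nothing older is needed
        rows.append([message.date, message.text, message.sender_id])
    return rows


class FakeClient:
    """Hypothetical offline stand-in that mimics iter_messages(offset_date=...)."""

    def __init__(self, newest, count):
        # One message per day, newest first
        self.dates = [newest - timedelta(days=i) for i in range(count)]

    async def iter_messages(self, channel, offset_date=None):
        for i, date in enumerate(self.dates):
            if offset_date is None or date < offset_date:
                yield SimpleNamespace(date=date, text=f"msg {i}", sender_id=i)


start = datetime(2023, 7, 1, tzinfo=timezone.utc)
end = datetime(2023, 7, 26, tzinfo=timezone.utc)
client = FakeClient(datetime(2023, 8, 10, tzinfo=timezone.utc), 1000)
rows = asyncio.run(collect_in_range(client, "clickhouse_en", start, end))
print(len(rows), rows[0][0], rows[-1][0])
```

With a real TelegramClient, the same function would be called inside the async with TelegramClient(...) as client block, and the returned rows can be passed to pd.DataFrame as in the question.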