python  python-asyncio  telegram  telethon  nest-asyncio

Python script using `asyncio` library won't stop running


I am using the libraries telethon and asyncio to scrape messages on Telegram, but my Python script below won't stop running. It gets to the stage where it prints all the filtered messages correctly (at the line `print(message.date, message.text, message.sender_id)`), but it never writes the CSV file (at the line `df.to_csv('filename.csv', encoding='utf-8')`). It just keeps running.

import asyncio
import nest_asyncio
from telethon.sync import TelegramClient
import pandas as pd

username = "MY-USERNAME"   # your Telegram account username
api_id = MY-API-ID   # your Telegram account API ID
api_hash = "MY-API-HASH"   # your Telegram account API hash
phone = "MY-PHONE-NO"  # your Telegram account mob. no. with country code
channel_username = "clickhouse_en"  # channel username
start_date_value = "2023-07-01 00:00:00"  # Specify the date and time range (in UTC)
end_date_value = "2023-07-25 23:59:59"    # Specify the date and time range (in UTC)
keywords_value = [] # Specify the keywords to filter, eg. keywords_value = ["data", "report"]

# Apply nest_asyncio to enable running an event loop within a running loop
nest_asyncio.apply()  

async def main(start_date=None, end_date=None, keywords=None):
    # Convert date and time range to pandas timestamp format
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)

    # Convert keywords to lowercase
    keywords = [keyword.lower() for keyword in keywords]
    
    data = [] 
    async with TelegramClient(username, api_id, api_hash) as client:
        async for message in client.iter_messages("https://t.me/" + channel_username):
            if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
                # Convert message to lowercase and split the message into individual words
                words_lower_in_msg = str(message.text).lower().split()
                # 'if not keywords' means when no keywords are given, ie. keywords_value = []
                if not keywords or any(keyword == word_lower_in_msg for word_lower_in_msg in words_lower_in_msg for keyword in keywords):
                    print(message.date, message.text, message.sender_id) 
                    data.append([message.date, message.text, message.sender_id])

    # creates a new dataframe
    df = pd.DataFrame(data, columns=["message.date", "message.text", "message.sender_id"])

    # creates a csv file
    df.to_csv('filename.csv', encoding='utf-8')

# Get the event loop and run the main function
loop = asyncio.get_event_loop()
loop.run_until_complete(main(start_date = start_date_value, end_date = end_date_value, keywords = keywords_value))

I am using Jupyter Notebook, which means I need to use the nest_asyncio library in the script above.

I would really appreciate it if anyone could help with the above and let me know where I have gone wrong. Many thanks in advance!


Solution

  • Take a look at this piece:

    async for message in client.iter_messages("https://t.me/" + channel_username):
        if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
            # ...
    

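    `async for` goes through all the messages asynchronously and processes those that satisfy the condition. But once it has handled all the matching messages, it keeps iterating over the remaining, older messages until it reaches the very first one in the channel, which may take a while. The CSV file is only written after the loop ends, which is why the script appears to hang.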

    Assuming that the client yields messages from newest to oldest, you may want to stop iterating at the first message that is too old:

    async for message in client.iter_messages("https://t.me/" + channel_username):
        if start_date.timestamp() <= pd.Timestamp(message.date).timestamp() <= end_date.timestamp():
            # ...
        if start_date.timestamp() > pd.Timestamp(message.date).timestamp():
            break
    

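    I think the script could be improved even further by passing offset_date to iter_messages, so that iteration starts at the end of the date range instead of at the newest message in the channel.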
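
The two suggestions above (stopping early and using offset_date) can be combined. The following is only a minimal sketch, not a drop-in replacement: `collect_in_range` is a hypothetical helper name, it assumes `client` is an already connected Telethon client, that messages arrive newest first, and that `start_date` and `end_date` are timezone-aware UTC datetimes (Telethon's `message.date` is timezone-aware, so comparing it with a naive datetime would raise a `TypeError`).

```python
async def collect_in_range(client, channel, start_date, end_date):
    """Collect [date, text, sender_id] rows for messages dated in [start_date, end_date).

    Hypothetical helper: assumes `client.iter_messages` yields messages
    from newest to oldest and that both dates are timezone-aware (UTC).
    """
    rows = []
    # offset_date requests only messages older than end_date (exclusive),
    # so nothing newer than the range has to be downloaded.
    async for message in client.iter_messages(channel, offset_date=end_date):
        if message.date < start_date:
            # Everything that follows is older still, so stop here
            # instead of walking back to the first message in the channel.
            break
        rows.append([message.date, message.text, message.sender_id])
    return rows
```

Inside the question's `async with TelegramClient(...) as client:` block this could be called as `data = await collect_in_range(client, "https://t.me/" + channel_username, start, end)`, with the keyword filter applied to `data` afterwards. Because offset_date is exclusive, a message stamped exactly at `end_date` would be skipped; add one second to `end_date` if that boundary matters.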