Tags: python, selenium-webdriver, web-scraping

Data scraping dynamic website performance


I want to scrape a newspaper archive website (genios.de) and am facing the issue that the table of contents of each issue is only rendered dynamically once its tile is clicked: a preview window opens and the TOC becomes accessible. I am using Selenium and am already getting the data that I want. The issue, however, is efficiency, because once I have clicked one tile, its content is rendered and then remains in the DOM when I click further tiles. Is there a simple and efficient way to make sure that the "old" content is no longer present once I click on the next tile, other than the extreme driver.refresh, which would surely not be more efficient?

In its current form, the rendering as well as the extraction of the headlines takes longer with each click, growing from fractions of a second to up to two seconds.
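
The kind of thing I am after is dropping the stale preview from the DOM after each extraction instead of refreshing. A minimal sketch of that idea (the selectors here are placeholders, not the real ones):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    driver.get("https://www.genios.de/browse/Alle/Presse")
    
    for tile in driver.find_elements(By.CSS_SELECTOR, "input"):
        tile.click()
        # ".preview-window" is a placeholder for the real preview selector
        preview = driver.find_element(By.CSS_SELECTOR, ".preview-window")
        headlines = [a.text for a in preview.find_elements(By.TAG_NAME, "a")]
        print(headlines)
        # delete the rendered preview node so stale content cannot pile up
        driver.execute_script("arguments[0].remove();", preview)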

Edit: I have shortened my question significantly to narrow it down to just this one aspect.


Solution

  • Used their API endpoint to fetch the data that is otherwise rendered in the preview window

    Please run this script and see if this is what you want (full details are not available since this is a paid archive app)
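
    For a quick look at what the endpoint returns on its own before running the full script below, a minimal request might look like this (NSA is used as the database code only because it shows up in the document IDs in the output further down; substitute the data-db value of whatever publication you are after):

    import requests
    
    # the tableOfContents endpoint returns the preview data as plain JSON
    url = "https://www.genios.de/api/tableOfContents/NSA?useDateForTableOfContents=1"
    for article in requests.get(url).json()['latestArticles']:
        print(article['issueDateString'], article['documentTitle'])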

    Code:

    from datetime import datetime, timedelta
    from bs4 import BeautifulSoup
    from urllib.parse import quote
    import pandas as pd
    import concurrent.futures
    import json, os, requests
    
    file = 'file.csv'
    # collect one dict per article in a single list, so concurrent threads
    # cannot misalign the columns of a row
    records = []
    
    def dateRange(start_date, end_date):
        # daily DatetimeIndex from start_date (inclusive) to end_date (exclusive)
        start = datetime.strptime(start_date, "%Y-%m-%d")
        end = datetime.strptime(end_date, "%Y-%m-%d")
        return pd.date_range(start, end - timedelta(days=1), freq='d')
    
    def getData(url):
        session = requests.Session()
        resp = session.get(url).content
        soup = BeautifulSoup(resp, 'lxml')
        # every <input> tile on the browse page carries the database code
        # of one publication in its data-db attribute
        for tile in soup.find_all('input'):
            db = tile['data-db'].upper()
            toc_url = f"https://www.genios.de/api/tableOfContents/{db}?useDateForTableOfContents=1"
            json_data = json.loads(session.get(toc_url).content)
            for article in json_data['latestArticles']:
                records.append({
                    "ISSUE_DATE": article['issueDateString'],
                    "DOCUMENT_TITLE": article['documentTitle'],
                    "DOCUMENT_LINK": f"https://www.genios.de/document/{article['documentId']}",
                    "DOCUMENT_ARCHIVE_LINK": f"https://www.genios.de{quote(article['archiveLink'])}",
                })
    
    def read_csv(file, start_date, end_date):
        df = pd.read_csv(file)
        date_range = dateRange(start_date, end_date)  # compute once, not per row
        for row in df.to_dict(orient='records'):
            if row['ISSUE_DATE'] in date_range:
                print(f"--------------------\nISSUE DATE: {row['ISSUE_DATE']}\nDOCUMENT TITLE: {row['DOCUMENT_TITLE']}\nDOCUMENT LINK: {row['DOCUMENT_LINK']}\nDOCUMENT ARCHIVE LINK: {row['DOCUMENT_ARCHIVE_LINK']}")
    
    def check_file(start_date, end_date):
        if os.path.exists(f'./{file}'):
            # results are cached from an earlier run; just filter and print
            read_csv(file, start_date, end_date)
        else:
            urls = []
            for offset in range(25, 500, 25):  # pagination offsets 25, 50, ..., 475
                urls.append(f"https://www.genios.de/browse/Alle/Presse?offset={offset}&partial=true&sort=BY_DATE&hasFilterList=true")
            
            # use concurrent.futures to speed up the requests
            with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
                future_to_url = [executor.submit(getData, url) for url in urls]
                for future in concurrent.futures.as_completed(future_to_url):
                    try:
                        future.result()
                    except Exception as exc:
                        print(f"request failed: {exc!r}")
            
            df = pd.DataFrame(records)
            df.to_csv(file, index=False)
            read_csv(file, start_date, end_date)
    
    
    check_file('2024-1-1', '2024-10-31')
    

    Output:

    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Fahrt in den Tod
    DOCUMENT LINK: https://www.genios.de/document/NSA__f928bbe1f78a17fa3feee191fb17bf7c3bb75d09
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Ohne Titel
    DOCUMENT LINK: https://www.genios.de/document/NSA__b40ee7ca9b37d0bc3a6aeb5a7c4a778a7999980a
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Was macht eigentlich der Verein "Hawelti"?
    DOCUMENT LINK: https://www.genios.de/document/NSA__a9722a7f3ec06028563980949ba80f3c1e6c4d64
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Überfälliges Erinnern
    DOCUMENT LINK: https://www.genios.de/document/NSA__37872e8b8bd375b19455e366efb6a53714947933
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Nadel und Nähkästchen
    DOCUMENT LINK: https://www.genios.de/document/NSA__2e76bdfbcb41041bd5ab0b86d1c209ef063b1abe
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Alles für die Oase vor der Haustür
    DOCUMENT LINK: https://www.genios.de/document/NSA__f711c56c271bf48d0e6d688b2126e12bc08c1a64
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
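
    One thing to keep in mind: the script caches everything it fetches in file.csv, and subsequent runs only read and filter that file, so delete file.csv whenever you want to pull a fresh set of articles.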
    

    Let me know if this works for you!