Tags: python, selenium-webdriver, web-scraping

Data scraping dynamic website performance


I want to scrape a newspaper archive website (genios.de) and am facing the issue that the table of contents of each issue is only rendered dynamically once its tile is clicked: a preview window opens and the TOC becomes accessible. I am using Selenium and am already getting the data that I want. The issue, however, is efficiency, because once I have clicked one tile, its content is rendered and then remains in the DOM when I click further tiles. Is there a simple and efficient way to make sure that the "old" content is no longer present once I click on the next tile, other than the extreme driver.refresh, which would surely not be more efficient?

In its current form, the rendering as well as the extraction of the headlines takes longer with each click, growing from fractions of a second to up to two seconds.
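
The kind of thing I am after is dropping the stale preview from the DOM after each extraction instead of refreshing. A minimal sketch of that idea (the selectors here are placeholders, not the real ones):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    driver.get("https://www.genios.de/browse/Alle/Presse")
    
    for tile in driver.find_elements(By.CSS_SELECTOR, "input"):
        tile.click()
        # ".preview-window" is a placeholder for the real preview selector
        preview = driver.find_element(By.CSS_SELECTOR, ".preview-window")
        headlines = [a.text for a in preview.find_elements(By.TAG_NAME, "a")]
        print(headlines)
        # delete the rendered preview node so stale content cannot pile up
        driver.execute_script("arguments[0].remove();", preview)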

Edit: I have shortened my question significantly to narrow it down to just this one aspect.


Solution

  • Used their API endpoint to fetch the data that is otherwise rendered in the preview window

    Please run this script and see if this is what you want (full details are not available since this is a paid archive app)
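
    For a quick look at what the endpoint returns on its own before running the full script below, a minimal request might look like this (NSA is used as the database code only because it shows up in the document IDs in the output further down; substitute the data-db value of whatever publication you are after):

    import requests
    
    # the tableOfContents endpoint returns the preview data as plain JSON
    url = "https://www.genios.de/api/tableOfContents/NSA?useDateForTableOfContents=1"
    for article in requests.get(url).json()['latestArticles']:
        print(article['issueDateString'], article['documentTitle'])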

    Code:

    from datetime import datetime, timedelta
    from bs4 import BeautifulSoup
    from urllib.parse import quote
    import pandas as pd
    import concurrent.futures
    import json, os, requests
    
    file = 'file.csv'
    # collect one dict per article in a single list, so concurrent threads
    # cannot misalign the columns of a row
    records = []
    
    def dateRange(start_date, end_date):
        # daily DatetimeIndex from start_date (inclusive) to end_date (exclusive)
        start = datetime.strptime(start_date, "%Y-%m-%d")
        end = datetime.strptime(end_date, "%Y-%m-%d")
        return pd.date_range(start, end - timedelta(days=1), freq='d')
    
    def getData(url):
        session = requests.Session()
        resp = session.get(url).content
        soup = BeautifulSoup(resp, 'lxml')
        # every <input> tile on the browse page carries the database code
        # of one publication in its data-db attribute
        for tile in soup.find_all('input'):
            db = tile['data-db'].upper()
            toc_url = f"https://www.genios.de/api/tableOfContents/{db}?useDateForTableOfContents=1"
            json_data = json.loads(session.get(toc_url).content)
            for article in json_data['latestArticles']:
                records.append({
                    "ISSUE_DATE": article['issueDateString'],
                    "DOCUMENT_TITLE": article['documentTitle'],
                    "DOCUMENT_LINK": f"https://www.genios.de/document/{article['documentId']}",
                    "DOCUMENT_ARCHIVE_LINK": f"https://www.genios.de{quote(article['archiveLink'])}",
                })
    
    def read_csv(file, start_date, end_date):
        df = pd.read_csv(file)
        date_range = dateRange(start_date, end_date)  # compute once, not per row
        for row in df.to_dict(orient='records'):
            if row['ISSUE_DATE'] in date_range:
                print(f"--------------------\nISSUE DATE: {row['ISSUE_DATE']}\nDOCUMENT TITLE: {row['DOCUMENT_TITLE']}\nDOCUMENT LINK: {row['DOCUMENT_LINK']}\nDOCUMENT ARCHIVE LINK: {row['DOCUMENT_ARCHIVE_LINK']}")
    
    def check_file(start_date, end_date):
        if os.path.exists(f'./{file}'):
            # results are cached from an earlier run; just filter and print
            read_csv(file, start_date, end_date)
        else:
            urls = []
            for offset in range(25, 500, 25):  # pagination offsets 25, 50, ..., 475
                urls.append(f"https://www.genios.de/browse/Alle/Presse?offset={offset}&partial=true&sort=BY_DATE&hasFilterList=true")
            
            # use concurrent.futures to speed up the requests
            with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
                future_to_url = [executor.submit(getData, url) for url in urls]
                for future in concurrent.futures.as_completed(future_to_url):
                    try:
                        future.result()
                    except Exception as exc:
                        print(f"request failed: {exc!r}")
            
            df = pd.DataFrame(records)
            df.to_csv(file, index=False)
            read_csv(file, start_date, end_date)
    
    
    check_file('2024-1-1', '2024-10-31')
    

    Output:

    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Fahrt in den Tod
    DOCUMENT LINK: https://www.genios.de/document/NSA__f928bbe1f78a17fa3feee191fb17bf7c3bb75d09
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Ohne Titel
    DOCUMENT LINK: https://www.genios.de/document/NSA__b40ee7ca9b37d0bc3a6aeb5a7c4a778a7999980a
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Was macht eigentlich der Verein "Hawelti"?
    DOCUMENT LINK: https://www.genios.de/document/NSA__a9722a7f3ec06028563980949ba80f3c1e6c4d64
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Überfälliges Erinnern
    DOCUMENT LINK: https://www.genios.de/document/NSA__37872e8b8bd375b19455e366efb6a53714947933
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Nadel und Nähkästchen
    DOCUMENT LINK: https://www.genios.de/document/NSA__2e76bdfbcb41041bd5ab0b86d1c209ef063b1abe
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
    --------------------
    ISSUE DATE: 2024-04-13
    DOCUMENT TITLE: Alles für die Oase vor der Haustür
    DOCUMENT LINK: https://www.genios.de/document/NSA__f711c56c271bf48d0e6d688b2126e12bc08c1a64
    DOCUMENT ARCHIVE LINK: https://www.genios.de/browse/Alle/Presse/Presse%20Deutschland/N%C3%BCrnberger%20Stadtanzeiger
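
    One thing to keep in mind: the script caches everything it fetches in file.csv, and subsequent runs only read and filter that file, so delete file.csv whenever you want to pull a fresh set of articles.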
    

    Let me know if this works for you!