Tags: python, selenium, selenium-webdriver, webdriverwait, window-handles

How to use Selenium in Python to get one field from each linked page


The context is SpringerLink, for example the GTM book series.

I want to get the information located at the bottom of each book's webpage, in the book info section.

All I want is the E-ISBN on each page.

Is there a way (not limited to Selenium) to enumerate each book page and get this information?


Solution

  • Both Selenium and BeautifulSoup can handle this simple task, but BeautifulSoup is easier and faster, so let's use it to get the titles and E-ISBN codes.

    First install BeautifulSoup and requests (both are imported below) with the command pip install beautifulsoup4 requests.

    Method 1 (faster): get the E-ISBN directly from the book list

    Notice that each entry in the book list has an eBook link of the form https://www.springer.com/book/9783031256325, where 9783031256325 is the E-ISBN without the - characters.

    So we can read the E-ISBN codes directly from those URLs, without loading a new page for each book:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.springer.com/series/136/books'
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    
    # Book titles, in the order they appear in the list
    titles = [title.text.strip() for title in soup.select('.c-card__title')]
    
    EISBN = []
    # Select the eBook link of each book
    for a in soup.select('ul:last-child .c-meta__item:last-child a'):
        # a['href'] is something like https://www.springer.com/book/9783031256325
        c = a['href'].split('/')[-1]
        # Insert four '-' into 9783031256325 to get 978-3-031-25632-5
        EISBN.append(f'{c[:3]}-{c[3]}-{c[4:7]}-{c[7:12]}-{c[-1]}')
    
    # Print each E-ISBN next to its title
    for isbn, title in zip(EISBN, titles):
        print(isbn, title)
    

    Output

    978-3-031-25632-5 Random Walks on Infinite Groups
    978-3-031-19707-9 Drinfeld Modules
    978-3-031-13379-4 Partial Differential Equations
    978-3-031-00943-3 Stationary Processes and Discrete Parameter Markov Processes
    978-3-031-14205-5 Measure Theory, Probability, and Stochastic Processes
    978-3-030-56694-4 Quaternion Algebras
    978-3-030-73839-6 Mathematical Logic
    978-3-030-71250-1 Lessons in Enumerative Combinatorics
    978-3-030-35118-2 Basic Representation Theory of Algebras
    978-3-030-59242-4 Ergodic Dynamics
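
The slicing in Method 1 can also be wrapped in small helpers. This is only a sketch: the function names are made up for illustration, and the 3-1-3-5-1 grouping is an assumption taken from the output above. It fits the 978-3-030 and 978-3-031 prefixes shown there, but ISBN hyphenation in general depends on the publisher's prefix length. The check-digit test uses the standard ISBN-13 rule (weights 1 and 3 alternating, sum divisible by 10), and the URL helper also tolerates a trailing slash or query string.

```python
from urllib.parse import urlparse


def isbn_digits_from_url(url: str) -> str:
    """Return the last non-empty path segment of an eBook URL."""
    path = urlparse(url).path          # drops any query string or fragment
    return path.rstrip("/").rsplit("/", 1)[-1]


def is_valid_isbn13(digits: str) -> bool:
    """Check length, digits only, and the ISBN-13 check digit."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... and the total must be divisible by 10
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def hyphenate_springer_isbn(digits: str) -> str:
    """Format 13 digits as 3-1-3-5-1 groups (assumed layout, see above)."""
    if not is_valid_isbn13(digits):
        raise ValueError(f"not a valid ISBN-13: {digits!r}")
    return f"{digits[:3]}-{digits[3]}-{digits[4:7]}-{digits[7:12]}-{digits[12]}"


if __name__ == "__main__":
    link = "https://www.springer.com/book/9783031256325"
    print(hyphenate_springer_isbn(isbn_digits_from_url(link)))  # 978-3-031-25632-5
```

Validating before formatting means a link that does not end in an ISBN raises an error instead of silently producing a wrong code.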
    

    Method 2 (slower): get E-ISBN by loading a page for each book

    This method loads the details page of each book and extracts the E-ISBN from it:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.springer.com/series/136/books'
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Links to the details page of each book
    books = soup.select('a[data-track-label^="article"]')
    titles, EISBN = [], []
    
    for book in books:
        titles.append(book.text.strip())
        # Load the details page and read the E-ISBN value from it
        soup_book = BeautifulSoup(requests.get(book['href']).text, "html.parser")
        EISBN.append( soup_book.select('p:has(span[data-test=electronic_isbn_publication_date]) .c-bibliographic-information__value')[0].text )
    

    In case you are wondering, p:has(span[data-test=electronic_isbn_publication_date]) selects the p element that contains a span with the attribute data-test=electronic_isbn_publication_date.
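
To see what the :has() part does, here is a minimal sketch on a hand-written snippet. The markup is hypothetical and only loosely modeled on the selectors used in Method 2, so the real Springer page may be structured differently. The softcover data-test value and both ISBN strings are placeholders. It assumes a BeautifulSoup version recent enough to support :has() in select (4.7 or later, which uses the soupsieve package).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> rows share the same value class,
# and only the data-test attribute of the label span tells them apart.
html = """
<ul>
  <li>
    <p>
      <span data-test="softcover_isbn_publication_date">Softcover ISBN</span>
      <span class="c-bibliographic-information__value">PRINT-ISBN-PLACEHOLDER</span>
    </p>
  </li>
  <li>
    <p>
      <span data-test="electronic_isbn_publication_date">eBook ISBN</span>
      <span class="c-bibliographic-information__value">E-ISBN-PLACEHOLDER</span>
    </p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Without :has(), the value class alone matches both rows
all_values = soup.select(".c-bibliographic-information__value")

# With :has(), only the <p> containing the eBook label span is considered
selector = ("p:has(span[data-test=electronic_isbn_publication_date]) "
            ".c-bibliographic-information__value")
eisbn = soup.select_one(selector).text.strip()

print(len(all_values))  # 2
print(eisbn)            # E-ISBN-PLACEHOLDER
```

select_one returns the first match or None, so on a real page it is worth checking for None before reading .text.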