
Get data hidden in ellipses while web scraping


I'm attempting to grab the episode title shown in the header of this website. When inspecting the page elements myself, I can see a line of HTML near the top like this:

<h1 id="epName">...</h1>

which, when I click on the ellipsis, expands to

<h1 id="epName">Friendship is Magic, Part 1</h1>

I've attempted to automate this so I can save the corresponding episodes under their actual titles, as opposed to the season-episode code I'm currently using.

I've tried a basic requests call:

import requests

url = 'https://fim.heartshine.gay/?s=1&e=1&res=480&lo=0'
x = requests.get(url)
text = x.text
print(text)

but the relevant part of the result was

</head>
<body onload="initPage();">
<h1 id="epName"></h1> <div>

with no actual info between the h1 tags.

I've also tried Selenium, as I guessed the title might be filled in by a JavaScript function:

from selenium import webdriver
driver = webdriver.Safari()
driver.get("https://g1.heartshine.gay/?s=1&e=46&res=480")
print(dir(driver))
driver.execute_script('changeEp') #this button controls the resulting epName
p_element = driver.page_source
print(p_element)

but again I get the same relevant output as above.


Solution

  • You don't need Selenium here, as the data is fetched dynamically from a JSON file. You can use requests.get(url).json():

    import requests
    
    url = 'https://fim.heartshine.gay/db.json'
    data = requests.get(url).json()
    

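    To locate such a source yourself, watch the Network tab of your browser's developer tools while the page loads and look for the JSON request. The fetching is done by the page's own JavaScript.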

    The title for season 1 (s=1), episode 1 (e=1) would be:

    data['series']['seasons'][0]['episodes'][0]['epTitle']
    
    # 'Friendship is Magic, Part 1'
    

    But it might be useful to store the data in a pd.DataFrame. E.g., using pd.json_normalize, you could do something like:

    import pandas as pd
    
    seasons = data['series']['seasons']
    
    cols = ['seasNum', 'epNum', 'epTitle']
    df = (pd.json_normalize(seasons, 
                            record_path='episodes', 
                            meta=['seasNum'])
          [cols]
          )
    

    Output (reading head and tail with np.r_):

    import numpy as np
    
    df.iloc[np.r_[0:2, -2:0]]
    
        seasNum  epNum                      epTitle
    0         1      1  Friendship is Magic, Part 1
    1         1      2  Friendship is Magic, Part 2
    260      14     22               Hat in the Way
    261      14     23      Pony Life - New Series!
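
Since the goal was to rename files from a season-episode code to the real title, here is a minimal sketch of how the lookup could be done by season and episode number rather than by list position. It assumes the JSON keeps the structure used above (series → seasons → episodes, with the seasNum, epNum and epTitle keys). The helper names find_title and safe_filename, the S01E02 naming pattern and the 'mp4' extension are hypothetical, and a small stand-in dict is used in place of the downloaded data so the snippet runs offline.

```python
import re


def find_title(data, season, episode):
    """Return the epTitle whose seasNum/epNum match, or None if not found."""
    for seas in data['series']['seasons']:
        # int() tolerates the numbers being stored as strings (an assumption)
        if int(seas['seasNum']) != season:
            continue
        for ep in seas['episodes']:
            if int(ep['epNum']) == episode:
                return ep['epTitle']
    return None


def safe_filename(season, episode, title, ext='mp4'):
    """Build a file name (hypothetical pattern) from the numbers and title."""
    # Replace characters that are not allowed in file names on common systems
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return f'S{season:02d}E{episode:02d} - {cleaned}.{ext}'


# Stand-in for requests.get(url).json(); only mirrors the keys used above
sample = {'series': {'seasons': [{'seasNum': 1, 'episodes': [
    {'epNum': 1, 'epTitle': 'Friendship is Magic, Part 1'},
    {'epNum': 2, 'epTitle': 'Friendship is Magic, Part 2'},
]}]}}

title = find_title(sample, 1, 2)
print(safe_filename(1, 2, title))
# S01E02 - Friendship is Magic, Part 2.mp4
```

With the real data you would pass the dict returned by requests.get(url).json() instead of sample, and check for None before building the name in case a season or episode number is missing.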