python, regex, parsing, beautifulsoup, python-requests

How to parse link extensions out of a script tag in BeautifulSoup with RegEx in Python


I am trying to parse link extensions out of a script tag in a webpage loaded with requests. I'm able to request the page and load the script tag as a Tag element in BeautifulSoup; it looks like this:

{
    id: 'heic2018b',
    title: 'Galaxy NGC 2525',
    width: 3657,
    height: 3920,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
    url: '/images/heic2018b/',
    potw: ''
},

{
    id: 'potw1345a',
    title: 'Antennae Galaxies reloaded',
    width: 4240,
    height: 4211,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
    url: '/images/potw1345a/',
    potw: '11 November 2013'
},

{
    id: 'heic0817a',
    title: 'Magnetic monster NGC 1275',
    width: 4633,
    height: 3590,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic0817a.jpg',
    url: '/images/heic0817a/',
    potw: ''
},

I'm trying to extract the strings after "url", such as "/images/heic2018b/", so that I can append them to another string and request those pages to reach a link to the higher-resolution image on each page.

Here is my code so far; it only returns each instance of "/images" into an array:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://esahubble.org/images/archive/category/galaxies/page/1/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
image_script = soup.find('script')
image_links = re.findall(r"/images/*", str(image_script))

print(image_links)

How can I get these strings into an array so that I can use them later to request the corresponding pages?

Thank you graciously for your time.


Solution

  • Is this what you are looking for? I wrote it so it saves the strings after url: to a text file; feel free to adapt it if useful.

    
    import re
    
    # example data copied from the question
    js_block = r"""
    {
        id: 'heic2018b',
        title: 'Galaxy NGC 2525',
        width: 3657,
        height: 3920,
        src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
        url: '/images/heic2018b/',
        potw: ''
    },
    
    {
        id: 'potw1345a',
        title: 'Antennae Galaxies reloaded',
        width: 4240,
        height: 4211,
        src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
        url: '/images/potw1345a/',
        potw: '11 November 2013'
    },
    
    {
        id: 'heic0817a',
        title: 'Magnetic monster NGC 1275',
        width: 4633,
        height: 3590,
        src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic0817a.jpg',
        url: '/images/heic0817a/',
        potw: ''
    },
    """
    
    # regex to capture the path inside url: '...'
    pattern = r"url\s*:\s*'([^']+)'"
    matches = re.findall(pattern, js_block)
    
    # write each matched path to after_urls.txt, one per line
    with open('after_urls.txt', 'w') as out_file:
        for path in matches:
            out_file.write(path + '\n')
    
    print(f"Wrote {len(matches)} URLs to after_urls.txt")
    

    cheers!
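
  • Since the question asks for the strings in an array rather than a file, the same pattern can also be applied directly to the script tag scraped from the page, and urljoin can turn each relative path into a full page URL ready to request. A minimal sketch; the sample string below stands in for str(soup.find('script')) from the question's code:

    
    import re
    from urllib.parse import urljoin
    
    # stand-in for str(soup.find('script')) from the question's code
    script_text = """
        url: '/images/heic2018b/',
        url: '/images/potw1345a/',
        url: '/images/heic0817a/',
    """
    
    # capture the path inside url: '...'
    pattern = r"url\s*:\s*'([^']+)'"
    paths = re.findall(pattern, script_text)
    
    # build absolute URLs to request later
    full_urls = [urljoin('https://esahubble.org', p) for p in paths]
    print(full_urls)
    

    findall with a single capture group already returns a plain Python list of the captured strings, so paths is the array the question asks for.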