I am trying to parse link extensions out of a script tag in a webpage loaded with requests. I'm able to request the page and load the script tag as a tag element in BeautifulSoup. Its contents look like this:
{
id: 'heic2018b',
title: 'Galaxy NGC 2525',
width: 3657,
height: 3920,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
url: '/images/heic2018b/',
potw: ''
},
{
id: 'potw1345a',
title: 'Antennae Galaxies reloaded',
width: 4240,
height: 4211,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
url: '/images/potw1345a/',
potw: '11 November 2013'
},
{
id: 'heic0817a',
title: 'Magnetic monster NGC 1275',
width: 4633,
height: 3590,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic0817a.jpg',
url: '/images/heic0817a/',
potw: ''
},
I'm trying to extract the strings after "url", such as "/images/heic2018b/", so that I can append them to another string and request those pages, each of which links to a higher-resolution image.
Here is my code so far. I am only able to return each instance of "/images/" in an array:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://esahubble.org/images/archive/category/galaxies/page/1/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
image_script = soup.find('script')
image_links = re.findall(r"/images/*", str(image_script))
print(image_links)
How can I get these strings into an array so that I can use them later to request the corresponding pages?
Thank you graciously for your time.
Is this what you are looking for? I have written this so that it saves the strings after "url" to a text file. Feel free to expand on it if it's useful.
import re
# Here I copied your links as an example
js_block = r"""
{
id: 'heic2018b',
title: 'Galaxy NGC 2525',
width: 3657,
height: 3920,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
url: '/images/heic2018b/',
potw: ''
},
{
id: 'potw1345a',
title: 'Antennae Galaxies reloaded',
width: 4240,
height: 4211,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
url: '/images/potw1345a/',
potw: '11 November 2013'
},
{
id: 'heic0817a',
title: 'Magnetic monster NGC 1275',
width: 4633,
height: 3590,
src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic0817a.jpg',
url: '/images/heic0817a/',
potw: ''
},
"""
#regex to capture what's inside url: '...'
pattern = r"url\s*:\s*'([^']+)'"
matches = re.findall(pattern, js_block)
# Write each match to after_urls.txt, one per line
with open('after_urls.txt', 'w') as out_file:
    for path in matches:
        out_file.write(path + '\n')
print(f"Wrote {len(matches)} after URLs to after_urls.txt")
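If you would rather keep the results in a list than in a file, here is a minimal sketch of the same regex approach that also builds the full page URLs. It assumes the site root is https://esahubble.org (taken from the URL in the question). The names BASE_URL, URL_PATTERN and extract_page_urls are made up for illustration, and the sample text is a shortened copy of the data above.

```python
import re
from urllib.parse import urljoin

# Assumption: site root inferred from the URL in the question.
BASE_URL = "https://esahubble.org"

# Capture whatever sits between the quotes in  url: '...'
# \b keeps the match on a standalone "url" key.
URL_PATTERN = re.compile(r"\burl\s*:\s*'([^']+)'")


def extract_page_urls(script_text, base_url=BASE_URL):
    """Return a list of absolute page URLs found in the script text."""
    paths = URL_PATTERN.findall(script_text)
    return [urljoin(base_url, path) for path in paths]


# Shortened copy of the data from the question, used as sample input.
sample = """
{
    id: 'heic2018b',
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
    url: '/images/heic2018b/',
},
{
    id: 'potw1345a',
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
    url: '/images/potw1345a/',
},
"""

page_urls = extract_page_urls(sample)
print(page_urls)
```

With the code from the question you would pass str(image_script) instead of sample. Keep in mind that soup.find('script') only returns the first script tag on the page, so if the list comes back empty you may need to loop over soup.find_all('script') and run the function on each one.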
cheers!