I'm currently attempting to scrape a wiki for some image files. I have determined that every image I want is hosted at a URL with the following structure:
https://static.wikia.nocookie.net/<game name>/images/X/XY/<file name.png>
In all cases I know the exact file name corresponding to the image I'm searching for. However, here's my issue: X is always a one-digit hexadecimal number (e.g. 3), and XY is always a two-digit hexadecimal number whose first digit is the same as X (e.g. 3c). But as far as I can tell these numbers are completely arbitrary and there is no way to reliably predict them in advance for a specific image I want to retrieve.
My plan moving forward is to search through the entire web directory until I find the files I want, check the exact URL they are stored at, and write them to a local file for instantaneous subsequent lookup. To accomplish this, I see two options:

1. For every combination of X and XY, I could somehow retrieve the entire directory at .../images/X/XY/, check what files are stored there, and write all of the URLs to a local file.
2. For each file, I could request every possible combination of X and XY until I find where the file is stored, and write its URL to a local file (sketched below).

In total I have several thousand images I want to find the URLs for. Given that, option 1 would appear to be an astronomically faster approach, but I'm not sure if retrieving an entire directory of files from the web at once is possible. Can it be done with HTTPS requests (I'm using Node.js for reference)? If not, are there any other tools I could potentially use, or will I have to resort to option 2?
It is in general not possible to predict these URLs unless the website provides a directory listing or some other API to retrieve them. However, if the URLs are derived from information you already know, such as the file name, it may be doable.
Before continuing, I should remind you that different websites have different terms of service which may or may not allow scraping of content (i.e. what you are describing). Furthermore, the media you want to download may be subject to copyright or usage licenses. Make sure you comply with the ToS and with any copyright/licensing terms before downloading and using the content.
Moving on: I have done a quick test on images from static.wikia.nocookie.net
and it seems that the two directory names simply come from the first two hexadecimal characters of the MD5 hash of the file name: X is the first character, and XY is the first two.
Example:
# Full URL: https://static.wikia.nocookie.net/callofduty/images/e/e0/MW3_UAV_Recon.png
echo -n 'MW3_UAV_Recon.png' | md5sum
e05f5e0241b572f06a5246b5f201140b -
Thus if you know the game name and the file name you have all the info you need:
const crypto = require('crypto')

// Build the image URL from the game name and the exact file name:
// X is the first hex digit of the MD5 hash of the file name, XY the first two.
function calculateURL(gameName, fileName) {
  const hash = crypto.createHash('md5').update(fileName).digest('hex')
  return `https://static.wikia.nocookie.net/${gameName}/images/${hash[0]}/${hash[0]}${hash[1]}/${fileName}`
}
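For example, you could then fetch and save each image like this (a quick sketch, untested; it assumes Node 18+ for the global fetch, and downloadImage is just an illustrative name):

const fs = require('fs/promises')

// Compute the URL, download the image, and save it next to the script.
async function downloadImage(gameName, fileName) {
  const url = calculateURL(gameName, fileName)
  const res = await fetch(url)
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`)
  const buffer = Buffer.from(await res.arrayBuffer())
  await fs.writeFile(fileName, buffer)
  return url // e.g. write this to your local lookup file
}

downloadImage('callofduty', 'MW3_UAV_Recon.png')
  .then(url => console.log('Saved from', url))
  .catch(console.error)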