Hello, I have a weird problem using wget in Python and would be grateful for any help.
What I want to do:
Download files ('.pdf', '.djvu') from a specific website (e.g. a wiki) using wget in Python, which should be easy.
I crawl a specific page and collect the file links to pass to wget.
Problem:
It's really weird. On most pages of the website it works well, but on some pages with the same HTML structure it doesn't.
Even on the same page, some files download fine with wget while others fail with the error message below.
Error message:
`C:\start_automation\crawling_job>C:/Users/sa031/AppData/Local/Programs/Python/Python311/python.exe c:/start_automation/crawling_job/download_test.py
Traceback (most recent call last):
  File "c:\start_automation\crawling_job\download_test.py", line 39, in <module>
    wget.download(url)
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\site-packages\wget.py", line 303, in download
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\tempfile.py", line 341, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\tempfile.py", line 256, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\start_automation\\crawling_job\\CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu.3kii8ipd.tmp'`
What I have done:
Googled the error and tested with several different pages on the wiki.
Asked ChatGPT, which gave me the code below using an absolute path, but it doesn't work either.
import os
import wget

def download_file(url, save_path):
    try:
        print("Downloading file...")
        wget.download(url, save_path)
        print("\nDownload complete!")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    # URL of the file to download
    file_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%B6%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%B6%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
    # Specify an absolute path for saving the file
    save_location = os.path.join(os.getcwd(), "downloaded_file.djvu")
    # Call the function to download the file
    download_file(file_url, save_location)
The code:
The code below, with the URL included, is the one that doesn't work.
import wget
url='https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu'
wget.download(url)
My guess:
Maybe the problem is related to the weird .tmp name shown in the error message (.djvu.3kii8ipd.tmp'), but I have no idea.
Thanks for reading. I would really appreciate any help.
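The traceback shows that wget.download first creates a temporary file in the current directory (the tempfile.mkstemp call with dir="."), and that is the step that fails. I do not understand why it needs one. The wget package was last updated 9 years ago and may no longer be robust.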
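One possible explanation, which I have not verified: the temporary file name in the error message is the percent-encoded file name from the URL followed by a random part and ".tmp". Together with the working directory, that path may exceed the classic Windows limit of 260 characters (MAX_PATH), which would also fit the observation that other files on the same page download fine. The sketch below only measures the lengths involved. The shape of the temp name is inferred from the error message, not from wget's documentation, and the "x" characters merely stand in for the random part.

```python
from urllib.parse import unquote, urlsplit
import ntpath

MAX_PATH = 260  # classic Windows limit, which includes the terminating null character

url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
work_dir = r"C:\start_automation\crawling_job"  # directory shown in the traceback

# Last path segment of the URL, still percent-encoded.
filename = urlsplit(url).path.rsplit("/", 1)[-1]

# Assumed shape, taken from the error message: <filename>.<8 random chars>.tmp
temp_name = f"{filename}.{'x' * 8}.tmp"
temp_path = ntpath.join(work_dir, temp_name)

# The same name with the percent-escapes decoded is much shorter.
decoded = unquote(filename)

print("encoded name length:", len(filename))
print("decoded name length:", len(decoded))
print("temp path length:", len(temp_path), "limit:", MAX_PATH)
```

If the temp path length comes out above the limit, running the original script from a much shorter working directory would be a quick way to test this guess.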
You can download the file reliably and easily with requests instead, as follows:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
}
url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
output_file = "downloaded_file.djvu"
chunk_size = 4096  # usually a good chunk size

# Stream the response so the whole file is never held in memory at once.
with requests.get(url, headers=headers, stream=True) as response:
    response.raise_for_status()
    with open(output_file, "wb") as output:
        for chunk in response.iter_content(chunk_size):
            output.write(chunk)
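If you would rather keep the original file name than the fixed downloaded_file.djvu, the name can be derived from the URL. This is a minimal sketch, assuming you want the decoded (human-readable) name and want it kept reasonably short. The helper local_name_from_url and its max_len parameter are hypothetical names of my own, not part of requests or wget.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_name_from_url(url, max_len=100):
    """Return a decoded, length-limited file name taken from the URL path."""
    # Decode the percent-escapes of the last path segment.
    name = unquote(PurePosixPath(urlsplit(url).path).name)
    # Replace characters that Windows does not allow in file names.
    for ch in '<>:"/\\|?*':
        name = name.replace(ch, "_")
    if not name:
        return "downloaded_file"  # fallback when the URL has no file name
    stem, dot, ext = name.rpartition(".")
    if not dot:
        return name[:max_len]
    suffix = dot + ext
    # Shorten the stem only, so the extension survives.
    return stem[: max(1, max_len - len(suffix))] + suffix


url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
output_file = local_name_from_url(url)
print(output_file)
```

The resulting output_file can be passed to open() in the requests code in place of the fixed name.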