Tags: python, pdf, python-requests, urllib3, data-collection

Problem with a downloaded PDF that I cannot open


I am working on a script to extract text from law cases using https://case.law/docs/site_features/api. I have created methods for search and create-xlsx, which work well, but I am struggling with the method that opens an online PDF link, writes it ('wb') to a temp file, reads it back, extracts the core text, and closes it. The ultimate goal is to use the content of these cases for NLP.
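The overall pipeline I have in mind looks roughly like this (a sketch only: `extract_text` is a placeholder, and the real extraction step would use a PDF library such as pypdf):

```python
import os
import tempfile

import requests


def fetch_case_pdf(pdf_url):
    # Download the PDF to a temporary file, extract from it, then delete it.
    r = requests.get(pdf_url)
    r.raise_for_status()
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(r.content)
        return extract_text(path)
    finally:
        os.remove(path)


def extract_text(path):
    # Placeholder: return the raw bytes. A real implementation would
    # parse the PDF (e.g. with pypdf) and return the core text.
    with open(path, "rb") as f:
        return f.read()
```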

I have prepared a function (see below) to download the file:

import urllib3

def download_file(file_id):
    http = urllib3.PoolManager()
    folder_path = "path_to_my_desktop"  # should end with a path separator (or use os.path.join)
    file_download = "https://cite.case.law/xxxxxx.pdf"
    # request() preloads the body; the downloaded bytes are in response.data
    file_content = http.request('GET', file_download)
    with open(folder_path + file_id + '.pdf', 'wb') as file_local:
        file_local.write(file_content.data)

The script runs fine: it downloads the file and creates it on my desktop. But when I try to open the file manually, I get this message from Acrobat Reader:

Adobe Acrobat Reader could not open 'file_id.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded).

I thought the library might be the problem, so I tried with Requests / XlsxWriter / urllib3... (example below). I also tried reading the file from the script itself, to check whether Adobe was the issue, but apparently not.

import re
import requests

# Download the pdf from the search results
URL = "https://cite.case.law/xxxxxx.pdf"
r = requests.get(URL, stream=True)
with open(path_to_desktop + pdf_name + '.pdf', 'w') as f:
    f.write(r.text)

# open the downloaded file and remove '<[^<]+?>' for easier reading
with open('C:/Users/amallet/Desktop/r.pdf', 'r') as ff:
    data_read = ff.read()
    stripped = re.sub('<[^<]+?>', '', data_read)
    print(stripped)

the output is:

document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();

With 'wb' and 'rb' instead (and removing the stripped step), the script is:

r = requests.get(test_case_pdf, stream=True)
with open('C:/Users/amallet/Desktop/r.pdf', 'wb') as f:
    f.write(r.content)

with open('C:/Users/amallet/Desktop/r.pdf', 'rb') as ff:
    data_read = ff.read()
    print(data_read)

and the output is:

<html>
<head>
<noscript>
<meta http-equiv="Refresh" content="0;URL=?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" />
</noscript>
</head>
<body>
<form method="post" id="not-a-bot-form">
<input type="hidden" name="csrfmiddlewaretoken" value="5awGW0F4A1b7Y6bxrYBaA6GIvqx4Tf6DnK0qEMLVoJBLoA3ZqOrpMZdUXDQ7ehOz">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" id="next">
</form>
<script>
document.getElementById(\'next\').value = document.location.toString();
document.getElementById(\'not-a-bot-form\').submit();
</script>
<a href="?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf">Click here to continue</a>
</body>
</html>

but none of these work. The PDF is not password-protected, and I tried other websites without success either.

Therefore, I am wondering whether I have another issue that is not linked to the code itself.

Please let me know if you need additional information.

Thank you.


Solution

  • It looks like, instead of the PDF, the web server is sending you a web page intended to prevent bots from downloading data from the site.

    There is nothing wrong with your code as such, but if you still want to do this you'll have to work around the website's bot prevention.
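    For example, the interstitial page you pasted includes a `<noscript>` fallback link (`?no_js=1&next=…`) meant for clients without JavaScript. A minimal sketch that follows that link with a `requests.Session` (this assumes the site still honors the fallback and applies no stronger checks; the URL is the same placeholder as in your question):

    ```python
    import re
    from urllib.parse import urljoin

    import requests


    def no_js_fallback(page_url, page_html):
        # Find the no-JS fallback link on the "not-a-bot" page and
        # resolve it against the page's URL. Returns None if absent.
        m = re.search(r'href="(\?no_js=1[^"]+)"', page_html)
        return urljoin(page_url, m.group(1)) if m else None


    def download_pdf(url, out_path):
        # A Session keeps any cookies the interstitial page sets.
        with requests.Session() as s:
            r = s.get(url)
            if not r.content.startswith(b"%PDF"):
                fallback = no_js_fallback(r.url, r.text)
                if fallback is None:
                    raise RuntimeError("blocked, and no fallback link found")
                r = s.get(fallback)
            if not r.content.startswith(b"%PDF"):
                raise RuntimeError("still not a PDF; stronger bot checks may apply")
            with open(out_path, 'wb') as f:
                f.write(r.content)
    ```

    Also note that repeated automated downloads may be exactly what the site is trying to discourage; the case.law docs you linked describe API access, which may be a more reliable route to the case text than scraping the PDFs.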