selenium python-requests apache-tika

Unable to access pdf document via requests or selenium


I have a huge list of URLs and each one loads a different PDF document. This is one of them: https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0

It will most likely open the website's home page on the first try, but if you paste the link again it will open a PDF document.

I'm trying to write a Python script that downloads those documents locally so I can extract their content with tika, but the way the site opens the home page on the first request is throwing a wrench into everything I try.

1. I tried requests but, as expected, it just returns the HTML content of the home page:

import requests
from tika import parser

link = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"

resp = requests.get(link)

# save the response body, whatever it turns out to be
with open('metadata.pdf', 'wb') as f:
    f.write(resp.content)

# extract plain text from the saved file with tika
raw = parser.from_file('metadata.pdf', xmlContent=False)

print(raw['content'])

Output:
\n\n\n\n\n\n\n\n\n\n    \n    \t\t\n\n\t\tSkip to Main Content\xa0\xa0\xa0\xa0Logout\xa0\xa0\xa0\xa0My 
Account\xa0\xa0\xa0\xa0\t\t\tHelp\n\n\n\n\n\n\n\t\t\t\nSelect a location\nPinellas County\n\n\xa0\nAll Case
 Records Search\nCivil, Family Case Records\nCriminal & Traffic Case Records\nProbate Case Records\nCourt 
Calendar\n\nAttorney Login\nRegistered User Login\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\n\t\t\
t\xa0\t\n\t\n\t\tClerk of the Circuit Court|Mortgage Foreclosure Sales|Pinellas County Government|Pinellas
 County Sheriff's Office|Public Defender|Sixth Judicial Circuit|State of Florida|State Attorney|Self Help
 Center|Court Forms|How-To Videos|Florida Courts eFiling Portal Video|Attorney Account Setup|Reports and
 Statistics|Terms of Use|Contact UsCopyright 2003 Tyler Technologies. All rights Reserved.\n\t\n\n\n\n 
   \n 
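One way to catch this case early is to check whether the response body is actually a PDF before saving it. The following is a minimal sketch, assuming a PDF payload begins with the `%PDF-` signature; `looks_like_pdf` is a hypothetical helper name and not part of requests or tika.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    # Only inspect the head of the payload and tolerate leading whitespace.
    return content[:1024].lstrip().startswith(PDF_SIGNATURE)


# Illustrative payloads (made up for this example)
html_page = b"<!DOCTYPE html><html><body>Skip to Main Content</body></html>"
pdf_head = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

print(looks_like_pdf(html_page))  # → False
print(looks_like_pdf(pdf_head))   # → True
```

With requests, this could be called as `looks_like_pdf(resp.content)` before writing `metadata.pdf`, so that a home-page response is not silently saved as a PDF.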

2. I tried opening the home page with Selenium and transferring the cookies from the webdriver to requests, following this answer.

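import requests
from selenium import webdriver

url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"

driver = webdriver.Chrome()  # assumption: any Selenium webdriver would do here
driver.get(url)  # the first visit lands on the home page

# copy the browser's cookies into a requests session
cookies = driver.get_cookies()

s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])

resp = s.get(url)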

It did not work, and when I checked the CookieJar of the response object, it was empty. I admit I have very little understanding of how cookies work, and this was a desperate attempt. What am I misunderstanding here? I appreciate any input.
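For reference, `resp.cookies` in requests only holds the cookies set by that particular response, while cookies added to the session live in `s.cookies`, so an empty jar on the response does not by itself show that the transfer failed. Below is a minimal sketch of copying the cookies with their domain and path preserved. `copy_cookies` is a hypothetical helper, and the `session` argument is assumed to be a `requests.Session` (anything with a compatible `cookies.set` method would work).

```python
def copy_cookies(selenium_cookies, session):
    """Copy Selenium cookie dicts into a requests-style session.

    selenium_cookies: list of dicts as returned by driver.get_cookies()
    session: assumed to be a requests.Session
    """
    for cookie in selenium_cookies:
        extra = {}
        # Pass domain and path only when Selenium actually reported them.
        if cookie.get("domain"):
            extra["domain"] = cookie["domain"]
        if cookie.get("path"):
            extra["path"] = cookie["path"]
        session.cookies.set(cookie["name"], cookie["value"], **extra)
    return session
```

Usage would look like `s = copy_cookies(driver.get_cookies(), requests.Session())`, followed by `s.get(url)`.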

3. My last resort (for obvious reasons) was to open each document via webdriver and download the content, but even this did not work.


# open a new window and make it the working window
def open_window(driver, link):
    driver.execute_script(f"window.open('{link}')")
    new_window = driver.window_handles[-1]
    driver.switch_to.window(new_window)

url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"

driver.get(url)

open_window(driver, url)

# print the source of the new window
print(driver.page_source)

The output is just this:

<html><head></head><body></body></html>

Solution

  • After a little more tinkering, approach #2 worked. Instead of taking the cookies from the driver right after loading only the home page, I had the browser run a query first (with a few extra steps specific to this website) and then used the cookies. They look like this:

    [{'domain': 'ccmspa.pinellascounty.org',
      'expiry': 1670679832,  # when the cookie expires, as a Unix epoch timestamp
      'httpOnly': True,
      'name': '.ASPXFORMSPUBLICACCESS',
      'path': '/',
      'secure': True,
      'value': '1DBB1EADBA199D246E84CCE7243202DCA6BBD7E383FE360ECBFC2E6150102C79F3EC2F6B232B85589C51976AF20EF7EBDF52CF74122A7A6E78B4C6F31434C58AB57E10005C41DE019814B704F12B150A0818585E85F0237EFCF1A11B205414325CA1850605FF932BC43CC5B36395488F40D58DA594899C4D62FF3ECCBE729C6BC001194225B6653CB89C1305C7FBCB26E1BCFCFF75476784D24ADFCA0AFF679A3BAA3131'},
     {'domain': 'ccmspa.pinellascounty.org',
      'httpOnly': True,
      'name': 'ASP.NET_SessionId',
      'path': '/',
      'secure': True,
      'value': '24552pqtb1tomjbw2gkzko55'},
     {'domain': 'ccmspa.pinellascounty.org',
      'httpOnly': False,
      'name': 'EDLFDCVM',
      'path': '/',
      'sameSite': 'None',
      'secure': True,
      'value': '02282de498-9595-48s0hGpl59SkUKRZpRrS_b1TKJfXlz_3dGN9xGZ2tcTXrHuDsR5rN90I_Rp192pX48C1k'}]
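Since the `.ASPXFORMSPUBLICACCESS` cookie carries an `expiry` timestamp, a long download run over many URLs may outlive it. Here is a minimal sketch of checking for that before reusing the cookies. `expired_cookies` is a hypothetical helper, and cookies without an `expiry` key are assumed to be session cookies that do not expire on a timer.

```python
import time


def expired_cookies(cookies, now=None):
    """Return the names of cookies whose 'expiry' epoch time has passed."""
    if now is None:
        now = time.time()
    return [c["name"] for c in cookies if "expiry" in c and c["expiry"] <= now]


# Trimmed-down versions of the cookie dicts shown above
cookies = [
    {"name": ".ASPXFORMSPUBLICACCESS", "expiry": 1670679832},
    {"name": "ASP.NET_SessionId"},  # no expiry key: treated as a session cookie
]

print(expired_cookies(cookies, now=1670679831))  # → []
print(expired_cookies(cookies, now=1670679832))  # → ['.ASPXFORMSPUBLICACCESS']
```

If the returned list is non-empty, the browser steps would need to be repeated to obtain fresh cookies before continuing the downloads.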