pdf, curl, webharvest

Is it possible to retrieve a single page from a PDF document via a GET request?


I need to migrate a digital repository to a new platform, but I lack access to the old platform, so I have resorted to retrieving the objects over the web.

Some objects contain other objects. For most objects of this type, identifying and retrieving the components and their metadata is straightforward. But for some PDF files, the components appear to be references to individual pages within a single file rather than separate files.

For example, http://content.wwu.edu/cdm4/document.php?CISOROOT=/wfront&CISOPTR=2711 gives me an object with 4 pages. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711&CISOMODE=print lets me retrieve the entire document. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711 returns an XML document listing the identifiers of the component pages, but when I try to curl those identifiers, I get zero-length documents. When I use the same method on non-PDF documents, I get actual files, which is why I think the PDF components are pages within a single file rather than separate files.

How can I retrieve the individual pages? I must store them as individual objects in the new platform. Thanks.


Solution

  • The bottom line is that this appears to be possible only if something on the server extracts individual pages for you.

    When I turned on Wireshark, I found that actions in the user interface invoked a server-side PDF application using the syntax:

    http://content.wwu.edu/cgi-bin/showpdf.exe?CISOROOT=/wfront&CISOPTR=2711&CISOPAGE=3

    where 2711 identifies the object and 3 is the page number within the file. Further experimentation showed that I could pull up any page of any PDF I could identify.

    For anyone else with a similar problem, Wireshark is your friend.
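
The showpdf.exe URL pattern above can be scripted to harvest each page of an object. The following is a minimal sketch, not a tested harvester: the function names `page_url` and `fetch_pages` are hypothetical, and it assumes that pages are numbered from 1, that the page count is already known (for example from the XML returned by showfile.exe), and that the server returns the page content directly in the response body.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint observed in the Wireshark capture described above.
SHOWPDF = "http://content.wwu.edu/cgi-bin/showpdf.exe"


def page_url(root, ptr, page, endpoint=SHOWPDF):
    """Build the URL for one page of one object (hypothetical helper)."""
    # safe="/" keeps the slash in CISOROOT unescaped, matching the observed URL.
    query = urlencode(
        {"CISOROOT": root, "CISOPTR": ptr, "CISOPAGE": page}, safe="/"
    )
    return f"{endpoint}?{query}"


def fetch_pages(root, ptr, page_count, fetch=None):
    """Return {page_number: bytes} for pages 1..page_count.

    Assumes 1-based page numbering. `fetch` is any callable that maps a
    URL to bytes; it defaults to a plain HTTP GET.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=30).read()
    pages = {}
    for page in range(1, page_count + 1):
        data = fetch(page_url(root, ptr, page))
        if not data:
            # A zero-length body is the failure mode described in the question.
            raise ValueError(f"empty response for object {ptr}, page {page}")
        pages[page] = data
    return pages
```

Calling `fetch_pages("/wfront", 2711, 4)` would request the four pages of the example object; each value in the returned dictionary can then be written to its own file and loaded into the new platform as a separate object.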