exist-dbxquery-3.1

exist-db how to access a pdf


I am sure it is very simple ... I just cannot get my head around this... the exist-db Documentation is a bit fuzzy on content extraction... http://exist-db.org/exist/apps/doc/contentextraction.

I have a pdf-file, containing of about 162 high-res images (the pdf is quite big ...) and I do not know how to access any of the that are presumably created ...

please do not destroy me! I am just starting to build a database (for an Edition at Uni)I'd love to have a facsimile edition (so one Tab with the image-file and one tab with the transcribed texts)

I aim at doing something similar to what Heidelberg Universitdy did with the "Welsche Gast Digital" http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image (the choosen image is just an example! ) This pic When clicking on faksimile the Scan opens and when clicking on Transkription the transcribed texts open!

I am quite new to Xquery, Xpath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcritpion etc, I fear I'll have to spend quite some time on this issue ... (it is not about doing my job for me, it's just about pointing me in the right direction)


Solution

  • I m afraid the short answer is simply don't.

    Storing a pdf in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the pdf), and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.

    You might want to take a look at tei-publisher for creating digital edition in exist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I m afraid its all a bit more involved then just opening a pdf in a browser, but so is the Welsche Gast Digital