I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates) from a PDF, but I have not found a way to do this.
This is what I have so far:
    #include <iostream>
    #include <stdio.h>
    #include <vector>
    #include <podofo/podofo.h>

    using namespace PoDoFo;
    using namespace std;

    int main( int argc, char* argv[] )
    {
        const char* filename = "hello.pdf";
        PdfVecObjects *x = new PdfVecObjects();
        PdfParser parser(x, filename);
        parser.ParseFile("hello.pdf");
        for (TIVecObjects obj = x->begin(); obj != x->end(); obj++){
            PdfObject * a = x->RemoveObject(obj);
            // THIS IS MY PROBLEM VVVVVVVVVV
            cout << a->Reference().ToString() << endl;
        }
        return 0;
    }
However, this only gives me very basic information (it seems to be just the object number):
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
1 0 R
2 0 R
3 0 R
4 0 R
5 0 R
6 0 R
7 0 R
8 0 R
9 0 R
10 0 R
11 0 R
I want to print out the coordinates of each object and whether it's a line or text. If it's text, I would also like to print out the text itself. Does anyone who knows this library better than I do know how I could do this?
This answer will show you how to extract the text.
To get text positioning information, you will also have to process the following commands: Tc, Tw, Tz, TL, T*, Tr and Tm.
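As a rough illustration of what "processing" these commands means, here is a minimal sketch of a text state that gets updated as operators are read from a content stream. It is not PoDoFo code: the names TextState and applyTextOperator are made up for this example, and it assumes the operands have already been parsed into doubles by whatever tokenizer you use. The semantics (Tm resets both the text matrix and the line matrix, T* moves down by the current leading) follow my reading of the PDF spec, so check them against the text chapter.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// A PDF matrix [a b c d e f] stands for the 3x3 matrix
//   | a b 0 |
//   | c d 0 |
//   | e f 1 |
using Matrix = std::array<double, 6>;

// Hypothetical container for the text state parameters (not a PoDoFo type).
struct TextState {
    double charSpacing = 0.0;    // set by Tc
    double wordSpacing = 0.0;    // set by Tw
    double horizScale  = 100.0;  // set by Tz, in percent
    double leading     = 0.0;    // set by TL
    int    renderMode  = 0;      // set by Tr
    Matrix textMatrix{{1, 0, 0, 1, 0, 0}};  // set by Tm, updated by T*
    Matrix lineMatrix{{1, 0, 0, 1, 0, 0}};  // start of the current line
};

// Applies one operator with its already-parsed numeric operands.
// Returns false if the operator is not handled here or the operand
// count does not match.
bool applyTextOperator(TextState& ts, const std::string& op,
                       const std::vector<double>& args)
{
    if (op == "Tc" && args.size() == 1) {
        ts.charSpacing = args[0];
    } else if (op == "Tw" && args.size() == 1) {
        ts.wordSpacing = args[0];
    } else if (op == "Tz" && args.size() == 1) {
        ts.horizScale = args[0];
    } else if (op == "TL" && args.size() == 1) {
        ts.leading = args[0];
    } else if (op == "Tr" && args.size() == 1) {
        ts.renderMode = static_cast<int>(args[0]);
    } else if (op == "Tm" && args.size() == 6) {
        for (std::size_t i = 0; i < 6; ++i) {
            ts.textMatrix[i] = args[i];
        }
        ts.lineMatrix = ts.textMatrix;
    } else if (op == "T*" && args.empty()) {
        // Same as "0 -TL Td": translate the line matrix by (0, -leading)
        // in text space, then restart the text matrix from it.
        const double ty = -ts.leading;
        ts.lineMatrix[4] += ty * ts.lineMatrix[2];
        ts.lineMatrix[5] += ty * ts.lineMatrix[3];
        ts.textMatrix = ts.lineMatrix;
    } else {
        return false;
    }
    return true;
}
```

You would call applyTextOperator once per operator as you walk the page's content stream, and read the current position out of textMatrix whenever a text-showing operator comes up.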
You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.
You will also need a bit of linear algebra, mainly multiplying transformation matrices and applying them to points. Nothing too complicated, though.
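To give an idea of how little linear algebra is involved, here is a sketch of the two operations you end up using most: multiplying two PDF matrices and transforming a point. The names Matrix, Point, multiply and transform are my own for this example, and the code assumes the six-number [a b c d e f] form with the row-vector convention used in the PDF spec.

```cpp
#include <array>

// A PDF matrix [a b c d e f] stands for the 3x3 matrix
//   | a b 0 |
//   | c d 0 |
//   | e f 1 |
using Matrix = std::array<double, 6>;

struct Point {
    double x;
    double y;
};

// Returns m x n, i.e. the transformation that applies m first and then n
// (points are row vectors, so they are multiplied from the left).
Matrix multiply(const Matrix& m, const Matrix& n)
{
    return Matrix{{
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5]
    }};
}

// Maps a point through the matrix: [x y 1] x M.
Point transform(const Point& p, const Matrix& m)
{
    return Point{
        p.x * m[0] + p.y * m[2] + m[4],
        p.x * m[1] + p.y * m[3] + m[5]
    };
}
```

For example, multiplying the current text matrix by the current transformation matrix and transforming the point (0, 0) through the result gives the position on the page where the next glyph starts.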
Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example: I ran across a document which set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.
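The one-point font case can be sketched like this. I am assuming here that the scaling arrives through the text matrix (for example a font size of 1 combined with "12 0 0 12 x y Tm"); other documents may do it differently, and effectiveFontSize is just a name I picked for the example. It also ignores the current transformation matrix, which you would multiply in first if the page scales its content.

```cpp
#include <array>
#include <cmath>

// A PDF matrix [a b c d e f]; c and d describe where the text-space
// y axis ends up after the transformation.
using Matrix = std::array<double, 6>;

// Returns the font size as it actually appears: the nominal size from Tf
// multiplied by the vertical scale of the text matrix. The vertical scale
// is the length of the transformed unit y vector (c, d), which also works
// for rotated text.
double effectiveFontSize(double nominalSize, const Matrix& textMatrix)
{
    const double verticalScale = std::hypot(textMatrix[2], textMatrix[3]);
    return nominalSize * verticalScale;
}
```

With a nominal size of 1 and a text matrix of [12 0 0 12 72 700], this reports 12, which is the size you want to base your spacing calculations on.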