
PHP explode() with a Unicode character as separator


Xpdf's pdftotext converts a PDF to text and can write the result to standard output on the command line. Where needed, it inserts page breaks between the pages, as specified in TextOutputDev.cc:

eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop));

This Unicode character is independent of the output encoding; -enc ASCII7 does not change it. I want to use PHP to convert the PDF file and split it into several text pages for database storage. The following loop works, but it takes twice as long as converting the whole PDF in one go.

for($i = 1; $i <= $pages[0]; $i++)
    $page[$i] = shell_exec('/usr/bin/pdftotext sample.pdf -f '.$i.' -l '.$i.' -');

How am I supposed to explode(0x0c, $wholePDF) with a Unicode character as the separator? Currently, $page[$i] does not seem to contain those Unicode page break characters from shell_exec(). I tried several encoding headers (UTF-8 in particular), but nothing has worked so far.


Solution

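  • 0x0c is an ASCII character (the form feed, in the range 0-127), and as such in UTF-8 encoding it is represented as itself, a single byte, and not as a multibyte sequence. Passing the integer 0x0c directly does not work, because PHP converts it to the string "12" before splitting. You should be able to explode(chr(0x0c), $wholePDF).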
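
As a minimal sketch of the suggestion above, the whole output can be split on the form feed in one pass. The helper name splitPages() and the sample string are assumptions made for illustration; in real use $wholePDF would come from something like shell_exec('/usr/bin/pdftotext sample.pdf -'). The sketch also assumes the output may end with a form feed after the last page and drops the resulting empty chunk.

```php
<?php
// Hypothetical helper (not part of PHP or Xpdf): split pdftotext output
// into pages using the form feed byte 0x0C as the separator.
function splitPages(string $text): array
{
    // chr(0x0c) yields the one-byte string "\f". Passing the integer
    // 0x0c itself would be converted to the string "12" instead.
    $pages = explode(chr(0x0c), $text);

    // If the text ends with a form feed, explode() leaves an empty
    // final element; remove it so each element is one page.
    if (end($pages) === '') {
        array_pop($pages);
    }
    return $pages;
}

// Stand-in for real pdftotext output (assumed sample data).
$wholePDF = "Page one\n" . chr(0x0c) . "Page two\n" . chr(0x0c);
$pages = splitPages($wholePDF);

foreach ($pages as $i => $text) {
    // Page numbers are 1-based, array keys are 0-based.
    echo 'Page ' . ($i + 1) . ': ' . trim($text) . PHP_EOL;
}
```

This runs pdftotext once for the whole document rather than once per page, which is the cost the loop in the question was paying.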