I am new to using pdfparser to extract text from single-page, unencrypted PDFs. An example is here. (I'll get back to this file in a minute.) A simple page scanner is used to create these PDFs, so the "text" on the page is an image rather than workable text. I expected this.
I would like to write a PHP script to accomplish this, and I plan to use pdfparser, as it is the only PDF library processor I could find for PHP that reads PDF content. (Corrections to this are welcomed.) In general, the idea is to:
I have done this successfully for a few files, but for some (like the provided example) the image is actually stored upside-down. (In case you don't have tools to see that, this file is the extracted, untouched image.)
Clearly, tools that end up displaying this image to the user can understand and correct for the orientation, but I have been unable to locate any properties using pdfparser that will allow me to detect this. I have found Page Width and Height, and those values seem to be alike on (for lack of better term) "rightside-up" and "upside-down" images. If I could detect the vertical orientation, I could flip this image in memory as necessary and tesseract can then process it. (As a proof-of-concept, this code now does that, but unconditionally.) I am not terribly concerned about performance - this will be done only at the rate of, at most, two or three per week.
My very basic code, which does work with this file (only because I forced it to), follows. If you run this with ROTATE_IMAGE defined as false, you will see what happens without rotation. It's pretty but not useful.
#!/usr/bin/env php
<?php
require_once("pdfparser-master/alt_autoload.php-dist") ;
require_once('lovefunctions.php') ;
// In a coming version, these will be derived from command-line switches and ini settings
$pdfFile = "/Main-HDDs/GoogleDrive/Other/so-pdf-orientation.pdf" ;
$song = "TheSong" ;
$songFolder = "/Main-HDDs/GoogleDrive/Other" ;
define('TESSERACT_LOCATION', "/usr/bin/tesseract") ;
$tempFolder = sys_get_temp_dir() ;
$tempFolder = getcwd() ;
define('ROTATE_IMAGE', true) ; //<- This needs to be dynamically set
$parser = new \Smalot\PdfParser\Parser() ;
$pdf = $parser->parseFile($pdfFile) ;
$text = $pdf->getText() ; // <-- either empty or useless
$images = $pdf->getObjectsByType('XObject', 'Image') ;
foreach ($images as $key => $image) {
    $imageName = "{$tempFolder}/{$song}-{$key}.jpg" ;
    $pageName = "{$songFolder}/{$song}-{$key}" ;
    // Remove leftovers from a previous run
    if (file_exists($imageName))
        unlink($imageName) ;
    if (file_exists($pageName))
        unlink($pageName) ;
    // Write the raw embedded image to disk
    file_put_contents($imageName, $image->getContent()) ;
    if (ROTATE_IMAGE) { // Quick-and-dirty page rotation
        $command = sprintf("convert -rotate 180 %s a.jpg", escapeshellarg($imageName)) ;
        exec($command, $output, $rc) ;
        rename("a.jpg", $imageName) ;
    }
    // Run OCR; tesseract appends ".txt" to the output base name
    $command = sprintf("%s %s %s", TESSERACT_LOCATION, escapeshellarg($imageName), escapeshellarg($pageName)) ;
    $output = [] ;
    exec($command, $output, $rc) ;
    if (file_exists("{$pageName}.txt")) {
        rename("{$pageName}.txt", "{$pageName}.pro") ;
        echo (file_get_contents("{$pageName}.pro")) ;
    }
}
I realize the problem that hand-drawing/handwriting will create with this solution, but it gives me a great starting point for what I'm trying to accomplish.
So any advice regarding how to detect that upside-down image would help me significantly.
EDIT: I neglected to mention that the docs include a section entitled Extract Text Positions, which mentions rotation, and the specific code to get that attribute:
$data = $pdf->getPages()[0]->getDataTm();
But for these files, this always returns an empty array, so no apparent help there.
EDIT - purpose: In this question, I tried to be clear that this is to be an automated procedure, and I specifically asked for help with pdfparser. While uploads to websites may provide some technical details, I am not interested in those. The question in a nutshell is how to programmatically determine the orientation of an image in a PDF. I am not looking for a degree in PDF details.
Here is a simple way to test whether the first cm (transformation matrix) command on a page starts with a negative component. It uses the extractRawData() method, which returns the list of commands contained in a page:
require 'vendor/autoload.php';
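function hasFlippedImage($pdf, $pageNo = 0)
{
    $page = $pdf->getPages()[$pageNo];
    foreach ($page->extractRawData() as $command)
    {
        // 'o' is the operator, 'c' is its operand string
        if ($command['o'] === 'cm')
            // A leading '-' means the first matrix component is negative
            return $command['c'][0] === '-';
    }
    // No cm command found on this page
    return false;
}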
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('so-pdf-orientation.pdf');
var_dump(hasFlippedImage($pdf));
Output:
bool(true)
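If you need more than a yes/no answer, the same operand string can be parsed into its six numbers and classified. The following is only a sketch: classifyCmOperands() is a hypothetical helper (not part of pdfparser), it assumes that $command['c'] is a whitespace-separated string of the six cm operands (as the check above implies), and it looks at a single matrix in isolation, so it will not be right if a page concatenates several cm commands. The operand values in the usage lines are made up for illustration.

```php
<?php
// Hypothetical helper, not part of pdfparser: classify one cm operand
// string of the form "a b c d e f" by the signs of a, b, c and d.
function classifyCmOperands(string $operands): string
{
    $parts = preg_split('/\s+/', trim($operands));
    if (count($parts) !== 6) {
        return 'unknown';
    }
    foreach ($parts as $part) {
        if (!is_numeric($part)) {
            return 'unknown';
        }
    }
    // Only the first four numbers describe scaling/rotation;
    // e and f are the translation and are ignored here.
    [$a, $b, $c, $d] = array_map('floatval', $parts);
    $isZero = fn(float $v): bool => abs($v) < 1e-9;

    if ($isZero($b) && $isZero($c)) {
        if ($a > 0 && $d > 0) return 'upright';
        if ($a < 0 && $d < 0) return 'rotated-180';
        if ($a < 0 && $d > 0) return 'mirrored-horizontally';
        if ($a > 0 && $d < 0) return 'flipped-vertically';
    }
    if ($isZero($a) && $isZero($d)) {
        // PDF rotation by t counterclockwise is [cos t, sin t, -sin t, cos t]
        if ($b > 0 && $c < 0) return 'rotated-90-ccw';
        if ($b < 0 && $c > 0) return 'rotated-90-cw';
    }
    return 'other';
}

// Usage with made-up operand strings:
echo classifyCmOperands('612 0 0 792 0 0'), PHP_EOL;       // upright
echo classifyCmOperands('-612 0 0 -792 612 792'), PHP_EOL; // rotated-180
```

In hasFlippedImage() you could then return classifyCmOperands($command['c']) instead of the boolean, and only run the 180-degree rotation when the result is 'rotated-180'.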