perl cgi xpdf

XPDF pdftotext and page number handling


I am using Perl to call pdftotext to extract text from a PDF, and it works well. My issue is that the PDFs I am reading are multi-page, and I am looking for data on specific lines at the top of each page. The following code dumps the entire contents of both pages into a single output. Because the length of the data that follows the constant data at the top of the page varies, I can't reliably locate my data on page 2. How can I step through each page, either with pdftotext or some other utility or module, and then call pdftotext on each page individually?

#!/usr/bin/perl
print "Content-type: text/html\n\n";

print "\n<style>
div.line {width:100%;white-space:nowrap;}
div.line div {width:80px;float:left;}
</style>";

my $i=0;
open FILE, "pdftotext -layout my_multi_page_pdf.pdf - |";

while (<FILE>) {

    $i++;
    my ($line) = $_;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}
close FILE;

Solution

  • use strict;
    use warnings;
    
    my $i       = 0;
    my $pageNum = 1;
    
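    # Three-argument pipe open: read pdftotext's standard output line by line.
    open my $fh, '-|', "pdftotext -layout multipage.pdf -"
        or die "Cannot run pdftotext: $!";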
    print "---------- Begin Page $pageNum ----------\n";
    
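    while ( my $line = <$fh> ) {
        # pdftotext normally ends each page with a form feed (\f, 0x0C),
        # so a line containing one marks the start of the next page.
        # The form feed after the last page also triggers this branch.
        if ( $line =~ /\f/ ) {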
            print "\n---------- End Page $pageNum ----------\n";
            $pageNum++;
            print "---------- Begin Page $pageNum ----------\n";
        }
    
        $i++;
        print "\n<div class=\"line\"><div>$i</div>$line</div>";
    }
    
    close $fh;
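
If only the first few lines of each page are needed, the same form-feed idea can be applied to the whole output at once. The following is a minimal sketch, not a definitive implementation: `split_pages` and `top_lines` are hypothetical helper names, and the sample string stands in for real pdftotext output (in practice it might come from something like `` `pdftotext -layout multipage.pdf -` ``). It assumes pdftotext emits a form feed after every page, which is its usual behavior unless page breaks are disabled.

```perl
use strict;
use warnings;

# Split pdftotext output into one string per page.
# Assumption: each page ends with a form feed (\f). Perl's split drops
# trailing empty fields, so the final form feed adds no empty page.
sub split_pages {
    my ($text) = @_;
    return split /\f/, $text;
}

# Return, for each page, an array reference holding its first $count lines
# (fewer if the page is shorter).
sub top_lines {
    my ( $text, $count ) = @_;
    my @result;
    for my $page ( split_pages($text) ) {
        my @lines = split /\n/, $page;
        my $last  = $#lines < $count - 1 ? $#lines : $count - 1;
        push @result, [ @lines[ 0 .. $last ] ];
    }
    return @result;
}

# Made-up sample standing in for real pdftotext output.
my $text = "Invoice 1001\nDate 2020-01-01\nbody text\n\f"
         . "Invoice 1002\nDate 2020-01-02\n\f";

my $pageNum = 0;
for my $top ( top_lines( $text, 2 ) ) {
    $pageNum++;
    print "Page $pageNum: ", join( ' | ', @$top ), "\n";
}
```

Alternatively, pdftotext accepts `-f` and `-l` options for the first and last page to convert, so passing the same number to both (for example `-f 2 -l 2`) extracts a single page. The page count can be read from the `Pages:` line printed by pdfinfo, which ships with the same toolset.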