php html shell ubuntu pdf-to-html

Image always appears at the top of HTML converted from PDF


I am using the following code, and the contents of the PDF page convert correctly. However, if there is an image in the middle of the PDF page, that image appears at the top of the HTML.

PHP CODE:

umask(0);
$output = shell_exec('pdftohtml create.pdf create.html');

Edit:

Please check the PDF I used for this: https://www.dropbox.com/s/6uy9wq27ff00n0x/create.pdf?dl=0

In this PDF, the image comes after the first two lines.

// Load the converted HTML page. pdftohtml writes the page content to a file with an 's' suffix, creates.html

$html = file_get_contents('creates.html');
print_r($html);

// output

<!DOCTYPE html><html>
<head>
</head>
<body>
<img src="/var/www/html/pdf-sign/public/converted_path/create-1_1.png"/><br/>
Test document PDF&#160;<br/>&#160;<br/>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor&#160;<br/>in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat&#160;<br/>auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed&#160;<br/>velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero&#160;<br/>tempor. Donec quis augue quis magna condimentum lobortis. Quisque imperdiet ipsum vel&#160;<br/>magna viverra rutrum. Cras viverra molestie urna, vitae vestibulum turpis varius id.&#160;<br/>&#160; &#160;PLACEHOLDER &#160; &#160; &#160;<br/>nulla ac dolor. Maecenas urna elit, tincidunt in dapibus nec, vehicula eu dui. Duis lacinia&#160;<br/>fringilla massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur&#160;<br/>
suscipit felis eget condimentum. Cum sociis natoque penatibus et magnis dis parturient&#160;<br/>montes, nascetur ridiculus mus. Integer bibendum sagittis ligula, non faucibus nulla volutpat&#160;<br/>vitae. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. &#160;<br/>In aliquet quam et velit bibendum accumsan. Cum sociis natoque penatibus et magnis dis&#160;<br/>parturient montes, nascetur ridiculus mus. Vestibulum vitae ipsum nec arcu semper&#160;<br/>adipiscing at ac lacus. Praesent id pellentesque orci. Morbi congue viverra nisl nec rhoncus.&#160;<br/>Integer mattis, ipsum a tincidunt commodo, lacus arcu elementum elit, at mollis eros ante ac&#160;<br/>risus. In volutpat, ante at pretium ultricies, velit magna suscipit enim, aliquet blandit massa&#160;<br/>orci nec lorem. Nulla facilisi. Duis eu vehicula arcu. Nulla facilisi. Maecenas pellentesque&#160;<br/>volutpat felis, quis tristique ligula luctus vel. Sed nec mi eros. Integer augue enim, sollicitudin&#160;<br/>ullamcorper mattis eget, aliquam in est. Morbi sollicitudin libero nec augue dignissim ut&#160;<br/>consectetur dui volutpat. Nulla facilisi. Mauris egestas vestibulum neque cursus tincidunt.&#160;<br/>Donec sit amet pulvinar orci. &#160;<br/>Quisque volutpat pharetra tincidunt. Fusce sapien arcu, molestie eget varius egestas,&#160;<br/>faucibus ac urna. Sed at nisi in velit egestas aliquam ut a felis. Aenean malesuada iaculis nisl,&#160;<br/>ut tempor lacus egestas consequat. Nam nibh lectus, gravida sed egestas ut, feugiat quis&#160;<br/>dolor. Donec eu leo enim, non laoreet ante. Morbi dictum tempor vulputate. Phasellus&#160;<br/>ultricies risus vel augue sagittis euismod. Vivamus tincidunt placerat nisi in aliquam. Cras&#160;<br/>quis mi ac nunc pretium aliquam. Aenean elementum erat ac metus commodo rhoncus.&#160;<br/>
<hr/>
</body>
</html>

Notice that

<img src="/var/www/html/pdf-sign/public/converted_path/create-1_1.png"/>

comes immediately after the BODY tag. That means the image has moved to the top instead of appearing after the second line, where it sits in the PDF.


Solution

  • I ran into the same problem and found a solution. First, convert the PDF document to XML:

    $output = shell_exec('pdftohtml -xml create.pdf create.xml');
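
    shell_exec gives no exit status, so a failed conversion is easy to miss. A minimal sketch of a more defensive call is shown here. The helper names buildPdfToXmlCommand and convertPdfToXml are made up for illustration, and the is_file check assumes pdftohtml writes to exactly the output name given, as in the command above.

    ```php
    <?php
    // Hypothetical helper: builds the same command as above, with quoted arguments.
    function buildPdfToXmlCommand(string $pdfPath, string $xmlPath): string
    {
        return 'pdftohtml -xml ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($xmlPath);
    }

    // Hypothetical helper: runs the conversion and reports whether it appears to have worked.
    function convertPdfToXml(string $pdfPath, string $xmlPath): bool
    {
        // exec() fills $lines with the command output and $exitCode with its status.
        exec(buildPdfToXmlCommand($pdfPath, $xmlPath) . ' 2>&1', $lines, $exitCode);
        // Assumption: the XML ends up at $xmlPath, as in the answer's command.
        return $exitCode === 0 && is_file($xmlPath);
    }
    ```

    Calling convertPdfToXml('create.pdf', 'create.xml') would then return false when pdftohtml is missing or the PDF cannot be read.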
    

    The XML output looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml producer="poppler" version="0.33.0">
    <page number="1" position="absolute" top="0" left="0" height="1262" width="892">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>
    <text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>
    </page>
    </pdf2xml>
    

    Then load the XML file and convert its contents into an object:

    $xmlContent = file_get_contents('create.xml');
    $xml = simplexml_load_string($xmlContent);
    

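    Next, index every positioned element by its top attribute (then left), so that the real position of the image is known:

    $all_attribute = [];
    $pg = 0;
    foreach ($xml->page as $page) {
        foreach ($page as $e) {
            if ($e->getName() == 'fontspec') {
                continue; // fontspec elements have no position
            }
            // Zero-padded "top-left" key, so elements on the same line do not overwrite each other
            $key = sprintf('%06d-%06d', (int)$e['top'], (int)$e['left']);
            $all_attribute[$pg][$key] = $e;
        }
        $pg++;
    }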
    

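    Once all the elements are indexed, sort each page by array key. Iterate by reference, otherwise ksort only sorts a copy:

    foreach ($all_attribute as &$page) {
        ksort($page);
    }
    unset($page); // drop the reference so later loops can reuse $page safely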
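
    Sorting inside a foreach only changes the outer array when the loop variable is a reference. A minimal sketch of the difference, using made-up data in place of the XML elements:

    ```php
    <?php
    // Made-up data: one page whose elements are keyed by their "top" value.
    $pages = [[117 => 'image', 57 => 'text']];

    // By value: ksort() sorts a copy, so $pages is left unchanged.
    foreach ($pages as $page) {
        ksort($page);
    }
    $byValue = array_keys($pages[0]);     // still [117, 57]

    // By reference: ksort() sorts the array stored in $pages.
    foreach ($pages as &$page) {
        ksort($page);
    }
    unset($page); // drop the dangling reference
    $byReference = array_keys($pages[0]); // [57, 117]
    ```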
    

    Once the elements are sorted by their top value, build the HTML from the sorted array, for example:

    foreach ($all_attribute as $page) {
        foreach ($page as $p) {
            if ($p->getName() == 'image') {
                echo '<img width="'.$p['width'].'" height="'.$p['height'].'" src="'.$p['src'].'">';
            } elseif ($p->getName() == 'text') {
                echo htmlspecialchars((string)$p).'<br/>';
            }
        }
    }
    

    I hope this helps.
    You can also handle the text fonts.
    The XML stores every font in a fontspec element and gives it an id:

    <fontspec id="0" size="16" family="Times" color="#000000"/>
    

    and each text element refers to that id in its font attribute:

    <text top="57" left="99" width="144" height="16" font="0">
    

    Using those values, collect the fonts like this:

    $font = [];
    foreach ($xml->page as $page) {
        foreach ($page as $e) {
            if ($e->getName() == 'fontspec') {
                $font[(int)$e['id']]['family'] = (string)$e['family'];
                $font[(int)$e['id']]['size'] = (string)$e['size'];
                $font[(int)$e['id']]['color'] = (string)$e['color'];
            }
        }
    }
    

    After that, apply the font to each text element when writing the HTML:

    foreach ($xml->page as $page) {
        foreach ($page as $p) {
            if ($p->getName() == 'text') {
                $ind = (int)$p['font'];
                $font_size = $font[$ind]['size'];
                $font_color = $font[$ind]['color'];
                $font_family = $font[$ind]['family'];
                echo '<span style="font-size:'.$font_size.'px;color:'.$font_color.';font-family:'.$font_family.'; font-weight: 900;">'.(string)$p.'</span>';
            }
        }
    }
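
    Putting the steps together, here is a minimal end-to-end sketch. It is not the exact code above: collectFonts and renderPage are hypothetical names, the sort uses usort on (top, left) instead of array keys, and the sample XML is trimmed from the output shown earlier.

    ```php
    <?php
    // Hypothetical helper: gathers every fontspec in the document, keyed by its id.
    function collectFonts(SimpleXMLElement $xml): array
    {
        $fonts = [];
        foreach ($xml->page as $page) {
            foreach ($page->fontspec as $f) {
                $fonts[(int)$f['id']] = [
                    'family' => (string)$f['family'],
                    'size'   => (string)$f['size'],
                    'color'  => (string)$f['color'],
                ];
            }
        }
        return $fonts;
    }

    // Hypothetical helper: renders one page, emitting elements from top to bottom.
    function renderPage(SimpleXMLElement $page, array $fonts): string
    {
        $elements = [];
        foreach ($page->children() as $e) {
            if ($e->getName() !== 'fontspec') { // fontspec has no position
                $elements[] = $e;
            }
        }
        // Order by top, then left, so elements on the same line keep their order.
        usort($elements, function ($a, $b) {
            return [(int)$a['top'], (int)$a['left']] <=> [(int)$b['top'], (int)$b['left']];
        });

        $html = '';
        foreach ($elements as $e) {
            if ($e->getName() === 'image') {
                $html .= sprintf('<img width="%d" height="%d" src="%s"/><br/>' . "\n",
                    (int)$e['width'], (int)$e['height'], htmlspecialchars((string)$e['src']));
            } elseif ($e->getName() === 'text') {
                // Assumed fallback for a missing fontspec; these values are arbitrary.
                $f = $fonts[(int)$e['font']] ?? ['family' => 'serif', 'size' => '16', 'color' => '#000000'];
                // Note: the (string) cast keeps only direct text, not nested <b>/<i> content.
                $html .= sprintf('<span style="font-size:%spx;color:%s;font-family:%s;">%s</span><br/>' . "\n",
                    htmlspecialchars($f['size']), htmlspecialchars($f['color']),
                    htmlspecialchars($f['family']), htmlspecialchars((string)$e));
            }
        }
        return $html;
    }

    $xml = simplexml_load_string('<pdf2xml>
        <page number="1" position="absolute" top="0" left="0" height="1262" width="892">
        <fontspec id="0" size="16" family="Times" color="#000000"/>
        <image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>
        <text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>
        </page>
        </pdf2xml>');

    $fonts = collectFonts($xml);
    foreach ($xml->page as $page) {
        echo renderPage($page, $fonts);
    }
    ```

    With this sample, the text element (top=57) is written before the image (top=117), which matches the order in the PDF.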