php pdf coordinates pdftotext pdfparser

How to get the coordinates of each piece of content in a PDF file?


I use Smalot\PdfParser to extract content from a PDF. As a beginner, I've been experimenting with basic functions like getText(), getDetails(), getPages(), etc., and I noticed this output from dd($page->getDataTm()):

  0 => array:4 [
    0 => array:6 [
      0 => "1.00055"
      1 => "0"
      2 => "0"
      3 => "1"
      4 => "70.8"
      5 => "760.24"
    ]
    1 => " "
    2 => "R8"
    3 => "12"
  ]
  1 => array:4 [
    0 => array:6 [
      0 => "1.00055"
      1 => "0"
      2 => "0"
      3 => "1"
      4 => "70.8"
      5 => "745.72"
    ]
    1 => "Column1  Column2  Column3 "
    2 => "R10"
    3 => "12"
  ]

So $data[i][0] must hold the coordinates I need, but I don't know which values are X and Y or how to extract them.

Here is the code:

use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

/* ... */

    protected function getCoordinates($pdfPath)
    {
        // get font details by config
        $config = new Config();
        $config->setDataTmFontInfoHasToBeIncluded(true);
        // get PDF structure
        $parser = new Parser([], $config);
        $pdf = $parser->parseFile($pdfPath);
        $coordinates = [];
        //dd($pdf->getPages()[1]->getDataTm());

        foreach ($pdf->getPages() as $page)
        {
            $page->getDataTm();
            $text = $page->getText();
            //$coordinates = ; // This is where I want to extract it
        }
        return $coordinates;
    }

Here is the sample content I can copy from the PDF:

Column1 Column2 Column3 L1C1 L1C2 L1C3 L2C1 L2C2 L2C3 L3C1 L3C2 L3C3 L4C1 L4C2 L4C3

The PDF contains a table with borders. This is the output I expect after extracting it to a .txt file:

[Page : 1, width = 1, height = 2]

[x:0, y:3, w: 4, h:5]Column1 Column2 Column3

[x:6, y:7, w: 8, h:5]L1C1 L1C2 L1C3

[x:6, y:7, w: 8, h:5]L2C1 L2C2 L2C3

[x:6, y:7, w: 8, h:5]L3C1 L3C2 L3C3

[x:6, y:7, w: 8, h:5]L4C1 L4C2 L4C3

The numbers 1 to 8 are placeholders I made up based on the output of dd($pdf->getPages()[1]->getDataTm());. Some values there are identical, which is why I reuse the same placeholder in several places. The PDF also has more than one page.


Solution

  • To get the page width and height I use $page->getDetails(), because $page->getDataTm() doesn't include them. Here is the code:

    protected function getCoordinates($pdfPath)
    {
        $config = new Config();
        // add configs stuff
        $parser = new Parser([], $config);
        $pdf = $parser->parseFile($pdfPath);
        $coordinates = [];
        $currentPage = 1;

        foreach ($pdf->getPages() as $page)
        {
            // MediaBox is [x0, y0, x1, y1]; with the origin at (0, 0),
            // indexes 2 and 3 are the page width and height
            $details = $page->getDetails();
            $coordinates[] = "\n[Page : $currentPage, width = {$details['MediaBox'][2]}, height = {$details['MediaBox'][3]}]";
            foreach ($page->getDataTm() as $data)
            {
                // $data[0] is the text matrix [a, b, c, d, e, f]: e is X, f is Y
                $x = $data[0][4];
                // PDF measures Y from the bottom of the page, so flip it to count from the top
                $y = $details['MediaBox'][3] - $data[0][5];
                // a and d are the horizontal and vertical scale factors of the matrix,
                // which I use here as w and h (they are not a measured text box)
                $w = $data[0][0];
                $h = $data[0][3];
                // The parser leaves a backslash + \r when a line in a row is too long, so discard it
                $s = mb_convert_encoding(str_replace("\\\r", '', $data[1]), 'UTF-8');
                $coordinates[] = "[x:$x, y:$y, w: $w, h:$h]{$s}";
            }
            $currentPage++;
        }
        if ($coordinates === []) {
            return back()->with('error', 'Coordinates not found.');
        }
        return $coordinates;
    }
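
To make it clearer which value is which, here is a minimal sketch that names the parts of a single getDataTm() entry. It relies on the standard PDF text matrix layout [a, b, c, d, e, f], where e and f are the translation (X and Y, measured from the bottom-left corner). The helper name describeTmEntry and the page height of 842.0 are assumptions for illustration, not part of Smalot\PdfParser. The font ID and font size at indexes 2 and 3 appear only when setDataTmFontInfoHasToBeIncluded(true) is set, as in the dump from the question.

```php
<?php
// Hypothetical helper (not part of Smalot\PdfParser): gives names to the
// fields of one entry returned by $page->getDataTm().
function describeTmEntry(array $entry, float $pageHeight): array
{
    // $entry[0] is the text matrix [a, b, c, d, e, f], stored as strings.
    [$a, $b, $c, $d, $e, $f] = array_map('floatval', $entry[0]);

    return [
        'x'        => $e,                // horizontal offset from the left edge
        'yBottom'  => $f,                // vertical offset from the bottom edge
        'yTop'     => $pageHeight - $f,  // same position, measured from the top
        'scaleX'   => $a,                // horizontal scale factor, not a width
        'scaleY'   => $d,                // vertical scale factor, not a height
        'text'     => $entry[1],
        // Present only when font info is enabled in the Config.
        'fontId'   => $entry[2] ?? null,
        'fontSize' => isset($entry[3]) ? (float) $entry[3] : null,
    ];
}

// Sample entry copied from the dump in the question.
$entry = [
    ['1.00055', '0', '0', '1', '70.8', '745.72'],
    'Column1  Column2  Column3 ',
    'R10',
    '12',
];

// 842.0 is an assumed page height; in real code use $details['MediaBox'][3].
$info = describeTmEntry($entry, 842.0);
printf("x=%.2f yTop=%.2f font=%s size=%.0f\n",
    $info['x'], $info['yTop'], $info['fontId'], $info['fontSize']);
// prints: x=70.80 yTop=96.28 font=R10 size=12
```

Note that the matrix carries only position and scale. If a real width and height of each text run are needed, they would have to be estimated separately, for example from the font size and the string length.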
    

    Also, I don't know why the parser automatically adds \r when a line is too long. When I extract the text to a .txt file, it shows up at seemingly random places, like Column1 Col\rumn2. I installed spatie/pdf-to-text, and it doesn't break long lines this way, but it doesn't extract the PDF header or coordinates, so I have to stick with PdfParser.
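
One plausible explanation for the stray \r: inside a PDF literal string, a backslash at the end of a line is a line continuation, meaning the string carries on across the line break and neither the backslash nor the line break belongs to the text. If the parser hands those bytes back unchanged, they can be stripped afterwards. The sketch below assumes that is what happens; stripLineContinuations is a hypothetical helper, and it also covers \n and \r\n line endings in case the PDF uses them.

```php
<?php
// Hypothetical helper: removes a backslash that is immediately followed by
// a line break (\r\n, \r or \n), i.e. a PDF literal-string line continuation.
function stripLineContinuations(string $text): string
{
    // '\\\\' in a single-quoted PHP string is a regex-escaped literal backslash.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\\\\(?:\r\n|\r|\n)/', '', $text) ?? $text;
}

// "\\\r" in double quotes is a backslash followed by a carriage return.
echo stripLineContinuations("Column1 Col\\\rumn2"), "\n";
// prints: Column1 Column2
```

A lone \r or \n without a preceding backslash is left untouched, so genuine line breaks in the extracted text survive.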