I use Smalot\PdfParser to extract content from PDFs. As a beginner, I have been experimenting with basic functions like getText(), getDetails(), getPages(), etc., and I noticed this output from dd($page->getDataTm());
:
0 => array:4 [▼
  0 => array:6 [▼
    0 => "1.00055"
    1 => "0"
    2 => "0"
    3 => "1"
    4 => "70.8"
    5 => "760.24"
  ]
  1 => " "
  2 => "R8"
  3 => "12"
]
1 => array:4 [▼
  0 => array:6 [▼
    0 => "1.00055"
    1 => "0"
    2 => "0"
    3 => "1"
    4 => "70.8"
    5 => "745.72"
  ]
  1 => "Column1 Column2 Column3 "
  2 => "R10"
  3 => "12"
]
So $data[i][0]
must hold the coordinates I need, but I don't know which values are X and Y, or how to extract them.
Here is the code:
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;
/* ... */
protected function getCoordinates($pdfPath)
{
    // get font details by config
    $config = new Config();
    $config->setDataTmFontInfoHasToBeIncluded(true);
    // get PDF structure
    $parser = new Parser([], $config);
    $pdf = $parser->parseFile($pdfPath);
    $coordinates = [];
    //dd($pdf->getPages()[1]->getDataTm());
    foreach ($pdf->getPages() as $page)
    {
        $page->getDataTm();
        $text = $page->getText();
        //$coordinates = ; // This is where I want to extract it
    }
    return $coordinates;
}
Here is the sample content I can copy from the PDF:
Column1 Column2 Column3 L1C1 L1C2 L1C3 L2C1 L2C2 L2C3 L3C1 L3C2 L3C3 L4C1 L4C2 L4C3
The PDF contains a table with borders. This is the output I expect after extracting it to a .txt file:
[Page : 1, width = 1, height = 2]
[x:0, y:3, w: 4, h:5]Column1 Column2 Column3
[x:6, y:7, w: 8, h:5]L1C1 L1C2 L1C3
[x:6, y:7, w: 8, h:5]L2C1 L2C2 L2C3
[x:6, y:7, w: 8, h:5]L3C1 L3C2 L3C3
[x:6, y:7, w: 8, h:5]L4C1 L4C2 L4C3
The numbers from 1 to 8 are placeholders I made up based on dd($pdf->getPages()[1]->getDataTm());
. Some values repeat in the dump, which is why there are only a few distinct placeholders. The PDF also has more than one page.
To get the page width and height I use $page->getDetails();
because the output of $page->getDataTm()
doesn't contain them.
Here is the code:
protected function getCoordinates($pdfPath)
{
    $config = new Config();
    // add configs stuff
    $parser = new Parser([], $config);
    $pdf = $parser->parseFile($pdfPath);
    $coordinates = [];
    $currentPage = 1;
    foreach ($pdf->getPages() as $page)
    {
        $details = $page->getDetails();
        $coordinates[] = "\n[Page : $currentPage, width = {$details['MediaBox'][2]}, height = {$details['MediaBox'][3]}]";
        foreach ($page->getDataTm() as $data)
        {
            // $data[0] is the text matrix [a, b, c, d, e, f]; e is the X position
            $x = $data[0][4];
            // f is measured upward from the bottom of the page, so subtract it
            // from the page height to get Y measured from the top
            $y = $details['MediaBox'][3] - $data[0][5];
            // a and d are the horizontal and vertical scale factors, used here as w and h
            $w = $data[0][0];
            $h = $data[0][3];
            // The parser leaves a backslash followed by \r where a long line wraps, so discard it
            $s = mb_convert_encoding(str_replace("\\\r", '', $data[1]), 'UTF-8');
            $coordinates[] = "[x:$x, y:$y, w: $w, h:$h]{$s}";
        }
        $currentPage++;
    }
    if ($coordinates === [])
        return back()->with('error', 'Coordinates not found.');
    return $coordinates;
}
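To make the mapping easier to see, here is a minimal sketch that names the six text matrix values using the second entry from the dump above instead of a live parser call. The helper describeTextMatrix is hypothetical (not part of PdfParser), the page height of 842 points is an assumed A4 value, and the sketch assumes the default PDF coordinate system with the origin at the bottom-left corner.

```php
<?php
// Hypothetical helper: labels the components of one $data[0] entry from getDataTm().
// Assumes the default PDF coordinate system (origin at the bottom-left of the page).
function describeTextMatrix(array $tm, float $pageHeight): array
{
    return [
        'scaleX'      => (float) $tm[0],               // a: horizontal scale factor
        'scaleY'      => (float) $tm[3],               // d: vertical scale factor
        'x'           => (float) $tm[4],               // e: X offset from the left edge
        'yFromBottom' => (float) $tm[5],               // f: Y offset from the bottom edge
        'yFromTop'    => $pageHeight - (float) $tm[5], // same position measured from the top
    ];
}

// Values copied from the dump; getDataTm() returns them as strings.
$sample = ["1.00055", "0", "0", "1", "70.8", "745.72"];

// 842.0 is an assumed page height (A4 in points); in real code it would come
// from $page->getDetails()['MediaBox'][3].
$info = describeTextMatrix($sample, 842.0);

printf("x=%.2f yFromBottom=%.2f yFromTop=%.2f\n", $info['x'], $info['yFromBottom'], $info['yFromTop']);
```

Note that scaleX and scaleY are scale factors rather than a real width and height, so the w and h values in the output are only approximations of the text size.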
I also don't know why the parser inserts \r
when a line is too long. When I write the text to a .txt
file, it shows up at random positions, like Column1 Col\rumn2
. I installed spatie/pdf-to-text and it doesn't break long lines, but it doesn't extract the PDF header or coordinates, so I have to stick with PdfParser.
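One possible explanation (an assumption, not confirmed against the parser's source): PDF string literals allow a backslash followed by a line break as a line continuation, and a reader is supposed to drop both characters. If the parser keeps them, the text would contain exactly this kind of stray backslash plus \r. A minimal sketch that strips all three line-ending variants, where stripPdfLineContinuations is a hypothetical helper and not part of PdfParser:

```php
<?php
// Hypothetical helper: removes PDF line continuations (a backslash followed by a line break).
function stripPdfLineContinuations(string $text): string
{
    // Order matters: remove backslash + CRLF before backslash + CR alone,
    // otherwise a lone LF would be left behind.
    return str_replace(["\\\r\n", "\\\r", "\\\n"], '', $text);
}

// A backslash followed by a carriage return in the middle of a word.
$broken = "Column1 Col\\\rumn2";

echo stripPdfLineContinuations($broken), PHP_EOL; // prints "Column1 Column2"
```

This would replace the single str_replace call in the loop and also covers PDFs that wrap lines with \n or \r\n instead of \r.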