I need to extract the headings in my PDF file that start with a # symbol
using PHP, but I don't know how to do it. Here is the link to my PDF file:
https://afxwebdesign.com/order.pdf
I have tried this script:
<?php
// Load the PDF file
$pdfFile = 'order.pdf';
// Use a PDF parsing library like TCPDF or FPDI to extract text
// Code snippet using TCPDF
require_once('tcpdf.php');
require_once('vendor/setasign/fpdi/src/autoload.php');
use setasign\Fpdi\Tcpdf\Fpdi;
$pdf = new Fpdi();
$pageCount = $pdf->setSourceFile($pdfFile);
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
$templateId = $pdf->importPage($pageNo);
$text = $pdf->getPageContent($pageNo);
preg_match_all('/^#[^#].*$/m', $text, $headings);
foreach ($headings[0] as $heading) {
echo $heading . "\n";
}
}
$pdf->close();
?>
But it's not working - it throws this error:
Fatal error: Uncaught Error: Call to undefined method setasign\Fpdi\Tcpdf\Fpdi::getPageContent() in C:\xampp\htdocs\pdfextract\index.php:17 Stack trace: #0 {main} thrown in C:\xampp\htdocs\pdfextract\index.php on line 17
I gave up on using PHP to extract the lines containing the '#' symbol and used the pdf.js JavaScript library instead, which works fine. Here is the complete JavaScript code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>PDF Line Extractor with Screenshot</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
</head>
<body>
<input type="file" id="file-input" />
<pre id="output"></pre>
<canvas id="pdf-canvas"></canvas>
<script>
document.getElementById('file-input').addEventListener('change', function(event) {
const file = event.target.files[0];
if (file) {
const reader = new FileReader();
reader.onload = function(e) {
const typedarray = new Uint8Array(e.target.result);
extractLinesAndScreenshotsFromPDF(typedarray).then(lines => {
document.getElementById('output').textContent = lines.join('\n');
}).catch(error => {
console.error('Error extracting lines and screenshots:', error);
});
};
reader.readAsArrayBuffer(file);
}
});
async function extractLinesAndScreenshotsFromPDF(data) {
const pdf = await pdfjsLib.getDocument({ data }).promise;
let extractedLines = [];
for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
const page = await pdf.getPage(pageNum);
const textContent = await page.getTextContent();
// Group text items by their y-coordinate
const groupedText = {};
textContent.items.forEach(item => {
const y = Math.floor(item.transform[5]); // Use y-coordinate for grouping
if (!groupedText[y]) {
groupedText[y] = [];
}
groupedText[y].push(item.str);
});
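// Concatenate items to form complete lines, ordered top to bottom
// (PDF y-coordinates grow upward, so the keys are sorted in descending order)
const pageTextLines = Object.keys(groupedText).map(Number).sort((a, b) => b - a).map(y => groupedText[y].join(' '));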
const filteredLines = pageTextLines.filter(line => line.includes('#'));
if (filteredLines.length > 0) {
extractedLines = extractedLines.concat(filteredLines);
// Render the page on canvas
const viewport = page.getViewport({ scale: 1.5 });
const canvas = document.getElementById('pdf-canvas');
const context = canvas.getContext('2d');
canvas.width = viewport.width;
canvas.height = viewport.height;
await page.render({ canvasContext: context, viewport }).promise;
// Take screenshot and send to server
html2canvas(canvas).then(canvas => {
const imgData = canvas.toDataURL('image/png');
const blob = dataURLToBlob(imgData);
const formData = new FormData();
const dt= "<?= date('MdyHis') ?>";
const randd = Math.floor(Math.random() * 9999990);
formData.append('screenshot', blob, `screenshot-${pageNum+dt+randd}.png`);
fetch('save_screenshot.php', {
method: 'POST',
body: formData
}).then(response => {
if (!response.ok) {
throw new Error('Network response was not ok');
}
return response.text();
}).then(data => {
console.log('Screenshot saved:', data);
}).catch(error => {
console.error('Error saving screenshot:', error);
});
});
}
}
return extractedLines;
}
function dataURLToBlob(dataURL) {
const byteString = atob(dataURL.split(',')[1]);
const mimeString = dataURL.split(',')[0].split(':')[1].split(';')[0];
const ab = new ArrayBuffer(byteString.length);
const ia = new Uint8Array(ab);
for (let i = 0; i < byteString.length; i++) {
ia[i] = byteString.charCodeAt(i);
}
return new Blob([ab], { type: mimeString });
}
</script>
</body>
</html>
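
The line-grouping step from the script above can also be pulled out into a small function that runs without a browser, which makes it easier to check the filtering on its own. This is only a minimal sketch: `extractHashLines` and `sampleItems` are hypothetical names, the sample data is made up to mimic the shape of `textContent.items`, and it assumes the items on each line arrive in left-to-right order. It keeps only lines that start with a single `#`, like the regex in my PHP attempt, rather than every line that contains `#`.

```javascript
// Hypothetical helper (not part of pdf.js): groups text items into lines by
// their rounded-down y-coordinate and returns the lines that start with a
// single '#'.
function extractHashLines(items) {
  const byY = new Map();
  for (const item of items) {
    const y = Math.floor(item.transform[5]); // transform[5] is the y position
    if (!byY.has(y)) {
      byY.set(y, []);
    }
    byY.get(y).push(item.str);
  }
  return [...byY.entries()]
    .sort((a, b) => b[0] - a[0]) // PDF y grows upward: descending = top to bottom
    .map(([, parts]) => parts.join(' ').trim())
    .filter(line => /^#[^#]/.test(line)); // same idea as /^#[^#].*$/m in PHP
}

// Made-up sample items in the shape of pdf.js textContent.items
const sampleItems = [
  { str: 'Customer notes', transform: [1, 0, 0, 1, 40, 650.1] },
  { str: '# Order', transform: [1, 0, 0, 1, 40, 700.2] },
  { str: '1234', transform: [1, 0, 0, 1, 95, 700.9] },
  { str: '# Shipping', transform: [1, 0, 0, 1, 40, 600.5] },
];

console.log(extractHashLines(sampleItems));
// prints [ '# Order 1234', '# Shipping' ]
```

In the page itself this could be called as `extractHashLines(textContent.items)` inside the page loop. Grouping by `Math.floor` of the y-coordinate is a rough heuristic, so text whose baseline differs by a pixel or more would end up on separate lines.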