I used pdftotext, called from PHP, to create a lot of .txt files from PDFs.
I ran it like this, which works perfectly for all the text in all the files:
system("pdftotext -raw dir/$pdf_file 2>&1");
THE PROBLEM
However, in the new .txt files, all the images from the PDFs appear as a weird character.
Each view I open the files in shows this character differently, so each one gives me a different way to work with it.
THE QUESTION
After trying a lot of code for a week, I am still looking for a way to find and delete this weird image character from all the .txt files.
Is there a solution for this?
Or, what is the smart thing to do here: work with a PHP script, or on the command line? I am kind of lost on this one now.
In plain text, FF conventionally means Form Feed. It is a control code sent to the printer.
FF = FORM FEED (page break): decimal 12, position 00/12, octal 14, hex 0C (%0C), Ctrl+L (^L)
It marks the end of a page and tells the printer to eject it, so you should see one at each division between pages.
pdftotext has a switch to leave them out, so try:
system("pdftotext -raw -nopgbrk dir/$pdf_file 2>&1");
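For the .txt files that already exist, re-running the conversion is not the only option. Here is a minimal command-line sketch that strips the form feeds in place. The function name `strip_formfeeds` is made up for illustration, `dir` is the directory from the question, and it assumes `mktemp` is available on your system:

```shell
# Hypothetical helper: delete every form feed (0x0C) from the .txt files
# found directly inside the directory given as the first argument.
strip_formfeeds() {
    for f in "$1"/*.txt; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        tmp=$(mktemp) || return 1
        # tr -d '\f' drops form feeds and copies all other bytes unchanged.
        # Writing back with cat keeps the original file's permissions.
        tr -d '\f' < "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

Calling `strip_formfeeds dir` would clean every .txt file in `dir`. Try it on a copy first, since the files are overwritten.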