I am writing a script that is supposed to search inside all the PDF files in a directory. I have found a converter named "pdftotext" which lets me use grep on PDF files, but I can only get it to work with one file. When I try to run it over all the files in a directory, it fails. Any suggestions?
This works for a single file:
pdftotext my_file.pdf - | grep 'hot'
This fails when finding the PDF files, converting them to text, and grepping:
SHELL PROMPT>find ~/.personal/tips -type f -iname "*" | grep -i "*.pdf" | xargs pdftotext |grep admin
pdftotext version 3.00
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-layout : maintain original physical layout
-raw : keep strings in content stream order
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-cfg <string> : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
SHELL PROMPT 139>
xargs is the wrong tool for this job: find does everything you need built-in.
find ~/.personal/tips \
-type f \
-iname "*.pdf" \
-exec pdftotext '{}' - ';' \
| grep hot
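One limitation of sending every file's text into a single grep is that the output does not say which PDF a match came from. A minimal sketch of one way around that follows. The function name search_pdfs is hypothetical, and the -q option is taken from the pdftotext usage listing in the question. Each matching line is prefixed with the path of the file it was extracted from.

```shell
# search_pdfs DIR PATTERN   (hypothetical helper, not part of pdftotext)
# Prints "path: matching line" for every matching line of extracted text.
search_pdfs() {
  # find passes the matched paths as arguments to an inline sh script;
  # "{} +" batches many paths into one invocation of that script.
  find "$1" -type f -iname '*.pdf' -exec sh -c '
    pattern=$1; shift
    for pdf in "$@"; do
      # -q silences pdftotext messages; "-" writes the text to stdout.
      pdftotext -q "$pdf" - | grep -i -e "$pattern" |
        while IFS= read -r line; do
          printf "%s: %s\n" "$pdf" "$line"
        done
    done
  ' sh "$2" {} +
}
```

With that defined, something like search_pdfs ~/.personal/tips hot would list each hit next to its file. Passing the paths as arguments, rather than splicing them into the script text, keeps filenames with spaces or quotes intact.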
That said, if you did want to use xargs for some reason, correct usage would look something like:
find ~/.personal/tips \
-type f \
-iname "*.pdf" \
-print0 \
| xargs -0 -J % -n 1 pdftotext % - \
| grep hot
Note that:
- The find command uses -print0 to NUL-delimit its output.
- The xargs command uses -0 to NUL-delimit its input (which also turns off some behavior which would lead to incorrect handling of filenames with whitespace in their names, literal quote characters, etc).
- The xargs command uses -n 1 to call pdftotext once per file.
- The xargs command uses -J % to specify a sigil for where the replacement should happen, and uses that % in the pdftotext command line appropriately.
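The -J option is a BSD xargs extension and is not available in GNU xargs. As a hedged sketch for systems without it, -I % can take its place: it runs the command once per input item and substitutes the item wherever % appears. The function name search_pdfs_xargs is hypothetical, and -q comes from the usage listing in the question.

```shell
# search_pdfs_xargs DIR PATTERN   (hypothetical helper)
# Same pipeline as above, using -I instead of the BSD-only -J.
search_pdfs_xargs() {
  # -print0 and -0 keep each filename as one NUL-delimited item;
  # -I % runs pdftotext once per item, with % replaced by the path.
  find "$1" -type f -iname '*.pdf' -print0 |
    xargs -0 -I % pdftotext -q % - |
    grep -i -e "$2"
}
```

Because -I already runs the command once per item, -n 1 is left out here.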