I have the same problem as the one discussed here: http://blog.joshsoftware.com/2014/08/13/pdf-to-plain-text-processing-using-docsplit/ However, the solution they propose for docsplit doesn't work for me.
Docsplit.extract_text(filepath, {:pdf_opts => '-layout', output: 'tmp_text_file'})
The :pdf_opts => '-layout' option doesn't do anything, and I can't find any documentation about options like it, so I get a single word per line in the output text file.
Does anyone know how to get an accurate text file?
Thank you
If you read the blog post carefully, you'll see that passing
:pdf_opts => '-layout'
is not yet supported by the master branch of the docsplit gem. For this you need the changes from https://github.com/documentcloud/docsplit/pull/114. So use this in your Gemfile:
gem 'docsplit', git: 'git://github.com/narutosanjiv/docsplit.git'
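If you would rather not depend on a fork, a rough alternative is to call pdftotext yourself with -layout, which is the flag the blog post is trying to pass through. This is only a sketch: it assumes the pdftotext binary (from poppler-utils or xpdf) is on your PATH, and the helper names pdftotext_command and extract_text_with_layout are made up for illustration. They are not part of docsplit.

```ruby
require 'open3'

# Build the argument list for pdftotext.
# -layout asks pdftotext to keep the original physical layout of the page.
def pdftotext_command(pdf_path, txt_path)
  ['pdftotext', '-layout', pdf_path, txt_path]
end

# Run pdftotext and return the extracted text.
# Raises if the command exits with a non-zero status.
def extract_text_with_layout(pdf_path, txt_path)
  _stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, txt_path))
  raise "pdftotext failed: #{stderr}" unless status.success?
  File.read(txt_path)
end
```

You would call it as extract_text_with_layout(filepath, 'tmp_text_file.txt'). Passing the arguments as an array avoids going through a shell, so file names with spaces don't need quoting.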
Hope this helps. Let me know if you still face any issues.