preprocessor, vector-database, vector-search, marqo

Resources on preprocessing PDF files + other data before running with Marqo


I was wondering if anyone had any good resources on preprocessing PDF files and other data before running them through Marqo, the vector search engine?

I'm just looking for best practices on data formatting before passing data to Marqo.


Solution

  • For PDFs, I would focus on cleaning up redundant whitespace and the stray characters or formatting that extraction tends to leave around tables and similar elements. Titles and section breaks in particular often come with redundant newlines. Depending on your use case and the size of the PDFs, you might also want to index logically separable units, such as sections, as their own documents; whether that makes sense depends on whether you want multiple matches per PDF. Hope that helps!
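
To make the cleanup and splitting ideas above concrete, here is a minimal sketch in Python. It assumes you have already extracted the raw text from the PDF with whatever tool you prefer. The names `clean_text`, `split_sections`, and `HEADING` are hypothetical helpers, not part of Marqo, and the heading pattern (numbered headings such as "2.1 Methods") is only an assumption that you would need to adapt to your own documents.

```python
import re
import unicodedata

# Hypothetical heuristic: treat lines like "1. Intro" or "2.1 Methods" as headings.
# Real PDFs vary a lot, so adjust this pattern to your own documents.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S.*$")


def clean_text(raw: str) -> str:
    """Normalize text that was already extracted from a PDF."""
    # Fold compatibility characters (ligatures, non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control/format characters (form feeds, soft hyphens), keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_sections(text: str, source: str) -> list[dict]:
    """Split cleaned text into one document per section (heading-based heuristic)."""
    sections = []
    current = ("Preamble", [])  # text before the first heading, if any
    for line in text.split("\n"):
        if HEADING.match(line):
            sections.append(current)
            current = (line, [])
        else:
            current[1].append(line)
    sections.append(current)

    docs = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip empty sections
            docs.append({
                "_id": f"{source}-{len(docs)}",  # assumed ID scheme: file name + counter
                "source": source,
                "title": title,
                "text": body,
            })
    return docs


if __name__ == "__main__":
    raw = "1. Intro\n\n\n\nHello   world\u00a0here\x0c\n2. Methods\n  We  did \ufb01ne things  \n"
    for doc in split_sections(clean_text(raw), source="report.pdf"):
        print(doc)
```

Each resulting dict is one section, so a search can return several matches from the same PDF. If you only want one match per PDF, you could skip `split_sections` and index the cleaned text as a single document instead. The dicts should be suitable for passing to the Marqo client's `add_documents` call, but check the Marqo docs for the exact arguments in your version.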