I want to build a simple tool that will allow a user to take a photo of a document and extract information such as the date/time and a few other fields.
It is simple to do this through ChatGPT's UI: upload an image, ask it for some information from the document. But is just calling ChatGPT's API from my code really a viable solution? It just feels like a black box that is liable to change without notice, i.e. fragile. I assumed I'd need to train and deploy my own model; is using ChatGPT expensive overkill for what I want to do? And if I somehow ended up with a lot of users, I'm thinking this would become a problem.

Note: I have no real AI experience or knowledge, but I do have lots of programming experience (I work full-time as a developer).
There are several ways to go about it, but you can start off with Ollama. You need to download a model such as llama3, and if you have experience with Python you are in luck.
You don't need to train a model; there are models out there that do what you need or solve this kind of problem. All you need to do is provide the LLM with your documents (images, text, PDFs, etc.) and ask questions about them.
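As a rough illustration, the ask-questions step might look like this with the ollama Python client. This is a minimal sketch that assumes the Ollama server is running and llama3 has already been pulled; the document text and prompt are hypothetical:

```python
# Minimal sketch: ask a local model about text extracted from a document.
# Assumes `pip install ollama` and that `ollama pull llama3` has been run.
import ollama

document_text = "Invoice #1042 issued 2024-03-18 by Acme Corp ..."  # hypothetical OCR output

response = ollama.chat(
    model="llama3",
    messages=[{
        "role": "user",
        "content": f"Extract the date and invoice number from this document:\n{document_text}",
    }],
)
print(response["message"]["content"])
```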
However, in some cases, if your PDF contains financial information such as annuities and the like, you might need to train it to understand how to do those kinds of calculations, or better still write a custom tool with LangChain to instruct the LLM on how to handle those specific cases.
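If you go the custom-tool route, a hedged sketch with LangChain's @tool decorator might look like this (the annuity formula and all the names here are illustrative assumptions, not something from the question):

```python
# Sketch of a custom LangChain tool for a calculation the LLM should not
# attempt itself. Requires `pip install langchain-core`.
from langchain_core.tools import tool

@tool
def annuity_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed annual payment of an ordinary annuity (loan-style payout)."""
    r = annual_rate
    return principal * r / (1 - (1 + r) ** -years)

# Bind it to a tool-calling chat model, e.g. model.bind_tools([annuity_payment]),
# so the LLM can delegate the arithmetic instead of guessing.
```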
Here are the general steps on how to go about it:
Step 1: First install Ollama, then pull the llama model, preferably llama3, which will serve as your LLM. If you go the Docker route, do a docker pull of the Ollama image and pull the model inside the container, as sketched below.
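Going by Ollama's published Docker image, Step 1 might look like this (a sketch; adjust ports and volumes to taste):

```sh
# Pull and run the Ollama server image.
docker pull ollama/ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull the llama3 model inside the running container.
docker exec -it ollama ollama pull llama3
```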
Step 2: Find a library that converts images to text (or PDF), such as an optical character recognition (OCR) library.
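For example, Tesseract via pytesseract is one common choice (my assumption; any OCR library will do). It needs the tesseract binary installed plus `pip install pytesseract pillow`:

```python
# Sketch: OCR a photographed document into plain text.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("document_photo.jpg"))  # hypothetical file
print(text)
```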
Step 3: Pick a vector DB (FAISS, Chroma, and the like) that stores your text as embedding vectors; this makes information extraction easy.
Step 4: Feed your documents to the vector DB so they are converted to vectors; similarity search over those vectors is what lets you retrieve the passages relevant to a question.
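Putting Steps 3 and 4 together, here is a minimal sketch using FAISS with Ollama embeddings through LangChain (assumes `pip install langchain-community faiss-cpu`; the chunks are hypothetical OCR output):

```python
# Sketch: embed document chunks, store them in FAISS, retrieve by similarity.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

chunks = ["Invoice #1042 issued 2024-03-18", "Total due: $250.00"]  # hypothetical

embeddings = OllamaEmbeddings(model="llama3")
db = FAISS.from_texts(chunks, embeddings)

# Fetch the chunk most relevant to a question before handing it to the LLM.
hits = db.similarity_search("When was the invoice issued?", k=1)
print(hits[0].page_content)
```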