Tags: langchain, python-embedding, chromadb, ollama

Why is embedding CSV file taking much longer than pdf embedding in LangChain?


I successfully embedded a 400-page PDF document within 1-2 hours. However, when I tried to embed a CSV file with about 40k rows and only one column, the estimated embedding time is approximately 24 hours.

Here is the code I used:


from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

embedder = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

file_path = 'filtered_combined_info.csv'

loader = CSVLoader(
    file_path=file_path,
    encoding='utf-8',  # or 'ISO-8859-1' if utf-8 doesn't work
    autodetect_encoding=False  # Set to True if you want to attempt autodetection
)
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(data)

persist_directory = 'db'

vectordb = Chroma.from_documents(documents=docs, 
                                 embedding=embedder,
                                 persist_directory=persist_directory)

Why is the embedding process for the CSV file taking significantly longer than for the PDF file? Are there any optimizations or changes I can make to reduce the embedding time for the CSV file?

Additionally, is there anything I am doing wrong that might be causing it to take so much time?
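A rough back-of-the-envelope comparison shows why the row count dominates here: `CSVLoader` emits one `Document` per row, and the splitter never merges separate documents, so each short row stays its own chunk and triggers its own embedding request. The page and row sizes below are assumptions for illustration only:

```python
# Hypothetical sizes, just to compare the number of embedding requests.
pdf_pages = 400
chars_per_page = 3000                        # assumed average text per page
chunk_size = 500                             # same as the splitter above
pdf_chunks = pdf_pages * chars_per_page // chunk_size

csv_rows = 40_000
# CSVLoader yields one Document per row; RecursiveCharacterTextSplitter
# splits documents but does not merge them, so 40k short rows stay 40k chunks.
csv_chunks = csv_rows

print(pdf_chunks)                # 2400 embedding calls for the PDF
print(csv_chunks)                # 40000 embedding calls for the CSV
print(csv_chunks // pdf_chunks)  # the CSV needs ~16x more requests
```

Under these assumed sizes the CSV simply issues an order of magnitude more requests to Ollama, which explains most of the gap even before any per-request overhead.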



Solution

  • I removed everything Ollama-related that I had installed on my local machine and moved the installation into Docker.

    First start Docker and run the following:

    docker run -d --rm -v ./ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
    

    and then pull and run the embedding model:

    docker exec -it ollama ollama run nomic-embed-text
    

    Now use the same code as before:

    embedder = OllamaEmbeddings(model="nomic-embed-text",
                                show_progress=True)
    

    Check the difference:

    (screenshot: the embedding now runs noticeably faster)

    I don't know why installing Ollama in Docker increases the speed, but my guess is that moving from Windows (my machine) to Docker's Linux environment made the difference.
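Independently of the Docker move, 40k rows of a single column can often be sped up by deduplicating the texts and overlapping the embedding requests, since each call is I/O-bound. Below is a minimal sketch with a stand-in `embed` function; the real call would go through `OllamaEmbeddings.embed_query`, and the toy `rows` list is a hypothetical stand-in for the CSV column:

```python
from concurrent.futures import ThreadPoolExecutor

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding call (e.g. one HTTP request to Ollama).
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

rows = ["alpha", "beta", "alpha", "gamma", "beta"]  # toy stand-in for the CSV rows

# 1) Deduplicate: identical rows only need to be embedded once.
unique_rows = list(dict.fromkeys(rows))

# 2) Overlap requests: embedding calls are I/O-bound, so threads help.
with ThreadPoolExecutor(max_workers=4) as pool:
    vectors = dict(zip(unique_rows, pool.map(embed, unique_rows)))

# 3) Map every original row back to its (shared) vector.
embedded = [vectors[r] for r in rows]
print(len(unique_rows), len(embedded))  # 3 unique texts reused across 5 rows
```

If the CSV column has many repeated values, step 1 alone can cut the request count substantially before any vectors are handed to Chroma.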