Tags: java, document, large-language-model, few-shot-learning, langchain4j

How to re-use embedded documents for Few-Shot LLM queries in Langchain4j?


I have an LLM chat model with a token limit. I am trying to pass sample User Messages and expected AI Message responses to the LLM to teach it how to answer based on text extracted from a document. I am loading the document with the file system document loader:

 Document document = loadDocument(toPath("file:///filepath\\filename.pdf"));
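Depending on the langchain4j version, a PDF may need an explicit document parser. A minimal sketch of that variant, assuming the Apache PDFBox parser module is on the classpath (the path below is just a placeholder, and toPath(...) above is my own helper):

// Sketch only: assumes the langchain4j Apache PDFBox parser module is available
Document document = FileSystemDocumentLoader.loadDocument(
        Paths.get("filename.pdf"),            // placeholder path
        new ApachePdfBoxDocumentParser());     // extracts plain text from the PDF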

I am using a regex splitter so the chunks follow a pattern the LLM can pick up on:

DocumentByRegexSplitter splitter = new DocumentByRegexSplitter(regex, joiner, maxCharLimit, maxOverlap, subSplitter);
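The variables in that call could be defined, for example, like this (all values are hypothetical and should be tuned to the document's structure):

// Hypothetical values for illustration only
String regex = "\\n\\n";          // split wherever a blank line separates sections
String joiner = "\n\n";           // delimiter used when re-joining split parts
int maxCharLimit = 1000;          // maximum characters per segment
int maxOverlap = 100;             // characters of overlap between adjacent segments
DocumentSplitter subSplitter = new DocumentBySentenceSplitter(maxCharLimit, maxOverlap);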

After embedding the document (storing the vectors in an in-memory embedding store and retrieving the relevant matches), I join the matched text into an information string that I feed into a prompt template to generate a User Message.
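For reference, the embedding and retrieval step looks roughly like this (a sketch; the specific embedding model and the result count of 5 are illustrative, not my exact code):

EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();   // assumed local embedding model
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

// Split, embed and store the document segments
List<TextSegment> segments = splitter.split(document);
List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
embeddingStore.addAll(embeddings, segments);

// Retrieve the segments most relevant to the question
Embedding questionEmbedding = embeddingModel.embed(trainingQuestion).content();
List<EmbeddingMatch<TextSegment>> relevantEmbeddings =
        embeddingStore.findRelevant(questionEmbedding, 5);

The prompt template and the information string are then built like this: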

PromptTemplate promptTemplate = PromptTemplate.from(
        "Answer the following question to the best of your ability\n"
                + "\n"
                + "Question:\n"
                + "{{question}}\n"
                + "\n"
                + "Base your answer on the following information:\n"
                + "{{information}}");

// requires: import static java.util.stream.Collectors.joining;
String information = relevantEmbeddings.stream()
        .map(match -> match.embedded().text())
        .collect(joining("\n\n"));

Map<String, Object> variables = new HashMap<>();
variables.put("question", trainingQuestion);
variables.put("information", information);
Prompt prompt = promptTemplate.apply(variables);


List<ChatMessage> chatMessages = new ArrayList<>();

// Few-shot example: prompt built from the training question, plus the expected answer
chatMessages.add(prompt.toUserMessage());
chatMessages.add(new AiMessage("Expected Response"));

// Actual question, built with the same template and the same document information
variables.put("question", actualQuestion);
variables.put("information", information);
prompt = promptTemplate.apply(variables);
chatMessages.add(prompt.toUserMessage());

The training messages and the actual question go into a List, as the LangChain4j framework requires, and the list is passed to the chat model:

Response<AiMessage> response = chatModel.generate(chatMessages);

To make a long story short, I am hitting the token limit because the same document information is embedded in every few-shot message. Is there a way to make the LLM use the same document as a reference for both the few-shot examples and the actual query, so I don't consume tokens for the document multiple times?


Solution

  • I got a suggestion from a colleague to add the document information to a SystemMessage so it won't have to be passed multiple times across the training and actual User Messages. Will try this and update.
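A rough sketch of what that could look like (untested, and the message wording is illustrative): the retrieved document information goes into a single SystemMessage, so the few-shot example and the actual question no longer repeat it.

// Document information is sent once, in the system message
List<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(SystemMessage.from(
        "Base your answers on the following information:\n" + information));

// Few-shot example: question plus expected answer, without the document text
chatMessages.add(UserMessage.from(trainingQuestion));
chatMessages.add(AiMessage.from("Expected Response"));

// Actual question
chatMessages.add(UserMessage.from(actualQuestion));

Response<AiMessage> response = chatModel.generate(chatMessages);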