Tags: javascript, firebase, google-cloud-vertex-ai, google-gemini

How do you add non-text parts to a multiturn chat with Gemini / Vertex AI?


Overview

After reviewing the documentation, it looks like multi-turn chat is designed to receive only a text prompt. Yet multi-modal chat is clearly implemented across many AI products: Gemini, Claude, ChatGPT, and more.

So, what am I missing here? What's the right approach to managing / supporting a multi-turn, multi-modal conversation with Vertex / Gemini 1.5?

Desired Behavior

  // To generate text output, call generateContent with the text and image
  const result = await model.generateContent([prompt, imagePart]);

Stack


Solution

  • When starting a new chat session, you can pass a history in which each user/model entry defines its own parts. Those parts can be text, images, video, and so on.

    If we modify the sample from the docs link you sent:

    const chat = model.startChat({
      history: [
        {
          role: "user",
          parts: [
            { text: "Hello, I have 2 dogs in my house." },
            {
              inlineData: {
                mimeType: "image/jpeg",
                data: `base64-encoded-image-data`,
              },
            },
          ],
        },
        {
          role: "model",
          parts: [{ text: "Great to meet you. What would you like to know?" }],
        },
      ],
      generationConfig: {
        maxOutputTokens: 100,
      },
    });
    

    Here you can see how to include two parts (text and an image) in a single message.

    Exact part definition - https://firebase.google.com/docs/reference/js/vertexai-preview.content.md#content_interface

    The same can be done when submitting a reply to an already started chat conversation - https://firebase.google.com/docs/reference/js/vertexai-preview.chatsession#chatsessionsendmessage
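
As a rough sketch of that last point, a follow-up turn could mix text and an image like this. It assumes `chat` is the object returned by `model.startChat(...)` above and that `sendMessage` accepts an array of strings and parts, as the linked reference describes. `toInlinePart` and `askAboutImage` are hypothetical helper names used only for illustration, not part of the SDK.

```javascript
// Hypothetical helper: wraps base64 data in the same inlineData part
// shape used in the history example above.
function toInlinePart(base64Data, mimeType) {
  return { inlineData: { mimeType, data: base64Data } };
}

// Hypothetical helper: sends one follow-up turn containing both text and
// an image on an existing chat session, then returns the reply text.
// Assumption: `chat` comes from model.startChat(...).
async function askAboutImage(chat, question, base64Image, mimeType = "image/jpeg") {
  const result = await chat.sendMessage([
    question,                            // plain strings are treated as text parts
    toInlinePart(base64Image, mimeType), // non-text part in the same turn
  ]);
  return result.response.text();
}
```

Usage would be along the lines of `const answer = await askAboutImage(chat, "Which breed is the dog on the left?", base64Image);`. Because the session keeps its own history, earlier images and text should remain part of the context for later turns.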