How do you add non-text parts to a multiturn chat with Gemini / Vertex AI?

Overview

After reviewing the documentation, it looks like multi-turn chat is only set up to receive a text prompt, yet multi-turn chat with multimodal input is clearly implemented across many AI products: Gemini, Claude, ChatGPT, and more.

So, what am I missing here? What's the right approach to managing / supporting a multi-turn, multi-modal conversation with Vertex / Gemini 1.5?

Desired Behavior

Use the startChat method to start a multiturn chat.
Generate text (non-streaming) with multimodal input, as in the code block below:

// To generate text output, call generateContent with the text and image
const result = await model.generateContent([prompt, imagePart]);

Stack

JavaScript
vertexai-preview library (part of the firebase package, "firebase": "^10.12.2")

Solution

When starting a new chat session, you pass a history whose entries define parts for the user and model roles; those parts can be text, images, video, etc.

If we modify the sample from the docs link you sent:

const chat = model.startChat({
  history: [
    {
      role: "user",
      parts: [
        { text: "Hello, I have 2 dogs in my house." },
        {
          inlineData: {
            mimeType: "image/jpeg",
            data: `base64-encoded-image-data`,
          },
        },
      ],
    },
    {
      role: "model",
      parts: [{ text: "Great to meet you. What would you like to know?" }],
    },
  ],
  generationConfig: {
    maxOutputTokens: 100,
  },
});

Here you can see how to include two parts (text and an image) in a single user message.
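If the image arrives as a data URL (for example from FileReader.readAsDataURL), the inlineData.data field should hold only the raw base64 payload, without the "data:...;base64," prefix. A minimal sketch of that conversion follows; dataUrlToInlineDataPart is a hypothetical helper name, not something the SDK provides:

```javascript
// Hypothetical helper (not part of the SDK): converts a data URL such as
// "data:image/jpeg;base64,/9j/4AAQ..." into an inlineData part.
function dataUrlToInlineDataPart(dataUrl) {
  // Capture the MIME type and the base64 payload after the comma.
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Expected a base64-encoded data URL");
  }
  const [, mimeType, data] = match;
  return { inlineData: { mimeType, data } };
}
```

The returned object can then be placed in a parts array next to a { text: "..." } part.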

Exact part definition - https://firebase.google.com/docs/reference/js/vertexai-preview.content.md#content_interface

The same can be done when submitting a reply to an already started chat conversation - https://firebase.google.com/docs/reference/js/vertexai-preview.chatsession#chatsessionsendmessage
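As a rough sketch of that follow-up call: assuming chat is the session returned by startChat, and that sendMessage accepts an array of parts as described in the linked reference, a text plus image reply could look like this. buildUserParts and sendImageMessage are hypothetical helper names used only for illustration:

```javascript
// Hypothetical helper (not part of the SDK): builds the parts for one user turn.
function buildUserParts(text, base64Image, mimeType = "image/jpeg") {
  return [
    { text },
    { inlineData: { mimeType, data: base64Image } },
  ];
}

// Sends a text + image follow-up on an existing chat session.
// `chat` is assumed to be the object returned by model.startChat(...).
async function sendImageMessage(chat, text, base64Image, mimeType) {
  const parts = buildUserParts(text, base64Image, mimeType);
  const result = await chat.sendMessage(parts);
  // Non-streaming: read the full text of the model's reply.
  return result.response.text();
}
```

Usage would be along the lines of: const reply = await sendImageMessage(chat, "What breed is this dog?", base64Image);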