After reviewing the documentation, it looks like multi-turn chat is only set up to receive a text prompt, but clearly multi-modal chat is something that is implemented across many AI products: Gemini, Claude, ChatGPT, and more.
So, what am I missing here? What's the right approach to managing / supporting a multi-turn, multi-modal conversation with Vertex / Gemini 1.5?
You can use the startChat method to start a multi-turn chat. The docs you linked show a single-turn multimodal call:

// To generate text output, call generateContent with the text and image
const result = await model.generateContent([prompt, imagePart]);

This is part of the vertexai-preview library (included in the "firebase": "^10.12.2" package). When starting a new chat session, you submit a history with defined parts for the user/model roles; those parts can be text, image, video, etc.
If we modify the sample from the docs link you sent:
const chat = model.startChat({
  history: [
    {
      role: "user",
      parts: [
        { text: "Hello, I have 2 dogs in my house." },
        {
          inlineData: {
            mimeType: "image/jpeg",
            data: `base64-encoded-image-data`,
          },
        },
      ],
    },
    {
      role: "model",
      parts: [{ text: "Great to meet you. What would you like to know?" }],
    },
  ],
  generationConfig: {
    maxOutputTokens: 100,
  },
});
Here you can see how to include two parts in a single message.
The exact Part definition is here: https://firebase.google.com/docs/reference/js/vertexai-preview.content.md#content_interface
The same can be done when submitting a reply to an already started chat conversation: https://firebase.google.com/docs/reference/js/vertexai-preview.chatsession#chatsessionsendmessage
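As a sketch of that last point: sendMessage accepts the same Part shapes as the history entries do, so a multimodal follow-up turn can be built as an array of parts. The helper name buildFollowUp and the prompt text below are my own illustration, not from the Firebase docs; the commented usage assumes the chat session created by startChat above.

```javascript
// Build a mixed text + image message for a follow-up turn in an existing
// chat session. The inlineData part uses the same shape as in `history`.
function buildFollowUp(imageBase64) {
  return [
    { text: "Here is a photo of one of my dogs. What breed is it?" },
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
  ];
}

// Usage (requires the live ChatSession `chat` from startChat above):
//   const result = await chat.sendMessage(buildFollowUp(myBase64Jpeg));
//   console.log(result.response.text());
```

Because the session keeps the accumulated history, the model sees both the earlier image from the history and this new one when generating its reply.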