Tags: python, openai-api, azure-openai, openai-assistants-api

AzureOpenAI upload a file from memory


I am building an assistant and I would like to give it a dataset to analyze. I understand that I can upload a file that an assistant can use with the following code:

from openai import AzureOpenAI
import pandas as pd

client = AzureOpenAI(**credentials_here)

pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
}).to_csv('data.csv', index=False)

file = client.files.create(
    file=open(
        "data.csv",
        "rb",
    ),
    purpose="assistants",
)

I would prefer to upload the file directly from a data structure in memory, without writing it to disk first. How can I upload data from memory using the AzureOpenAI client?

I read that OpenAI allows users to provide bytes-like objects, so I hoped I could do this with pickle.dumps:

import pickle
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
})

file = client.files.create(
    file=pickle.dumps(df),
    purpose="assistants"
)

This snippet does not raise an error with the OpenAI client, but with the AzureOpenAI client I get the following:

openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid file format. Supported formats: ['c', 'cpp', 'csv', 'docx', 'html', 'java', 'json', 'md', 'pdf', 'php', 'pptx', 'py', 'rb', 'tex', 'txt', 'css', 'jpeg', 'jpg', 'js', 'gif', 'png', 'tar', 'ts', 'xlsx', 'xml', 'zip']", 'type': 'invalid_request_error', 'param': None, 'code': None}}

Solution

  • It looks like AzureOpenAI does accept in-memory file objects such as io.BytesIO. So one easy way to do this for a DataFrame is to serialize it to CSV text, encode that text to bytes, and wrap the bytes in io.BytesIO.

    import io
    df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [6, 7, 8, 9, 10],
        "C": [11, 12, 13, 14, 15],
    })
    
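    # Serialize to CSV without the index column (matching the question's
    # to_csv(..., index=False)), then wrap the bytes in a file-like buffer.
    in_memory_df = io.BytesIO(df.to_csv(index=False).encode("utf-8"))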
    
    file = client.files.create(
        file=in_memory_df,
        purpose="assistants"
    )
    

    Tuples of (file_name, file_contents, content_type) are also accepted, so the following snippet is also valid and is more explicit about the file name and type.

    file = client.files.create(
        file=('name_dataset_here.csv', in_memory_df, 'text/csv'),
        purpose="assistants"
    )
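
The two snippets above can be folded into a small helper. This is only a minimal sketch: `dataframe_to_upload` and its default file name are hypothetical names chosen for illustration, not part of the openai package, and the function assumes it is handed a pandas DataFrame (anything with a compatible `to_csv` method). Passing `index=False` is an assumption that mirrors the on-disk example in the question.

```python
import io


def dataframe_to_upload(df, filename="data.csv"):
    """Serialize a DataFrame to CSV in memory and return an upload tuple.

    Hypothetical helper for illustration; not part of the openai package.
    Returns (file_name, file_contents, content_type), the tuple form
    described in the answer above.
    """
    # index=False leaves out the index column, as in the question's
    # to_csv('data.csv', index=False) call.
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    # A fresh BytesIO starts at position 0, so the full payload is
    # available to whatever reads the buffer next.
    return (filename, io.BytesIO(csv_bytes), "text/csv")


# Usage, assuming `client` and `df` are defined as in the question:
# file = client.files.create(
#     file=dataframe_to_upload(df, "name_dataset_here.csv"),
#     purpose="assistants",
# )
```

Building a new buffer on every call avoids reusing a BytesIO whose read position may already have been moved by an earlier upload, and giving the file a `.csv` name makes the intended format explicit.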