
How to create embedding vectors on-premise for OpenAI models?


I want to create a chatbot for confidential company documents. Due to security concerns, I want to make the embeddings locally and store the vectors locally. I’ll use OpenAI’s API to communicate with the LLM. How do I make the embeddings locally without using OpenAI's Embeddings web API?


Solution

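  • You can use spaCy to create embedding vectors locally. The vectors are only used on your side to find relevant text, so they don't have to come from OpenAI, as long as documents and queries are embedded with the same model. As stated in the official spaCy documentation: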

    spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

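    If you run get_embedding.py (listed after the output), it prints a list of 128 numbers. These are the first 128 components of the document vector, which for en_core_web_lg has 300 dimensions and is the average of the token vectors: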

    [-0.4769258499145508, 0.3881034553050995, 2.819754123687744, 0.1371886432170868, 7.742671966552734, -1.2794740200042725, 3.1346824169158936, 1.4228293895721436, 0.27339574694633484, -0.4797571003437042, 6.9937872886657715, 1.4315242767333984, -3.4885427951812744, 0.8452857136726379, 0.8568257689476013, 2.3755757808685303, 5.252387523651123, 1.0385771989822388, -2.077688217163086, 0.7908112406730652, -2.546300172805786, 0.7618517279624939, -0.42370277643203735, 3.1288869380950928, 0.6323999762535095, -3.8961079120635986, -0.4977128207683563, -1.5293715000152588, -1.5151740312576294, 1.6068800687789917, -2.0303213596343994, -2.4576945304870605, 1.4395530223846436, 0.7422757148742676, -2.270634174346924, -0.15845993161201477, 0.07029717415571213, 0.672839343547821, 5.159962177276611, -0.06168988719582558, -3.129868745803833, -1.227286696434021, -2.006021499633789, 0.4333727955818176, -1.2434427738189697, 2.46277117729187, 3.537201404571533, 0.2767142653465271, -0.7451871633529663, -2.5755043029785156, -1.1589397192001343, 1.5788328647613525, -0.508418619632721, -2.740482807159424, -2.119898557662964, 0.5995656847953796, 1.1638715267181396, 4.050228595733643, -1.1868728399276733, -4.347542762756348, 4.015085697174072, -0.23206289112567902, 0.10843563079833984, -0.5687889456748962, -0.5912571549415588, 6.662228584289551, -2.3623156547546387, -4.9967570304870605, 3.283771514892578, 3.147571563720703, -2.7288429737091064, 2.373138666152954, -3.020965576171875, -0.8559457063674927, -0.9629656672477722, 1.2457185983657837, -3.6433043479919434, 3.081699848175049, -5.936647891998291, 3.153151273727417, -4.296724319458008, 0.23952282965183258, -0.616602897644043, 1.9953927993774414, 2.9439570903778076, 1.9284285306930542, 0.47489139437675476, -1.0409842729568481, 2.129765748977661, 1.1889699697494507, -2.2774386405944824, 0.35642704367637634, 3.420785903930664, -3.1786859035491943, -2.098905563354492, -2.9918370246887207, 2.8626744747161865, -1.6979585886001587, 
-1.1304199695587158, 0.6608514785766602, 3.7660014629364014, 2.038205623626709, 3.123993158340454, 3.6879427433013916, -2.20119309425354, 4.754899501800537, 4.687614440917969, -2.214437246322632, 0.32483574748039246, 0.5160357356071472, 4.57424259185791, 1.8791313171386719, -3.228891372680664, 2.3561816215515137, 1.2214956283569336, -1.1263086795806885, 0.9208157658576965, 0.022158537060022354, -2.070528507232666, 0.5088043808937073, -0.2275187224149704, -1.7481428384780884, 2.601383686065674, 2.5015127658843994, -0.7987513542175293, -3.5253543853759766, 0.6400442123413086, -1.7350285053253174]

    get_embedding.py

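    import spacy
    
    # Requires the model to be installed first:
    #   python -m spacy download en_core_web_lg
    spacy_model = spacy.load('en_core_web_lg')
    
    user_input = 'Make an embedding vector of this text'
    
    # Doc.vector is the average of the token vectors (300 dimensions for
    # en_core_web_lg); the slice keeps only the first 128 components.
    vector = spacy_model(user_input).vector[:128].tolist()
    
    print(vector)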
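
Since the question also asks about storing the vectors locally, here is a minimal sketch of that part, assuming a plain JSON file is good enough as the store and that cosine similarity is used for lookup. The helper names (`save_store`, `load_store`, `top_k`) and the `{chunk_text: vector}` layout are my own for illustration, not part of spaCy or the OpenAI API. It uses only the standard library, so the vectors can come from any local embedding function.

```python
import json
import math
from pathlib import Path


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def save_store(path, store):
    """Write a {chunk_text: vector} mapping to a local JSON file (hypothetical layout)."""
    Path(path).write_text(json.dumps(store), encoding="utf-8")


def load_store(path):
    """Read the {chunk_text: vector} mapping back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def top_k(query_vector, store, k=3):
    """Return the k (text, score) pairs most similar to query_vector, best first."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In a real setup you would fill the store with `spacy_model(chunk).vector.tolist()` for each document chunk, embed the user's question with the same model, call `top_k`, and paste the returned chunks into the prompt you send to the chat API. Only the retrieved text ever leaves your machine, never the vectors themselves.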