pythondatabasenlpdatasetword-embedding

How to store Bag of Words or Embeddings in a Database


I would like to store vector features, like Bag-of-Words or Word-Embedding vectors of a large number of texts, in a dataset, stored in a SQL Database. What're the data structures and the best practices to save and retrieve these features?


Solution

  • Word vectors should generally be stored as BLOBs if possible. If not they can be stored as json arrays. Since the only reasonable operation for word vectors is to look them up by the word key the other details don't particularly matter.

    For bag of words you would typically need three columns, this is what it would look like in sqlite.

    create table bow (
      doc_id int,
      word text,
      count int)
    

    Where your document IDs come from somewhere else. If you need to you can make (doc_id, word) the key.

    However, storing features like this in a SQL DB is generally not helpful. When you access word counts or word vectors you typically don't need a subset of them, you need them all at once, so the relational features of SQL aren't helpful.