langchain langchain-js

Should JSON doc be split?


I'm working with movies_data.json that has documents like this:

[
  {
    "metadata": {
      "id": 580489,
      "original_title": "venom: let there be carnage",
      "popularity": 5401.308,
      "release_date": "2021-09-30",
      "vote_average": 6.8,
      "vote_count": 1736,
      "revenue": 424000000,
      "tagline": "",
      "poster_url": "https://image.tmdb.org/t/p/original/rjkmN1dniUHVYAtwuV3Tji7FsDO.jpg",
      "adult": 0
    },
    "embedded_data": {
      "overview": "After finding a host body in investigative reporter Eddie Brock, the alien symbiote must face a new enemy, Carnage, the alter ego of serial killer Cletus Kasady.",
      "genre": "['Science Fiction', 'Action', 'Adventure']"
    }
  },
....
]

Is it fine to split .json documents? In my case, each record has metadata and embedded_data fields. If I split the file, one movie's metadata might get wrongly associated with another movie's embedded data.

I parsed the JSON into a string and split it, and that is exactly what happens: the data gets dispersed, so one movie's metadata ends up associated with another movie's embedded data.

How should I split my data in a case like this?

So far, my code looks like this:

const loader = new JSONLoader(
    "/input.json"
);

let docs = await loader.load();
// console.log(docs);
docs = JSON.stringify(docs)

const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 400,
    chunkOverlap: 1
});

const docOutput = await splitter.createDocuments([docs]);

console.log(docOutput);

Solution

  • You most likely do not want to split the metadata and embedded data of a single movie object. Unfortunately, JSONLoader cannot keep them together in a single Document given the format of your JSON file: the loader puts each string it finds in the file into its own Document.
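
To make that concrete, here is a rough stand-in for the "one Document per string" behavior. It is only an illustration, not the loader's actual implementation: `collectStrings` is a hypothetical helper and the sample record is a trimmed copy of the one in the question.

```javascript
// Hypothetical helper: recursively collect every string value in parsed JSON.
function collectStrings(value) {
  if (typeof value === "string") return [value];
  if (Array.isArray(value)) return value.flatMap(collectStrings);
  if (value !== null && typeof value === "object") {
    return Object.values(value).flatMap(collectStrings);
  }
  return []; // numbers, booleans, and null are skipped
}

// Trimmed copy of one record from the question.
const sample = [
  {
    metadata: { id: 580489, original_title: "venom: let there be carnage" },
    embedded_data: {
      overview:
        "After finding a host body in investigative reporter Eddie Brock, the alien symbiote must face a new enemy, Carnage, the alter ego of serial killer Cletus Kasady.",
      genre: "['Science Fiction', 'Action', 'Adventure']",
    },
  },
];

// Each string becomes a separate piece; nothing records which movie it came from.
const pieces = collectStrings(sample);
console.log(pieces.length); // 3
```

Once the strings are flattened like this, the link between a title and its overview is lost, which is why stringifying and splitting the result mixes records together.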

    Here's an approach that will probably achieve what you want:

    1. Read the JSON file into memory and parse it into an array of objects.
    2. Iterate through the array and create a Document for each object.
    3. At this point, you have an array of Documents, one per movie. You can do whatever you need with them, for example pass them into a vector store for retrieval later.

    Example:

    import { readFileSync } from "fs";
    import { Document } from "langchain/document";
    import { MemoryVectorStore } from "langchain/vectorstores/memory";
    import { OpenAIEmbeddings } from "@langchain/openai";
    
    // read the file and parse it into an array of movie objects
    const filename = "movies_data.json";
    const jsonData = readFileSync(filename, "utf8");
    const movies = JSON.parse(jsonData);
    const docs = [];
    
    for (const movie of movies) {
        const doc = new Document({
            pageContent: movie.embedded_data.overview,
            metadata: {
                id: movie.metadata.id,
                original_title: movie.metadata.original_title,
                popularity: movie.metadata.popularity,
                release_date: movie.metadata.release_date,
                vote_average: movie.metadata.vote_average,
                vote_count: movie.metadata.vote_count,
                revenue: movie.metadata.revenue,
                tagline: movie.metadata.tagline,
                poster_url: movie.metadata.poster_url,
                adult: movie.metadata.adult,
                genre: movie.embedded_data.genre,  // stored under embedded_data; keep it as metadata or add it to pageContent, as needed
                source: filename,
            }
        });
        docs.push(doc);
    }
    
    // load docs into vectorstore
    const vectorStore = await MemoryVectorStore.fromDocuments(
      docs,
      new OpenAIEmbeddings()
    );
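
If an overview is too long for a single chunk, the same principle applies: split each Document on its own rather than the stringified array. In LangChain, the splitter's `splitDocuments` method is meant for this and carries each Document's metadata over to its chunks. The sketch below is a simplified, dependency-free stand-in for that idea, assuming plain `{ pageContent, metadata }` objects; `chunkRecord` is a hypothetical helper, and the fixed-size slicing is much cruder than what `RecursiveCharacterTextSplitter` does.

```javascript
// Hypothetical helper: split one record's text into fixed-size chunks,
// copying that record's metadata onto every chunk.
function chunkRecord(record, chunkSize) {
  const chunks = [];
  for (let start = 0; start < record.pageContent.length; start += chunkSize) {
    chunks.push({
      pageContent: record.pageContent.slice(start, start + chunkSize),
      metadata: { ...record.metadata }, // shallow copy so chunks do not share one object
    });
  }
  return chunks;
}

// Placeholder records standing in for the movie Documents.
const records = [
  { pageContent: "a".repeat(90), metadata: { id: 1 } },
  { pageContent: "b".repeat(40), metadata: { id: 2 } },
];

// Split record by record, so a chunk can only ever carry its own record's metadata.
const chunked = records.flatMap((record) => chunkRecord(record, 40));
console.log(chunked.map((c) => [c.metadata.id, c.pageContent.length]));
```

Because every chunk is produced from exactly one record, a chunk boundary can never fall between one movie's metadata and another movie's text.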
    

    References:

    1. Document Loader > JSON (LangChain)