I'm working with movies_data.json
that has documents like this:
[
{
"metadata": {
"id": 580489,
"original_title": "venom: let there be carnage",
"popularity": 5401.308,
"release_date": "2021-09-30",
"vote_average": 6.8,
"vote_count": 1736,
"revenue": 424000000,
"tagline": "",
"poster_url": "https://image.tmdb.org/t/p/original/rjkmN1dniUHVYAtwuV3Tji7FsDO.jpg",
"adult": 0
},
"embedded_data": {
"overview": "After finding a host body in investigative reporter Eddie Brock, the alien symbiote must face a new enemy, Carnage, the alter ego of serial killer Cletus Kasady.",
"genre": "['Science Fiction', 'Action', 'Adventure']"
}
},
....
]
Is it fine to split .json documents? In my case, each document has a metadata field and an embedded_data field, so if I split naively, one movie's metadata might get wrongly associated with another movie's embedded data.
I've parsed the JSON into a string and split it, but my data gets dispersed, e.g. one movie's metadata ends up attached to another movie's embedded data.
This made me question how I should split my data in such cases.
So far, my code looks like this:
const loader = new JSONLoader("/input.json");
let docs = await loader.load();
// console.log(docs);
docs = JSON.stringify(docs);

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 1,
});

const docOutput = await splitter.createDocuments([docs]);
console.log(docOutput);
You most likely do not want to split the metadata and embedded data of a single movie object. Unfortunately, keeping them together in a single Document is not possible with JSONLoader and the format of your JSON file: the loader loads every string it finds in the file into a separate Document.
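To see why character-based splitting scatters your fields, here is a minimal plain-JavaScript sketch (toy data; a fixed-size slice stands in for the splitter, which ultimately cuts by character count too):

```javascript
// Toy illustration (not the LangChain splitter itself): stringify two movie
// objects and cut the string into fixed-size chunks.
const movies = [
  { metadata: { id: 1, original_title: "first" },
    embedded_data: { overview: "Overview of the first movie." } },
  { metadata: { id: 2, original_title: "second" },
    embedded_data: { overview: "Overview of the second movie." } },
];

const text = JSON.stringify(movies);
const chunkSize = 80;
const chunks = [];
for (let i = 0; i < text.length; i += chunkSize) {
  chunks.push(text.slice(i, i + chunkSize));
}

// At least one chunk now holds the tail of movie 1's overview next to
// movie 2's metadata -- exactly the cross-association described above.
const mixed = chunks.some(
  (c) => c.includes('"id":2') && c.includes("movie.")
);
console.log(mixed); // true with this toy data
```

The chunk boundaries fall wherever the character count dictates, with no regard for object boundaries, which is why stringify-then-split disperses your data.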
Here's an approach that will probably achieve what you want:

1. Read and parse the JSON file yourself instead of using JSONLoader.
2. Create one Document per movie object, putting the overview into pageContent and the remaining fields into metadata.
3. You now have an array of Documents. You can do whatever you need with them. For example, pass them into a vectorstore for retrieval later.

Example:
import { readFileSync } from "fs";
import { Document } from "langchain/document";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const filename = "movies_data.json";
const jsonData = readFileSync(filename).toString();
const movies = JSON.parse(jsonData);

const docs = [];
for (const movie of movies) {
  const doc = new Document({
    pageContent: movie.embedded_data.overview,
    metadata: {
      id: movie.metadata.id,
      original_title: movie.metadata.original_title,
      popularity: movie.metadata.popularity,
      release_date: movie.metadata.release_date,
      vote_average: movie.metadata.vote_average,
      vote_count: movie.metadata.vote_count,
      revenue: movie.metadata.revenue,
      tagline: movie.metadata.tagline,
      poster_url: movie.metadata.poster_url,
      adult: movie.metadata.adult,
      genre: movie.embedded_data.genre, // is this metadata?
      source: filename,
    },
  });
  docs.push(doc);
}

// load the docs into a vectorstore
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings()
);
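Because each movie went in as its own Document, every search hit comes back with that movie's own metadata attached; nothing can get cross-associated. A quick retrieval sketch (assumes the vectorStore from above and a configured OpenAI API key; the query string is just a hypothetical example):

```javascript
// similaritySearch(query, k) returns the k closest Documents, each with its
// pageContent (the overview) and the metadata it was stored with.
const results = await vectorStore.similaritySearch(
  "alien symbiote faces a serial killer",
  1
);
for (const result of results) {
  console.log(result.metadata.original_title, "-", result.pageContent);
}
```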