I'm working on a TypeScript project and need to save a large list of records in Parquet format using SNAPPY compression with an explicit schema.
I've successfully written data using both parquetjs and duckdb, but parquetjs is significantly slower: nearly 7× slower in the test below. The issue with DuckDB is that it infers column types from the provided data, which is not something I want.
Is there a way to quickly save a large list of records in Parquet format using SNAPPY compression with an explicit schema?
import pkg from "duckdb";
const { Database } = pkg;
import parquet from "parquetjs";
// Generate fake records
const N_RECORDS = 2 * 10e5; // 2,000,000 records
const records = Array.from({ length: N_RECORDS }, (_, index) => ({
  name: `fake_name_${index}`,
  value: index,
}));
console.log("Records generated");
// Save records using duckdb
const saveRecordsWithDuckDb = async () => {
  const path = "./duckdb.parquet"; // holds the intermediate JSON dump, then gets overwritten by the Parquet output
  const db = new Database(":memory:");
  const fs = await import("fs/promises");
  // Dump the records to JSON so DuckDB can read them back with read_json_auto
  await fs.writeFile(path, JSON.stringify(records, null, 2));
  const copySQL = `
    CREATE OR REPLACE TABLE temp_data AS
    SELECT * FROM read_json_auto('${path}');
    COPY temp_data TO '${path}' (FORMAT PARQUET, COMPRESSION 'SNAPPY');
  `;
  await new Promise<void>((resolve, reject) => {
    db.run(copySQL, (err) => (err ? reject(err) : resolve()));
  });
};
// Save records using parquetjs
const saveRecordsWithParquetJs = async () => {
  const path = "./parquetjs.parquet";
  const schema = new parquet.ParquetSchema({
    name: { type: "UTF8", compression: "SNAPPY" },
    value: { type: "INT32", compression: "SNAPPY" },
  } as any);
  const writer = await parquet.ParquetWriter.openFile(schema, path);
  for (const record of records) {
    await writer.appendRow(record);
  }
  await writer.close();
};
console.time("saveRecordsWithDuckDb");
await saveRecordsWithDuckDb();
console.timeEnd("saveRecordsWithDuckDb");
console.time("saveRecordsWithParquetJs");
await saveRecordsWithParquetJs();
console.timeEnd("saveRecordsWithParquetJs");
saveRecordsWithDuckDb: 831.669ms
saveRecordsWithParquetJs: 5.799s
Note: With DuckDB, I’m not specifying the schema, so it infers column types from the data. This becomes problematic in production: if a nullable column has only null values during one save, DuckDB can't infer the correct type and defaults to STRING. As a result, some Parquet files have inconsistent schemas (wrong column types or missing fields).
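For illustration (a hypothetical batch, not part of the benchmark above), a save where the nullable column holds only nulls gives read_json_auto nothing to infer a numeric type from:
// Hypothetical batch: "value" is meant to be INT32, but every entry is null,
// so read_json_auto has no number to infer the type from and the resulting
// Parquet file's "value" column ends up typed differently from other saves.
const allNullBatch = Array.from({ length: 3 }, (_, index) => ({
  name: `fake_name_${index}`,
  value: null,
}));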
If you already know the schema, you could create the DuckDB table with an explicit schema rather than having it inferred, or cast the columns in the SELECT. For example:
CREATE TABLE temp_data (name VARCHAR, value INTEGER);
INSERT INTO temp_data SELECT * FROM read_json_auto('${path}');
Or, casting the columns in the SELECT instead:
CREATE OR REPLACE TABLE temp_data AS
SELECT CAST(name AS VARCHAR) AS name, CAST(value AS INTEGER) AS value
FROM read_json_auto('${path}');
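Applied to the saveRecordsWithDuckDb function from the question, that could look roughly like the sketch below (assuming name maps to VARCHAR/UTF8 and value to INTEGER/INT32; the function name and the separate ./records.json intermediate path are mine, and I haven't benchmarked it):
const saveRecordsWithDuckDbExplicitSchema = async () => {
  const jsonPath = "./records.json"; // hypothetical intermediate file, kept separate from the Parquet output
  const parquetPath = "./duckdb.parquet";
  const db = new Database(":memory:");
  const fs = await import("fs/promises");
  await fs.writeFile(jsonPath, JSON.stringify(records));
  const copySQL = `
    -- explicit column types instead of read_json_auto's inference
    CREATE TABLE temp_data (name VARCHAR, value INTEGER);
    INSERT INTO temp_data SELECT name, value FROM read_json_auto('${jsonPath}');
    COPY temp_data TO '${parquetPath}' (FORMAT PARQUET, COMPRESSION 'SNAPPY');
  `;
  await new Promise<void>((resolve, reject) => {
    db.run(copySQL, (err) => (err ? reject(err) : resolve()));
  });
};
The all-null case from your note should then come out as an INT32 column in the Parquet file, since the column type comes from the table definition rather than from inference.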