I'm working on a TypeScript project and need to save a large list of records in Parquet format using SNAPPY compression with an explicit schema.
I've successfully written data using both parquetjs and duckdb, but parquetjs is significantly slower: nearly 7× slower in the test below. The issue with DuckDB is that it infers column types from the provided data, which is not something I want.
Is there a way to quickly save a large list of records in Parquet format using SNAPPY compression with an explicit schema?
import pkg from "duckdb";
const { Database } = pkg;
import parquet from "parquetjs";
// Generate fake records
const N_RECORDS = 2 * 10e5; // 2,000,000 records
const records = Array.from({ length: N_RECORDS }, (_, index) => ({
  name: `fake_name_${index}`,
  value: index,
}));
console.log("Records generated");
// Save records using duckdb
const saveRecordsWithDuckDb = async () => {
  const path = "./duckdb.parquet"; // holds the intermediate JSON dump, then gets overwritten by the Parquet output
  const db = new Database(":memory:");
  const fs = await import("fs/promises");
  // Dump the records to JSON so DuckDB can read them back with read_json_auto
  await fs.writeFile(path, JSON.stringify(records, null, 2));
  const copySQL = `
    CREATE OR REPLACE TABLE temp_data AS
    SELECT * FROM read_json_auto('${path}');
    COPY temp_data TO '${path}' (FORMAT PARQUET, COMPRESSION 'SNAPPY');
  `;
  await new Promise<void>((resolve, reject) => {
    db.run(copySQL, (err) => (err ? reject(err) : resolve()));
  });
};
// Save records using parquetjs
const saveRecordsWithParquetJs = async () => {
  const path = "./parquetjs.parquet";
  const schema = new parquet.ParquetSchema({
    name: { type: "UTF8", compression: "SNAPPY" },
    value: { type: "INT32", compression: "SNAPPY" },
  } as any);
  const writer = await parquet.ParquetWriter.openFile(schema, path);
  for (const record of records) {
    await writer.appendRow(record);
  }
  await writer.close();
};
console.time("saveRecordsWithDuckDb");
await saveRecordsWithDuckDb();
console.timeEnd("saveRecordsWithDuckDb");
console.time("saveRecordsWithParquetJs");
await saveRecordsWithParquetJs();
console.timeEnd("saveRecordsWithParquetJs");
saveRecordsWithDuckDb: 831.669ms
saveRecordsWithParquetJs: 5.799s
Note: With DuckDB, I’m not specifying the schema, so it infers column types from the data. This becomes problematic in production: if a nullable column has only null values during one save, DuckDB can't infer the correct type and defaults to STRING. As a result, some Parquet files have inconsistent schemas (wrong column types or missing fields).
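For illustration (a hypothetical batch, not part of the benchmark above), a save where the nullable column holds only nulls gives read_json_auto nothing to infer a numeric type from:
// Hypothetical batch: "value" is meant to be INT32, but every entry is null,
// so read_json_auto has no number to infer the type from and the resulting
// Parquet file's "value" column ends up typed differently from other saves.
const allNullBatch = Array.from({ length: 3 }, (_, index) => ({
  name: `fake_name_${index}`,
  value: null,
}));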
If you already know the schema, you could create the DuckDB table with an explicit schema rather than having it inferred, or cast the columns in the SELECT. For example:
CREATE TABLE temp_data (name VARCHAR, value INTEGER);
INSERT INTO temp_data SELECT * FROM read_json_auto('${path}');
Or, casting the columns in the SELECT instead:
CREATE OR REPLACE TABLE temp_data AS
SELECT CAST(name AS VARCHAR) AS name, CAST(value AS INTEGER) AS value
FROM read_json_auto('${path}');
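Applied to the saveRecordsWithDuckDb function from the question, that could look roughly like the sketch below (assuming name maps to VARCHAR/UTF8 and value to INTEGER/INT32; the function name and the separate ./records.json intermediate path are mine, and I haven't benchmarked it):
const saveRecordsWithDuckDbExplicitSchema = async () => {
  const jsonPath = "./records.json"; // hypothetical intermediate file, kept separate from the Parquet output
  const parquetPath = "./duckdb.parquet";
  const db = new Database(":memory:");
  const fs = await import("fs/promises");
  await fs.writeFile(jsonPath, JSON.stringify(records));
  const copySQL = `
    -- explicit column types instead of read_json_auto's inference
    CREATE TABLE temp_data (name VARCHAR, value INTEGER);
    INSERT INTO temp_data SELECT name, value FROM read_json_auto('${jsonPath}');
    COPY temp_data TO '${parquetPath}' (FORMAT PARQUET, COMPRESSION 'SNAPPY');
  `;
  await new Promise<void>((resolve, reject) => {
    db.run(copySQL, (err) => (err ? reject(err) : resolve()));
  });
};
The all-null case from your note should then come out as an INT32 column in the Parquet file, since the column type comes from the table definition rather than from inference.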