Tags: json, postgresql, import

Fastest way to load "maximally adversarial" text data into postgres


I have ~1TB of text data stored across a couple thousand compressed JSON files that I'd like to load into postgres. The problem with the obvious approaches like pg_bulkload or COPY is that the text data is likely to contain any delimiter I might pick! Any ideas? Ideally I'd like some way to bulk-load using the files' newlines as the row separator.


Solution

  • I've found a few solutions:

    (1) Use two consecutive double quotes ("") as the field separator. For the data to be valid JSON, any double quote inside a string value must be escaped as \", so the sequence "" can never occur inside a field. That lets you convert the JSON to a ""-delimited file with jq (or a tool of your choice) and then feed it to a fast ingest tool like pg_bulkload (see the first sketch after this list)

    (2) Convert the data to Parquet with a tool like pyarrow, then import it into the database. I'm unsure how efficient this is, but Parquet avoids the separator issue and pyarrow can read JSON natively (see the second sketch below)

    (3) Sanitize, then load: escape the delimiter characters in the text field before ingesting. It'll be slow (sanitizing 3TB of text is expensive), but it will work (see the third sketch below)

    (4) Import the raw documents as jsonb into postgres, then create a new table from the jsonb one. This will require at least 2x the disk space of your raw data (see the last sketch below)
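
A minimal sketch of option (1), assuming newline-delimited JSON and id/text fields (both names are my guesses, as are the file names). Each value is written in its JSON-escaped form with the surrounding quotes stripped, so every double quote in a field is part of a \" escape and the two-character "" separator can never occur inside a field. Note that the loaded text stays JSON-escaped, and whether your ingest tool accepts a multi-character separator is a separate question.

    import gzip
    import json
    import sys

    SEP = '""'  # cannot occur inside a JSON-escaped field

    def encode(value):
        # JSON-escape the value; for strings, drop the surrounding quotes so any
        # remaining double quote is part of a \" escape and "" never appears.
        return json.dumps(value)[1:-1] if isinstance(value, str) else json.dumps(value)

    with gzip.open("part-0001.json.gz", "rt", encoding="utf-8") as f:
        for line in f:  # assuming one JSON object per line
            rec = json.loads(line)
            # Newlines inside the text stay escaped as \n, so one record = one output line.
            sys.stdout.write(SEP.join([encode(rec["id"]), encode(rec["text"])]) + "\n")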
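
A sketch of option (2), assuming the files are newline-delimited JSON under gzip; the file names are placeholders. pyarrow's JSON reader infers a schema, and the resulting table can be written straight to Parquet, which a Parquet-aware loader can then ingest with no delimiter worries.

    import gzip

    import pyarrow.json as pj
    import pyarrow.parquet as pq

    # Read one compressed newline-delimited JSON file into an Arrow table.
    with gzip.open("part-0001.json.gz", "rb") as f:
        table = pj.read_json(f)

    # Write it out as Parquet; repeat per input file and point your
    # Parquet-capable importer at the results.
    pq.write_table(table, "part-0001.parquet")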
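
A sketch of option (3): stream each record out as properly quoted CSV, which takes care of commas, quotes, and embedded newlines in the text field, and pipe the result into COPY in csv mode. Table, field, and file names are assumptions.

    import csv
    import gzip
    import json
    import sys

    # csv.writer quotes any field containing a comma, quote, or newline,
    # which is exactly the sanitizing that COPY's csv mode needs.
    writer = csv.writer(sys.stdout)
    with gzip.open("part-0001.json.gz", "rt", encoding="utf-8") as f:
        for line in f:  # assuming one JSON object per line
            rec = json.loads(line)
            writer.writerow([rec["id"], rec["text"]])

Piped into psql it would look something like: python sanitize.py | psql -c "\copy docs(id, text) from stdin with (format csv)" (script and table names are hypothetical).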
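
A sketch of option (4) with psycopg 3; the connection string, table names, and field names are all assumptions. It stages each raw JSON line in a single jsonb column via COPY, using csv mode with quote/delimiter bytes that cannot appear unescaped in valid JSON, then projects the wanted fields into a regular table, which is where the roughly 2x disk requirement comes from.

    import gzip

    import psycopg

    with psycopg.connect("dbname=corpus") as conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS raw_docs (doc jsonb)")

        # Control characters must be \u-escaped inside valid JSON strings, so
        # \x01 and \x02 are safe choices for quote and delimiter: each input
        # line lands verbatim in the jsonb column.
        copy_sql = (
            "COPY raw_docs (doc) FROM STDIN "
            "WITH (FORMAT csv, QUOTE E'\\x01', DELIMITER E'\\x02')"
        )
        with cur.copy(copy_sql) as copy:
            with gzip.open("part-0001.json.gz", "rt", encoding="utf-8") as f:
                for line in f:  # assuming one JSON object per line
                    copy.write(line)

        # Materialise the columns you actually need from the staged documents.
        cur.execute(
            """
            CREATE TABLE docs AS
            SELECT (doc->>'id')::bigint AS id,
                   doc->>'text'         AS text
            FROM raw_docs
            """
        )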