bigdatajqgeojson

Streaming a big geojson with jq


I have some big geojson (see California.geojson has an example https://github.com/microsoft/USBuildingFootprints?tab=readme-ov-file#download-links). That I want to convert to a .csv (faster to import in Postgres with \copy). I am passing it to STDIN.

They are "type": "Polygon" but for my needs this is mostly point data and I do not need the full geometry (and neither the attributes). jq is perfect for this task.

Sadly, some of the biggest file seems to be big to be stored in memory (the process generated a "Killed" message). I tried the --stream argument but I failed to understand it or the process seems to be very slow (more than 3 hours and still "running").

A sample can be produced thanks to (see at the bottom of this post to get a copy of it):

jq '.features = .features[:5]' data/Alabama.geojson > sample.geojson 

This is working great for "smaller" geojson (< 1.4 GB):

jq '.features | map(.geometry.coordinates) | map(.[]) | map(first) | .[] | {"long": first, "lat": last} | [.long, .lat] | @csv' small.geojson

But I got a "Killed" message (I am assuming I am running out of memory)

Then I tried the --stream, I am not sure that I understand it correctly (this post and this issue were a big help)

This is my version with --stream (lot of "hacking")

cat sample.geojson | jq --stream "fromstream(1|truncate_stream(inputs))" | jq ' map(.geometry.coordinates) | map(.[]) | map(first) | .[] | {"long": first, "lat": last} | [.long, .lat] | @csv'

It works against sample.geojson but failed on a big geojson (say 'Ohio.geojson'). Any Ideas?

I also tried to write in a file but without any more success.

geojson sample:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959634,
              32.421887
            ],
            [
              -84.95982,
              32.421889
            ],
            [
              -84.959822,
              32.421797
            ],
            [
              -84.959767,
              32.421796
            ],
            [
              -84.959767,
              32.421771
            ],
            [
              -84.959636,
              32.421769
            ],
            [
              -84.959634,
              32.421887
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959636,
              32.42095
            ],
            [
              -84.959715,
              32.42095
            ],
            [
              -84.959714,
              32.420984
            ],
            [
              -84.959816,
              32.420985
            ],
            [
              -84.959818,
              32.420849
            ],
            [
              -84.959637,
              32.420848
            ],
            [
              -84.959636,
              32.42095
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959998,
              32.235231
            ],
            [
              -84.959877,
              32.235231
            ],
            [
              -84.959877,
              32.235288
            ],
            [
              -84.959998,
              32.235288
            ],
            [
              -84.959998,
              32.235231
            ]
          ]
        ]
      },
      "properties": {
        "release": 1,
        "capture_dates_range": ""
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.960253,
              32.422248
            ],
            [
              -84.960069,
              32.422245
            ],
            [
              -84.960067,
              32.422321
            ],
            [
              -84.960165,
              32.422323
            ],
            [
              -84.960164,
              32.422364
            ],
            [
              -84.96025,
              32.422365
            ],
            [
              -84.960253,
              32.422248
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.961602,
              32.419206
            ],
            [
              -84.961599,
              32.419354
            ],
            [
              -84.961707,
              32.419355
            ],
            [
              -84.961708,
              32.419291
            ],
            [
              -84.961794,
              32.419292
            ],
            [
              -84.961796,
              32.419208
            ],
            [
              -84.961602,
              32.419206
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    }
  ]
}


Solution

  • Your original filter can be simplified as follows:

    In summary, with further contractions like .[] | first to .[][0], you can achieve the same using

    .features[].geometry.coordinates[][0] | @csv
    

    This goes five levels deep, then picks the first item, and turns it into a CSV output. So, this can be translated to a --stream version as folloews:

    fromstream(5|truncate_stream(inputs))[0] | @csv
    
    -114.127454,34.265674
    -114.127694,34.260939
    -114.127988,34.264977
    -114.129007,34.260229
    -114.129611,34.261105
    -114.130311,34.263922
    -114.131834,34.284069
    -114.132183,34.28509
    -114.132634,34.281492
    -114.133764,34.282816
    :