json csv jq qpdf

jq: How to convert a deeply nested (tree-like) JSON file to CSV when the same key appears at different levels?


I have a tree-like PDF bookmark structure that looks like this:

Contents
  Author 1
    Piece 1
    Piece 2
  Author 2
    Piece 3
    Piece 4
      Movement 1
      Movement 2
      Movement 3
    Piece 5

Using qpdf, it can be exported as JSON like this:

{
  "version": 2,
  "parameters": {
    "decodelevel": "generalized"
  },
  "outlines": [
    {
      "dest": [
        "4 0 R",
        "/Fit"
      ],
      "destpageposfrom1": 1,
      "kids": [
        {
          "dest": [
            "10 0 R",
            "/Fit"
          ],
          "destpageposfrom1": 2,
          "kids": [
            {
              "dest": [
                "10 0 R",
                "/Fit"
              ],
              "destpageposfrom1": 2,
              "kids": [],
              "object": "177 0 R",
              "open": true,
              "title": "Piece 1"
            },
            {
              "dest": [
                "15 0 R",
                "/Fit"
              ],
              "destpageposfrom1": 3,
              "kids": [],
              "object": "178 0 R",
              "open": true,
              "title": "Piece 2"
            }
          ],
          "object": "176 0 R",
          "open": true,
          "title": "Author 1"
        },
        {
          "dest": [
            "38 0 R",
            "/Fit"
          ],
          "destpageposfrom1": 4,
          "kids": [
            {
              "dest": [
                "38 0 R",
                "/Fit"
              ],
              "destpageposfrom1": 4,
              "kids": [],
              "object": "180 0 R",
              "open": true,
              "title": "Piece 3"
            },
            {
              "dest": [
                "45 0 R",
                "/Fit"
              ],
              "destpageposfrom1": 5,
              "kids": [
                {
                  "dest": [
                    "45 0 R",
                    "/Fit"
                  ],
                  "destpageposfrom1": 5,
                  "kids": [],
                  "object": "182 0 R",
                  "open": true,
                  "title": "Movement 1"
                },
                {
                  "dest": [
                    "53 0 R",
                    "/Fit"
                  ],
                  "destpageposfrom1": 6,
                  "kids": [],
                  "object": "183 0 R",
                  "open": true,
                  "title": "Movement 2"
                },
                {
                  "dest": [
                    "58 0 R",
                    "/Fit"
                  ],
                  "destpageposfrom1": 7,
                  "kids": [],
                  "object": "184 0 R",
                  "open": true,
                  "title": "Movement 3"
                }
              ],
              "object": "181 0 R",
              "open": true,
              "title": "Piece 4"
            },
            {
              "dest": [
                "79 0 R",
                "/Fit"
              ],
              "destpageposfrom1": 8,
              "kids": [],
              "object": "185 0 R",
              "open": true,
              "title": "Piece 5"
            }
          ],
          "object": "179 0 R",
          "open": true,
          "title": "Author 2"
        }
      ],
      "object": "175 0 R",
      "open": true,
      "title": "Contents"
    }
  ]
}

As you can see, the key 'title' is used at different levels of the tree, and each node can have a 'kids' array containing further nodes, and so on. I cannot know in advance how deeply the JSON file will be nested.

I would like to create a CSV file that keeps only the last "title" in each branch (the leaf), together with the full path leading to it and its destination page. More specifically, I would like to produce something like this:

Piece 1;Author 1;Contents;2
Piece 2;Author 1;Contents;3
Piece 3;Author 2;Contents;4
Movement 1;Piece 4;Author 2;Contents;5
Movement 2;Piece 4;Author 2;Contents;6
Movement 3;Piece 4;Author 2;Contents;7
Piece 5;Author 2;Contents;8

I have been struggling with this for a few days now. I have read what I could find online about jq, but I have not found a similar problem with a proper solution. Any help at this point would be deeply appreciated!

Thanks!


Solution

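  • Given your sample JSON input, you can use the following jq filter (run jq with -r to get raw output):

    # find the path to every leaf, i.e. every object whose kids array is empty
    path(.. | objects | select(.kids == [])) as $p
    # cut that path before each "kids" key, then fetch those ancestors and the leaf itself
    | [getpath($p | .[:.[["kids"]][], length])]
    # titles from leaf to root, then the leaf's destination page
    | [reverse[].title, last.destpageposfrom1]
    | @csv

    This produces:
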
    "Piece 1","Author 1","Contents",2
    "Piece 2","Author 1","Contents",3
    "Piece 3","Author 2","Contents",4
    "Movement 1","Piece 4","Author 2","Contents",5
    "Movement 2","Piece 4","Author 2","Contents",6
    "Movement 3","Piece 4","Author 2","Contents",7
    "Piece 5","Author 2","Contents",8
    

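    Note that @csv uses , as the item delimiter and wraps every string in double quotes ("), doubling any quotes inside it. To get a different format, such as the ;-separated one in the question, build the row yourself, e.g. with map and join.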
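
As a minimal sketch of that last remark, the filter can end in map(tostring) | join(";") instead of @csv. The JSON below is a trimmed, hypothetical stand-in for the qpdf output (same keys, fewer nodes), and the shell function name to_rows is only there for illustration. Note that join does no quoting at all, so this only suits titles that never contain a ; themselves.

```shell
#!/bin/sh
# Trimmed stand-in for the qpdf outline JSON (assumption: same keys as in the question).
json='{
  "outlines": [
    {"title": "Contents", "destpageposfrom1": 1, "kids": [
      {"title": "Author 1", "destpageposfrom1": 2, "kids": [
        {"title": "Piece 1", "destpageposfrom1": 2, "kids": []},
        {"title": "Piece 2", "destpageposfrom1": 3, "kids": []}
      ]},
      {"title": "Author 2", "destpageposfrom1": 4, "kids": [
        {"title": "Piece 4", "destpageposfrom1": 5, "kids": [
          {"title": "Movement 1", "destpageposfrom1": 5, "kids": []}
        ]}
      ]}
    ]}
  ]
}'

# Same traversal as the @csv version; only the formatting step differs:
# each field is turned into a string and the row is joined with ";".
to_rows() {
  jq -r '
    path(.. | objects | select(.kids == [])) as $p
    | [getpath($p | .[:.[["kids"]][], length])]
    | [reverse[].title, last.destpageposfrom1]
    | map(tostring)
    | join(";")
  '
}

printf '%s\n' "$json" | to_rows
```

With this trimmed sample the output should be one unquoted line per leaf, e.g. Piece 1;Author 1;Contents;2 for the first one. For a real file, the same filter can be applied directly, reading the qpdf JSON from a file instead of the variable.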