I have a tree-like PDF bookmark structure, which looks like this:
Content
  Author 1
    Piece 1
    Piece 2
  Author 2
    Piece 3
    Piece 4
      Movement 1
      Movement 2
      Movement 3
    Piece 5
Using qpdf, it could be formatted in JSON like this:
{
"version": 2,
"parameters": {
"decodelevel": "generalized"
},
"outlines": [
{
"dest": [
"4 0 R",
"/Fit"
],
"destpageposfrom1": 1,
"kids": [
{
"dest": [
"10 0 R",
"/Fit"
],
"destpageposfrom1": 2,
"kids": [
{
"dest": [
"10 0 R",
"/Fit"
],
"destpageposfrom1": 2,
"kids": [],
"object": "177 0 R",
"open": true,
"title": "Piece 1"
},
{
"dest": [
"15 0 R",
"/Fit"
],
"destpageposfrom1": 3,
"kids": [],
"object": "178 0 R",
"open": true,
"title": "Piece 2"
}
],
"object": "176 0 R",
"open": true,
"title": "Author 1"
},
{
"dest": [
"38 0 R",
"/Fit"
],
"destpageposfrom1": 4,
"kids": [
{
"dest": [
"38 0 R",
"/Fit"
],
"destpageposfrom1": 4,
"kids": [],
"object": "180 0 R",
"open": true,
"title": "Piece 3"
},
{
"dest": [
"45 0 R",
"/Fit"
],
"destpageposfrom1": 5,
"kids": [
{
"dest": [
"45 0 R",
"/Fit"
],
"destpageposfrom1": 5,
"kids": [],
"object": "182 0 R",
"open": true,
"title": "Movement 1"
},
{
"dest": [
"53 0 R",
"/Fit"
],
"destpageposfrom1": 6,
"kids": [],
"object": "183 0 R",
"open": true,
"title": "Movement 2"
},
{
"dest": [
"58 0 R",
"/Fit"
],
"destpageposfrom1": 7,
"kids": [],
"object": "184 0 R",
"open": true,
"title": "Movement 3"
}
],
"object": "181 0 R",
"open": true,
"title": "Piece 4"
},
{
"dest": [
"79 0 R",
"/Fit"
],
"destpageposfrom1": 8,
"kids": [],
"object": "185 0 R",
"open": true,
"title": "Piece 5"
}
],
"object": "179 0 R",
"open": true,
"title": "Author 2"
}
],
"object": "175 0 R",
"open": true,
"title": "Contents"
}
]
}
As you can see, the key 'title' is used at every level of the tree, and each node can have a 'kids' array that contains further nodes. I cannot know in advance how deeply the JSON file will be nested.
I would like to create a CSV file that keeps only the last "title" in each branch (the leaf), together with the full path leading to it and its destination page. More specifically, I would like to create something like this:
Piece 1;Author 1;Content;2
Piece 2;Author 1;Content;3
Piece 3;Author 2;Content;4
Movement 1;Piece 4;Author 2;Content;5
Movement 2;Piece 4;Author 2;Content;6
Movement 3;Piece 4;Author 2;Content;7
Piece 5;Author 2;Content;8
I have been struggling with this for a few days now, reading what I could find on the internet about jq, but I have not managed to find a similar problem with a proper solution. Any help at this point would be deeply appreciated!
Thanks!
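Given your sample JSON input, you can use jq to find the paths to all objects that have an empty array under .kids (the leaves), cut each path at every "kids" key (the levels), retrieve the objects at those positions with getpath, then compose from them an array of values (the records) and format it with @csv (and jq -r):

path(.. | objects | select(.kids == [])) as $p
| [getpath($p | .[:.[["kids"]][], length])]
| [reverse[].title, last.destpageposfrom1]
| @csv
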
"Piece 1","Author 1","Contents",2
"Piece 2","Author 1","Contents",3
"Piece 3","Author 2","Contents",4
"Movement 1","Piece 4","Author 2","Contents",5
"Movement 2","Piece 4","Author 2","Contents",6
"Movement 3","Piece 4","Author 2","Contents",7
"Piece 5","Author 2","Contents",8
Note that @csv uses , as the item delimiter and wraps strings in double quotes ("). Use e.g. map and join to produce a different format.
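
As a rough sketch of the map and join route, the following swaps @csv for a semicolon-joined line, which is closer to the format asked for in the question. The input is a reduced, hand-made sample in the same shape as the qpdf output (one author, one piece), and the variable names `json` and `result` are only for illustration.

```shell
# Reduced sample in the same shape as the qpdf output (one author, one piece).
json='{"outlines":[{"title":"Contents","destpageposfrom1":1,"kids":[{"title":"Author 1","destpageposfrom1":2,"kids":[{"title":"Piece 1","destpageposfrom1":2,"kids":[]}]}]}]}'

# Same filter as above, but the record is joined with ";" instead of @csv.
# map(tostring) turns the page number into a string before joining.
result=$(printf '%s' "$json" | jq -r '
  path(.. | objects | select(.kids == [])) as $p
  | [getpath($p | .[:.[["kids"]][], length])]
  | [reverse[].title, last.destpageposfrom1]
  | map(tostring) | join(";")
')
printf '%s\n' "$result"
# Piece 1;Author 1;Contents;2
```

Unlike @csv, join does no quoting at all, so a title that itself contains a semicolon would produce an ambiguous line.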