I'd like to convert some WebVTT to JSON. I have a jq filter, but it could be improved.
jqplay.org example
Here's a snippet of example WebVTT.
WEBVTT

dc9af9e8-8909-11ee-80b1-44850052397e
00:00:08.280 --> 00:00:08.880
Good morning Mr. Phelps.

ddf6092c-8909-11ee-bc4f-44850052397e
00:00:09.720 --> 00:00:12.840
Your mission, should you choose to
accept it,

e03b2da2-8909-11ee-a212-44850052397e
00:17:52.052 --> 00:17:55.586
This tape will self-destruct in 5
seconds.
Using this filter (with --null-input --raw-input options),
def rtrim: rtrimstr("\r");
input | rtrim | {filename: input_filename, filetype: ., cues:
  [foreach inputs as $i
    ( {start: false, theEnd: false};
      if ($i | test("^[a-f0-9]{8}-"))
      then .start = true | .theEnd = false | .out = {uid: $i | rtrim, timeline: input | rtrim, subtitles: [input | rtrim]}
      else
        if ($i | test("^\\s*$"))
        then .theEnd = .start
        else if .start then .out += {subtitles: (.out.subtitles + [$i | rtrim])} else .theEnd = false end
        end
      end;
      if (.start and .theEnd) then [.out] else empty end
    )
  | add]
}
... the desired output is produced for my example files.
{
  "filename": "<stdin>",
  "filetype": "WEBVTT",
  "cues": [
    {
      "uid": "dc9af9e8-8909-11ee-80b1-44850052397e",
      "timeline": "00:00:08.280 --> 00:00:08.880",
      "subtitles": [
        "Good morning Mr. Phelps."
      ]
    },
    {
      "uid": "ddf6092c-8909-11ee-bc4f-44850052397e",
      "timeline": "00:00:09.720 --> 00:00:12.840",
      "subtitles": [
        "Your mission, should you choose to",
        "accept it,"
      ]
    },
    {
      "uid": "e03b2da2-8909-11ee-a212-44850052397e",
      "timeline": "00:17:52.052 --> 00:17:55.586",
      "subtitles": [
        "This tape will self-destruct in 5",
        "seconds."
      ]
    }
  ]
}
rtrim is used in case the WebVTT file has DOS \r\n line endings. All if statements have an else clause since some versions of jq require it.
I know there are other solutions/libraries/apps to process WebVTT, but I'm curious to know if jq can do it on its own.
The WebVTT file format allows for many more content items, e.g. text headers, cue settings, NOTE blocks, STYLE blocks, etc., even a BOM at the beginning. But if you're happy considering only the subset used in your example (backed by your statement that your sample filter produces the desired output), have a look at the following rewrite:
rtrim is used in case the WebVTT file has DOS \r\n line endings.
You could read the entire input into one string using jq --raw-input --slurp, then remove all carriage return characters using gsub("\r"; ""), so from then on you only have to deal with newline characters.
Filter requires an extra newline at the end of the file to emit the last cue.
To partition the long string into empty-line-delimited blocks, it can be split into an array at any occurrence of at least two consecutive newline characters (i.e. at least one empty line) using the regex \n\n+. That way, it doesn't matter if there is no newline at the end. Combine it with the regex \n$ to also catch (and eliminate) a single newline ending: splits("\n(\n+|$)")
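As a rough cross-check of the splitting step outside jq, the same idea can be tried in Python. This is only a sketch under a few assumptions: split_blocks is a hypothetical helper name, Python's re engine is not the one jq uses, so \Z (absolute end of string) stands in for $, and the group is made non-capturing so that re.split does not return the separators.

```python
import re

def split_blocks(text: str) -> list[str]:
    """Split text into blank-line-delimited blocks (hypothetical helper)."""
    # Drop carriage returns so only "\n" has to be handled from here on.
    text = text.replace("\r", "")
    # Split at a newline followed by one or more newlines, or by the end of the string.
    return re.split(r"\n(?:\n+|\Z)", text)

# One input with DOS line endings and a trailing newline, one without either.
with_newline = split_blocks("WEBVTT\r\n\r\nid-1\r\ntext\r\n")
without_newline = split_blocks("WEBVTT\n\nid-1\ntext")
print(with_newline)
print(without_newline)
```

In this Python version, a single trailing newline is consumed by the split but leaves an empty final block behind, so a later step still has to drop blocks that contain no lines.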
The first block of text is considered to be the line containing the .filetype, and the remaining items in the .[1:] array are the blocks that become the items in the .cues array. Use map to transform each item by first splitting them again into lines using /, then drop any items that had no lines using select(has(0)), and distribute the first (.[0]), second (.[1]) and all remaining lines (.[2:]) to .uid, .timeline, and .subtitles, respectively.
# jq -Rs
gsub("\r"; "") | [splits("\n(\n+|$)")] | {
  filename: input_filename, filetype: first,
  cues: .[1:] | map(. / "\n" | select(has(0)) | {
    uid: .[0], timeline: .[1], subtitles: .[2:]
  })
}
{
  "filename": "<stdin>",
  "filetype": "WEBVTT",
  "cues": [
    {
      "uid": "dc9af9e8-8909-11ee-80b1-44850052397e",
      "timeline": "00:00:08.280 --> 00:00:08.880",
      "subtitles": [
        "Good morning Mr. Phelps."
      ]
    },
    {
      "uid": "ddf6092c-8909-11ee-bc4f-44850052397e",
      "timeline": "00:00:09.720 --> 00:00:12.840",
      "subtitles": [
        "Your mission, should you choose to",
        "accept it,"
      ]
    },
    {
      "uid": "e03b2da2-8909-11ee-a212-44850052397e",
      "timeline": "00:17:52.052 --> 00:17:55.586",
      "subtitles": [
        "This tape will self-destruct in 5",
        "seconds."
      ]
    }
  ]
}
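For comparison outside jq, the same steps (strip carriage returns, split into blocks, treat the first block as the file type, map the rest to cues) might be sketched in Python as follows. The function name vtt_to_dict and the default filename value are assumptions made for illustration, and like the filter above it only covers the subset of WebVTT used in the example, not NOTE or STYLE blocks, cue settings, or a BOM.

```python
import json
import re

def vtt_to_dict(text: str, filename: str = "<stdin>") -> dict:
    """Convert the simple uid/timeline/text subset of WebVTT (hypothetical helper)."""
    # Remove carriage returns, then split at blank lines or a single trailing newline.
    blocks = re.split(r"\n(?:\n+|\Z)", text.replace("\r", ""))
    cues = []
    for block in blocks[1:]:
        if not block:
            # Skip the empty block left behind by a trailing newline.
            continue
        lines = block.split("\n")
        cues.append({
            "uid": lines[0],
            # A block with a single line has no timeline.
            "timeline": lines[1] if len(lines) > 1 else None,
            "subtitles": lines[2:],
        })
    return {"filename": filename, "filetype": blocks[0], "cues": cues}

sample = (
    "WEBVTT\r\n\r\n"
    "ddf6092c-8909-11ee-bc4f-44850052397e\r\n"
    "00:00:09.720 --> 00:00:12.840\r\n"
    "Your mission, should you choose to\r\n"
    "accept it,\r\n"
)
result = vtt_to_dict(sample)
print(json.dumps(result, indent=2))
```

The sketch deliberately keeps the same positional assumption as the jq filter: the first line of each block is the identifier, the second the timeline, and everything after that is subtitle text.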