jq webvtt

WebVTT to JSON using `jq`


I'd like to convert some WebVTT to JSON. I have a jq filter that works, but it could be improved (jqplay.org example).

Here's a snippet of example WebVTT.

WEBVTT

dc9af9e8-8909-11ee-80b1-44850052397e
00:00:08.280 --> 00:00:08.880
Good morning Mr. Phelps.

ddf6092c-8909-11ee-bc4f-44850052397e
00:00:09.720 --> 00:00:12.840
Your mission, should you choose to
accept it,

e03b2da2-8909-11ee-a212-44850052397e
00:17:52.052 --> 00:17:55.586
This tape will self-destruct in 5
seconds.

Using this filter (with --null-input --raw-input options),

def rtrim: rtrimstr("\r");
input|rtrim|{filename: input_filename, filetype: ., cues:
[foreach inputs as $i
  ( {start:false, theEnd:false};
    if ($i|test("^[a-f0-9]{8}-"))
    then .start=true|.theEnd=false|.out={uid: $i|rtrim, timeline: input|rtrim, subtitles: [input|rtrim]} 
    else
      if ($i|test("^\\s*$"))
      then .theEnd=.start
      else if .start then .out+={subtitles: (.out.subtitles+[$i|rtrim])} else .theEnd=false end
      end
    end;
    if (.start and .theEnd) then [.out] else empty end
  )
| add]
}

... the desired output is produced for my example files.

{
  "filename": "<stdin>",
  "filetype": "WEBVTT",
  "cues": [
    {
      "uid": "dc9af9e8-8909-11ee-80b1-44850052397e",
      "timeline": "00:00:08.280 --> 00:00:08.880",
      "subtitles": [
        "Good morning Mr. Phelps."
      ]
    },
    {
      "uid": "ddf6092c-8909-11ee-bc4f-44850052397e",
      "timeline": "00:00:09.720 --> 00:00:12.840",
      "subtitles": [
        "Your mission, should you choose to",
        "accept it,"
      ]
    },
    {
      "uid": "e03b2da2-8909-11ee-a212-44850052397e",
      "timeline": "00:17:52.052 --> 00:17:55.586",
      "subtitles": [
        "This tape will self-destruct in 5",
        "seconds."
      ]
    }
  ]
}

rtrim is used in case the WebVTT file has DOS \r\n line endings. Every if statement has an else clause because jq versions before 1.7 require one.

Issues

  1. Filter requires an extra newline at the end of the file to emit the last cue.
  2. Filter is mostly sufficient, but a bit of a mess. It was patched until the desired output appeared. It could be improved.

I know there are other solutions/libraries/apps to process WebVTT, but I'm curious to know if jq can do it on its own.


Solution

  • The WebVTT file format allows for many more content items, e.g. text headers, cue settings, NOTE blocks, STYLE blocks, etc., even a BOM at the beginning. But if you're happy considering only the subset used in your example (backed by your statement that your sample filter produces the desired output), have a look at the following rewrite:

    Line endings

    You wrote: "rtrim is used in case the WebVTT file has DOS \r\n line endings."

    You could read the entire input into one string using jq --raw-input --slurp, then remove all carriage return characters using gsub("\r"; ""), so from then on you only have to deal with newline characters.
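
    As a quick check of this step, the sketch below feeds a short CRLF-terminated sample through jq -Rs. The function name strip_cr and the sample text are made up for illustration; only gsub("\r"; "") comes from the approach described above.

```shell
# strip_cr is a hypothetical helper name, not part of jq.
# -R reads raw text instead of JSON, -s slurps all of it into one string.
strip_cr() {
  jq -Rs 'gsub("\r"; "")'
}

# Made-up sample with DOS line endings; prints the cleaned text as one JSON string.
printf 'WEBVTT\r\n\r\nid-1\r\n' | strip_cr
```

    The printed JSON string should contain only \n escapes and no \r.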

    Final newline

    You wrote: "Filter requires an extra newline at the end of the file to emit the last cue."

    To partition the long string into empty-line-delimited blocks, it can be split into an array at any occurrence of at least two consecutive newline characters (i.e. at least one empty line) using the regex \n\n+. That way, it doesn't matter if there is no newline at the end. Combine it with the regex \n$ to also catch (and eliminate) a single newline ending: splits("\n(\n+|$)")
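
    To see the splitting on its own, here is a small sketch that runs only this regex over a made-up sample (already free of \r characters). The function name split_blocks and the sample text are assumptions for illustration.

```shell
# split_blocks is a hypothetical helper name, not part of jq.
# It collects the pieces produced by splits(...) into a compact JSON array.
split_blocks() {
  jq -Rsc '[splits("\n(\n+|$)")]'
}

# Made-up sample: two blank lines between the cues and one trailing newline.
printf 'WEBVTT\n\nid-1\nline\n\n\nid-2\nline\n' | split_blocks

# The same sample without the trailing newline.
printf 'WEBVTT\n\nid-1\nline\n\n\nid-2\nline' | split_blocks
```

    With the trailing newline, the array ends in an empty string, because the final \n is itself a separator with nothing after it. Without it, the array simply ends with the last block. Either way the last cue is not lost.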

    Output composition

    The first block of text is taken to be the line containing the .filetype, and the remaining blocks (.[1:]) become the items of the .cues array. Use map to transform each block: first split it into lines using . / "\n" (dividing a string by another string splits it), then drop any block that has no lines using select(has(0)), such as the empty string left over after a trailing newline, and finally assign the first line (.[0]), the second line (.[1]), and all remaining lines (.[2:]) to .uid, .timeline, and .subtitles, respectively.

    # jq -Rs
    
    gsub("\r"; "") | [splits("\n(\n+|$)")] | {
      filename: input_filename, filetype: first,
      cues: .[1:] | map(. / "\n" | select(has(0)) | {
        uid: .[0], timeline: .[1], subtitles: .[2:]
      })
    }
    
    {
      "filename": "<stdin>",
      "filetype": "WEBVTT",
      "cues": [
        {
          "uid": "dc9af9e8-8909-11ee-80b1-44850052397e",
          "timeline": "00:00:08.280 --> 00:00:08.880",
          "subtitles": [
            "Good morning Mr. Phelps."
          ]
        },
        {
          "uid": "ddf6092c-8909-11ee-bc4f-44850052397e",
          "timeline": "00:00:09.720 --> 00:00:12.840",
          "subtitles": [
            "Your mission, should you choose to",
            "accept it,"
          ]
        },
        {
          "uid": "e03b2da2-8909-11ee-a212-44850052397e",
          "timeline": "00:17:52.052 --> 00:17:55.586",
          "subtitles": [
            "This tape will self-destruct in 5",
            "seconds."
          ]
        }
      ]
    }
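
    If you want to reproduce this locally, the sketch below wraps the same filter in a shell function and feeds it a made-up cue that has both problems from the question: DOS line endings and no newline after the last cue. The name vtt_to_json and the sample cue are assumptions for illustration.

```shell
# vtt_to_json is a hypothetical wrapper name; the filter is the one from this answer.
vtt_to_json() {
  jq -Rs '
    gsub("\r"; "") | [splits("\n(\n+|$)")] | {
      filename: input_filename, filetype: first,
      cues: .[1:] | map(. / "\n" | select(has(0)) | {
        uid: .[0], timeline: .[1], subtitles: .[2:]
      })
    }'
}

# Made-up sample: CRLF line endings, no newline after the last cue.
printf 'WEBVTT\r\n\r\nid-1\r\n00:00:01.000 --> 00:00:02.000\r\nHello\r\nworld' \
  | vtt_to_json | jq -c '.cues'
```

    The single cue should come out with uid "id-1" and the two subtitle lines "Hello" and "world", with no \r in any field.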
    
