ruby regex icalendar rfc2445 rfc5545

Regex parsing of iCalendar (Ruby regex)


I'm trying to parse iCalendar (RFC2445) input using a regex.

Here's a [simplified] example of what the input looks like:

BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT

I'd like to get an array of matches: the "outer" match is each VEVENT block and the inner matches are each of the field:value pairs.

I've tried variants of this:

BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT

But given the input above, the result seems to have only ONE field for each matching VEVENT, despite the +? on the capture group:

**Match 1**
field   def:456
name    def
value   456

**Match 2**
field   ghi:789
name    ghi
value   789

In the first match, I would have expected TWO fields: the abc:123 and the def:456 matches...

I'm sure this is a newbie mistake (since I seem to perpetually be a newbie when it comes to regex's...) - but maybe you can point me in the right direction?

Thanks!


Solution

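  • A repeated capture group only keeps the text from its last iteration, which is why you see just one field per VEVENT. You need to split your regex into one that matches a VEVENT and one that matches the name/value pairs. You can then use nested scan calls to find all occurrences, e.g.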

    str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
      # value uses a greedy \S+ so it captures the whole value, not just its first character
      $~[:vevent].scan(/(?<field>(?<name>\S+?):\s*(?<value>\S+))/) do
        p $~[:field], $~[:name], $~[:value]
      end
    end

    where str is your input. This outputs:

    "abc:123"
    "abc"
    "123"
    "def:456"
    "def"
    "456"
    "ghi:789"
    "ghi"
    "789"
    

    If you want to make the code more readable, I suggest you require 'English' and replace $~ with $LAST_MATCH_INFO.
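
The question asks for an array of matches rather than printed output, so here is a minimal sketch of one way to collect the results with the same nested scan approach. The method name parse_vevents and the variable names are made up for illustration, and the pattern assumes values contain no whitespace, as in the simplified sample input; real iCalendar data (folded lines, values with spaces, CRLF line endings) would need more care.

```ruby
require 'English' # stdlib; defines $LAST_MATCH_INFO as an alias for $~

# Hypothetical helper: returns one array per VEVENT block,
# each holding [name, value] pairs in the order they appear.
def parse_vevents(str)
  events = []
  # Outer scan: one iteration per BEGIN:VEVENT ... END:VEVENT block.
  str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
    fields = []
    # Inner scan: one iteration per name:value pair inside the block.
    # Assumes values contain no whitespace (true for the sample input).
    $LAST_MATCH_INFO[:vevent].scan(/(?<name>\S+?):\s*(?<value>\S+)/) do
      fields << [$LAST_MATCH_INFO[:name], $LAST_MATCH_INFO[:value]]
    end
    events << fields
  end
  events
end

input = <<~ICS
  BEGIN:VEVENT
  abc:123
  def:456
  END:VEVENT
  BEGIN:VEVENT
  ghi:789
  END:VEVENT
ICS

events = parse_vevents(input)
events.each_with_index do |fields, i|
  puts "Match #{i + 1}"
  fields.each { |name, value| puts "  #{name} = #{value}" }
end
```

Pairs are kept as arrays rather than a hash so that repeated property names inside one VEVENT are not overwritten; if names are known to be unique, calling to_h on each inner array gives a hash instead.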