Tags: ruby, regex, nokogiri, jruby, soap4r

RegexpError in Ruby when parsing \p{IsBasicLatin} character property


I'm using JRuby 1.7.18 and have also tried JRuby 9000 (the latest version), where I get the same error. I'm using the soap4r and nokogiri libraries to parse a WSDL XML file.

When the following part of the WSDL is parsed:

<xs:pattern value="[\p{IsBasicLatin}]*"/>

I get the following error:

RegexpError: (RegexpError) invalid character property name <IsBasicLatin>: /\A[\p{IsBasicLatin}]*\z/n
nokogiri/XmlSaxParserContext.java:252:in `parse_with'
nokogiri/XmlSaxParserContext.java:252:in `parse_with'
nokogiri/XmlSaxParserContext.java:252:in `parse_with'

I read that Ruby 1.9, one of the Ruby versions JRuby 1.7.18 is compatible with, does not support character blocks such as \p{IsBasicLatin} but does support scripts such as \p{Latin}. I've tried changing IsBasicLatin to Latin, and also tried variants such as InBasicLatin and InBasic_Latin, but they all produce the same error.

The behavior is identical in JRuby 1.7.18 and JRuby 9000.

What is going wrong here and how can I fix it?


Solution

  • As mentioned in the comments, the name of the character property is actually In_Basic_Latin, not IsBasicLatin. Modern versions of Ruby (MRI, also known as CRuby, to be specific) use the Onigmo regular expression library. The official Ruby docs don't list all Unicode properties, but luckily Onigmo's documentation does.

    JRuby doesn't seem to implement the Unicode block properties, at least. However, the name and code point range of every block are publicly available, so a block can be written out as an explicit range: \p{In_Basic_Latin} is equivalent to [\u0000-\u007F], and so is [[:ascii:]].
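One way to apply the equivalence above is to rewrite the schema pattern before it is compiled. The following is only a minimal sketch: the constant and method names (XSD_BLOCK_REPLACEMENTS, translate_xsd_pattern, BASIC_LATIN) are hypothetical and not part of soap4r or nokogiri, only the Basic Latin block from the question is mapped, and it assumes the escape always appears inside a character class, as it does in the WSDL.

```ruby
# Hypothetical mapping from XML Schema block escapes to explicit ranges.
# Only the block from the question is listed; the replacement is a bare
# range, so it is valid only inside a character class ([...]).
XSD_BLOCK_REPLACEMENTS = {
  '\p{IsBasicLatin}' => '\x00-\x7F'
}.freeze

# Returns a copy of the pattern with every known block escape replaced
# by its explicit range. A String argument to gsub is matched literally,
# and the block form keeps the backslashes in the replacement untouched.
def translate_xsd_pattern(xsd_pattern)
  XSD_BLOCK_REPLACEMENTS.reduce(xsd_pattern) do |pattern, (xsd_escape, ruby_range)|
    pattern.gsub(xsd_escape) { ruby_range }
  end
end

# The pattern value from the WSDL, anchored the same way as the regexp
# shown in the error message.
ruby_source = translate_xsd_pattern('[\p{IsBasicLatin}]*')
BASIC_LATIN = Regexp.new("\\A#{ruby_source}\\z")

puts ruby_source                             # the rewritten pattern source
puts((BASIC_LATIN =~ 'Hello').inspect)       # position of the match, or nil
puts((BASIC_LATIN =~ "h\u00E9llo").inspect)  # "é" lies outside Basic Latin
```

Where exactly such a rewrite would hook into soap4r's schema handling is not covered here; editing the pattern in a local copy of the WSDL to use [\x00-\x7F] directly would be the simplest variant of the same idea.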