regexnpmsemantic-versioningcomputation-theory

NPM Version Range Grammar not regular?


When specifying required versions of a dependency in npm, we can specify ranges of versions. For example: 1.2.7 || >=1.2.9 <2.0.0 Specification: https://github.com/npm/node-semver?tab=readme-ov-file#ranges

BNF:

range-set  ::= range ( logical-or range ) *
logical-or ::= ( ' ' ) * '||' ( ' ' ) *
range      ::= hyphen | simple ( ' ' simple ) * | ''
hyphen     ::= partial ' - ' partial
simple     ::= primitive | partial | tilde | caret
primitive  ::= ( '<' | '>' | '>=' | '<=' | '=' ) partial
partial    ::= xr ( '.' xr ( '.' xr qualifier ? )? )?
xr         ::= 'x' | 'X' | '*' | nr
nr         ::= '0' | ['1'-'9'] ( ['0'-'9'] ) *
tilde      ::= '~' partial
caret      ::= '^' partial
qualifier  ::= ( '-' pre )? ( '+' build )?
pre        ::= parts
build      ::= parts
parts      ::= part ( '.' part ) *
part       ::= nr | [-0-9A-Za-z]+

I'm writing a json schema where some string fields should require version range definitions that match this npm version range grammar. But so far, the only way I know for enforcing a string to have a particular format is by defining a regular expression that it should match. Now I think that the npm version range grammar is not regular, but I haven't managed to prove this. Can you confirm / reject this hypothesis? If you reject it, what would be the regex describing that grammar?

And if the npm version range grammar is not regular, is there any other way of checking whether the string in a field matches it?

I have tried finding a regex for the npm version range grammar, but haven't found one. I have tried applying the pumping lemma to prove that the grammar is not regular, but I have failed to come up with a word in the language that is longer than n and is no longer part of the language when the second part of its subdivision is pumped up.


Solution

  • This BNF describes a regular language. Its rule tree has no cycles. So you can translate the rules into regex in a bottom-up way until you reach the root.

    Here is such a translation:

    BNF rule regex
    logical-or ([ ]*\|\|[ ]*)
    nr (0|[1-9]\d*)
    xr ([xX*0]|[1-9]\d*)
    part [-\dA-Za-z]+
    parts [-\dA-Za-z]+(\.[-\dA-Za-z]+)*
    build [-\dA-Za-z]+(\.[-\dA-Za-z]+)*
    pre [-\dA-Za-z]+(\.[-\dA-Za-z]+)*
    qualifier (-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?
    partial ([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    caret \^([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    tilde ~([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    primitive ([<>]=?|=)([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    simple ([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    hyphen ([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?-([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?
    range (([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?-([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?|([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?([ ]([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?)*)?
    range-set (([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?-([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?|([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?([ ]([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?)*)?(([ ]*\|\|[ ]*)(([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?-([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?|([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?([ ]([~^=]|[<>]=?)?([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(\.([xX*0]|[1-9]\d*)(-[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?(\+[-\dA-Za-z]+(\.[-\dA-Za-z]+)*)?)?)?)*)?)*

    Remarks

    If you require that the whole input represents a match, then you'll need to add the ^ and $ assertions around the regex.

    The regex is quite long because of multiple repetitions of the same pattern. It should be possible to reduce it with the use of look-around assertions, but at least this should demonstrate that it is possible.

    The BNF seems quite allowing. For instance, it allows range to be '', which I understand to be the empty string. That means that a range-set can have an empty OR-operand, like 1.2.3 || || 4.5.6 or could start or end with ||. This behaviour got of course translated into the regex.