Related questions/material:
How to determine if a number is a prime with regex? (which deals with unary prime matching, while I'm looking for base ≥ 2; a nice trick nevertheless, and what got me to think about this)
http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
As is well known, the "regular expressions" supported by various programming languages generate languages that are non-regular in the formal sense and, as demonstrated in the above material, able to recognize at least some context sensitive languages.
The language L = {x | x is a prime number in base 10} is a context-sensitive language, since primality can be tested by a linear bounded automaton (but it is not a context-free language by a pumping lemma argument).
So, is it possible to write a Perl or Java regular expression which accepts precisely all prime numbers in base 10? Feel free to substitute any other base ≥ 2 or to recognize precisely all composite numbers if that feels easier.
Using escapes to, say, run arbitrary Perl code is considered cheating. Doing repeated substitutions (which is easily Turing complete) is also out of scope; the entire work should be done inside the regular expression. This question is more about the boundaries of how powerful regular expressions actually are.
NOTE: These Regexes where written in for PHP and use possessive quantifiers which are used in many but not all languages, for example java-script does not support them. Also this is very inefficient and will quickly become infeasible.
EDIT: here it is for base 10 \b(((\d)(?=[\d\s]*(\4{0,10}(n(?=.*n\3)|nn(?=.*1\3)|n{3}(?=.*2\3)|n{4}(?=.*3\3)|n{5}(?=.*4\3)|n{6}(?=.*5\3)|n{7}(?=.*6\3)|n{8}(?=.*7\3)|n{9}(?=.*8\3))?)))+)(?![\d\s]*(n(?=\4))++(..?1|(...*)\8+1))
I have used base 2 after this to make things easier.
EDIT: this one will allow you to pass in a string containing several binary numbers and matches those that are prime \b(((\d)(?=[\d\s]*(\4{0,2}n(?=.*\3)|\4{0,2})))+)(?![\d\s]*(n(?=\4))++(..?1|(...*)\7+1))
It basically does this by using boundary \b instead of start of string ^, it allows any number of decimals and spaces when moving forward to the ns and wraps the whole of the portion that tests the base 1 representations in a negative look-ahead. Apart from that it work in the same way as the one below.
As an example 1111 1011 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn1
will match 1011
.
I have managed to get something I think works (checked to 25) and it matches non-primes. Here it is for base 2 (easier to explain) ^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+\s(n(?=\3))*+\K(..?1|(..+?)\6+1)
this can be expanded up to base n, but this expands the regex very quickly. To get this regex to work I need a couple of perquisites (A bit hacky), the input string must be the number followed by a space followed by at least as many n characters as the value you of your number (if you had the number 10 you need at leat 10 ns after it) followed by the digits of your base in order excluding your 0 digit (e.g for base 10 123456789), not including your 0. For example 11 nnnnnnnnnnnnnn1
. This is due to the fact that the regexes have no accessible storage so I need to use capturing groups to do this. Finally this regex uses /x to ignore whitespaces in the expression, strip out all the space if you don't want to use this.
I will now explain how this works in 3 steps. This regex works in 3 parts:
Part 1:this part changes a base n > 1 to base 1 as a capturing group of ns
This is the part ^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+
it works very similarly to the a^nb^n
example in the question. The ^ at the front means that the full match has to start at the beginning this is important for later.
The main structure of this code is ^((\d)(?= \d*\s (suff)))+
This takes each decimal between the start and the first space and performs a positive look-ahead using (\d)(?=) where \d is a decimal and (?=) is a look-ahead the \d is in a capturing group ()
for later on. It is the digit we are currently looking at.
The inside of the look-ahead is not actually to check a condition ahead but instead to build up a capturing group representing our number in base 1. The inside of the capturing group looks like this
\d*\s(\3{0,2}n(?=.*\2)|\3{0,2}))
The part \d*\s basically moves the characters we are looking at past the rest of the remaining digits \d* (\d is digit and * is 0 to n as many times as possible) this now leaves us looking at the start of the ns.
(\3{0,2}n(?=.*\2)|\3{0,2}))
is a self referencing capturing group here is where the need for the digits you have put at the end comes in. This group matches itself 0 to 2 times but as many times as possible (using \3{0,2} with \3 meaning caturing group 3 and {0,2} meaning match from 0 to 2 times) this means that if there is a number before the current digit its base 1 representation is multiplied by 2. This would be 10 for base 10 or 16 for base 16. If this is the first digit the group will be undefined so it will match 0 times. It then either adds a single n or no n based on matching the digit we are currently working on (using its capturing group). It does this by using a positive look ahead to look to the end of the input where we put the digits, n(?=.*\2) this matches n if it can find anything followed by the digit we are working on. This allows it to identify what digit it is we are working on at this point. (I would have used a look behind but they are fixed length) If you had base 3 and wanted to check if the digit you are currently working on is 2 you would use nn(?=.*1\2) this would match nn only if the digit was two. We have used an or operator '|' for all of these and if no digit is found we assume it is 0 and add no ns. As this is in the capturing group this matching is then saved in the group.
In summary of this part what we do is take each digit look ahead take the base 1 representation of the previous digits (saved in the capturing group )and multiply it by the base then match it, then add to it the base one representation of the digit and save it in the group. If you do this for each digit in turn you will get a base one representation of the number. Lets look at and example. 101 nnnnnnnnnnnnnnnnn1
First it goes to the sat because of ^. so :101 nnnnnnnnnnnnnnnnn1
Then it goes to the first digit and saves it in a capturing grope 101 nnnnnnnnnnnnnnnnn1
Group2 : 1
It uses a look-ahead using \d*\s to go past all the digits and the first space. 101 nnnnnnnnnnnnnnnnn1
It is now inside capturing group 3
It takes this caputing group's previous value and matches it 0 to 2 times
As it is undefined it matches 0 time.
It now looks ahead again to try to find a digit matching the digit in capturing group 2 101 nnnnnnnnnnnnnnnnn1
as it has been found it matches 1 n in capturing group 3 2 101 nnnnnnnnnnnnnnnnn1
It now leaves group 3, updating its value and leaves the look ahead Group 3 = n
It now looks at the next digit and saves that in a capturing group 101 nnnnnnnnnnnnnnnnn1
group 2 = 0
group 3 = n
It then also uses a look-ahead and goes to the first n 101 nnnnnnnnnnnnnnnnn1
It then matches group 3 0 to 2 time but as many as possible so n 101 nnnnnnnnnnnnnnnnn1
It then uses a look-ahead to try to match the digit in group 2 which it can do so it adds no ns, befor returning out of group 3 and the look-ahead
group3 = nn
It now looks at the next digit and saves it in group 2 101 nnnnnnnnnnnnnnnnn1 Using a look-ahead it goes to the start of the ns and matches 2 times group 3 101 nnnnnnnnnnnnnnnnn1 It then uses a look-ahead to try to match the digit in group 2 it finds it so matches a n and returns out of group 3 and the look-ahead. group3 = nnnnn Group 3 now contains the base 1 representation of our number.
Part 2 Reduces the ns to the size of the base 1 representation of your number
\s(n(?=\3))*+\K
This matches the space and then matches ns for as long as you can match group 3 (the base one representation of your number) in front. It does this by matching n as many times as possible using a *+ which is possessive (it never lets go of a matching this is to stop the matching from being shrunk later to make a match work) n has a posive look-ahead n(?=\3) which means n will be matched as long as there is a group 3 ahead of it (\3 gives capturing group 3). This leaves us with our base 1 representation and digits being the only thing left unmatched. We then us \K to say start the matching again from here.
Part3 We now use the same algorithm mentioned in the question to get primes apart from we force it not to match between the start of the base on representation and the start of the digits. You can read how that works Here How to determine if a number is a prime with regex?
Finally to make this into a base n regex you have to do a few things
^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+\s(n(?=\3))*+\K(..?1|(..+?)\6+1)
fist add some more digits at the end of your input string then change the n
?=.*\2 to n?=.*n\2 | n?=.*1\2 n?=.*3\2 .., n?=.***n**\2
Finally change the \3{0,2} to \3{0,n}. where n is the base. Also remember that this will not work without the correct input string.