Context: I'm trying to port some Perl code to Python from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl#L87 and there is this regex in the Perl:
s/(\d) (\d)/$1.$2/g;
If I try this regex in Perl on the input text 123 45, it returns the same string with the space replaced by a dot. As a sanity check, I've tried it on the command line too:
echo "123 45" | perl -pe 's/(\d) (\d)/$1.$2/g;'
[out]:
123.45
It does the same when I convert the regex to Python:
>>> import re
>>> r, s = r'(\d) (\d)', r'\g<1>.\g<2>'
>>> print(re.sub(r, s, '123 45'))
123.45
But when I use the Moses script:
$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/normalize-punctuation.perl
--2019-03-19 12:33:09-- https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/normalize-punctuation.perl
Resolving raw.githubusercontent.com... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 905 [text/plain]
Saving to: 'normalize-punctuation.perl'
normalize-punctuation.perl 100%[================================================>] 905 --.-KB/s in 0s
2019-03-19 12:33:09 (8.72 MB/s) - 'normalize-punctuation.perl' saved [1912]
$ echo "123 45" > foobar
$ perl normalize-punctuation.perl < foobar
123 45
Even when we try to print the string before and after the regex in the Moses code, i.e.
if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") {
s/(\d) (\d)/$1,$2/g;
}
else {
print $_;
s/(\d) (\d)/$1.$2/g;
print $_;
}
[out]:
123 45
123 45
123 45
We see that before and after the regex, there's no change in the string.
My question, in parts:
Is Python's \g<1>.\g<2> replacement equivalent to Perl's $1.$2?
Why doesn't the Perl regex insert the full stop . between the two digit groups in Moses?
The code from Moses doesn't match your input because its pattern searches for a non-breaking space, not an ordinary space. This is not easy to see by eye, but hexdump can help:
fe-laptop-p:moose fe$ head -n87 normalize-punctuation.perl | tail -n1 | hexdump -C
00000000 09 73 2f 28 5c 64 29 c2 a0 28 5c 64 29 2f 24 31 |.s/(\d)..(\d)/$1|
00000010 2e 24 32 2f 67 3b 0a |.$2/g;.|
00000017
fe-laptop-p:moose fe$ head -n87 normalize-punctuation.perl.with_space | tail -n1 | hexdump -C
00000000 09 73 2f 28 5c 64 29 20 28 5c 64 29 2f 24 31 2e |.s/(\d) (\d)/$1.|
00000010 24 32 2f 67 3b 0a |$2/g;.|
00000016
See the difference: c2 a0 (the UTF-8 encoding of U+00A0, the non-breaking space) vs 20 (an ordinary space).
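If hexdump is not at hand, the same check can be sketched in Python. The helper name describe_non_ascii is made up for illustration, and the two sample strings are typed out from the hexdump above rather than read from the actual file:

```python
import unicodedata


def describe_non_ascii(line):
    """Return (index, code point, Unicode name, UTF-8 bytes) for each non-ASCII character."""
    found = []
    for index, char in enumerate(line):
        if ord(char) > 127:
            utf8_hex = " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))
            found.append((index, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN"), utf8_hex))
    return found


# Assumption: these reproduce line 87 of the script with and without the non-breaking space.
moses_line = "\ts/(\\d)\u00a0(\\d)/$1.$2/g;\n"
plain_line = "\ts/(\\d) (\\d)/$1.$2/g;\n"

print(describe_non_ascii(moses_line))  # reports the NO-BREAK SPACE and its bytes c2 a0
print(describe_non_ascii(plain_line))  # empty list: the line is pure ASCII
```

An empty list means the line contains only ASCII, so any entry it reports is a candidate for an invisible character like this one.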
P.S. As for the comments about adding a plus sign to the regex: it is not needed here, since it is enough to put a dot between two adjacent digits; there is no need to match the full numbers.
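To reproduce the Moses behaviour in Python, here is a minimal sketch, assuming the goal is to match only U+00A0 between two digits, as the Perl source does. The names NBSP, DIGIT_NBSP_DIGIT and normalize_number_separator are hypothetical and not taken from Moses:

```python
import re

NBSP = "\u00a0"  # non-breaking space, bytes c2 a0 in UTF-8

# Same shape as the Perl pattern: one digit, a non-breaking space, one digit.
DIGIT_NBSP_DIGIT = re.compile(r"(\d)" + NBSP + r"(\d)")


def normalize_number_separator(text, separator="."):
    """Replace a non-breaking space between two digits with `separator`."""
    return DIGIT_NBSP_DIGIT.sub(r"\g<1>" + separator + r"\g<2>", text)


print(normalize_number_separator("123 45"))            # ordinary space: left unchanged
print(normalize_number_separator("123\u00a045"))       # non-breaking space: becomes 123.45
print(normalize_number_separator("123\u00a045", ","))  # comma variant, as in the de/es/cz/cs/fr branch
```

Writing the pattern with an explicit "\u00a0" escape rather than pasting the literal character keeps the non-breaking space visible in the source, which avoids the confusion that prompted this question.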