I'm working on a Kotlin project that needs to process some phone number data and convert it into a format suitable for a third-party API to consume. The requirements are:
It sounds simple, and in most cases it is. However, there is one edge case I'm not sure how to solve. If a phone number has one or more unwanted characters before the '+' character, how do I filter those out without losing the '+' as well? I could do the operation in two passes, but this is a time-sensitive function that will need to run on thousands of records. Is there a more efficient way to solve this?
My current function:
fun String.convertPhoneNumber(): String
{
    //trim leading/trailing space
    return this.trim().filterIndexed { index, value ->
        //trim any non-digit characters except for a leading '+'
        (index == 0 && value == '+') || value.isDigit()
    }
}
Test code:
@Test
fun `test convert phone number`()
{
    //international numbers
    val expectedInt = "+15705554444"
    val phone7 = "+15705554444"
    assertThat(phone7.convertPhoneNumber()).isEqualTo(expectedInt)
    val phone8 = " + 1 570-555 4444"
    assertThat(phone8.convertPhoneNumber()).isEqualTo(expectedInt)
    //this test will fail
    val phone9 = " (+1)57055554444"
    assertThat(phone9.convertPhoneNumber()).isEqualTo(expectedInt)
}
This regex pattern keeps the first + (if it is located before any digits), retains all digits, and removes all other non-digit characters, including + characters in between digits.
REGEX PATTERN (Java 8 regex flavor, flags: g):
(?:^[^\d+]*(\+))|\D
REPLACEMENT STRING:
$1
Regex Demo: https://regex101.com/r/y7esly/10
(Note: In the multiline test-string demo, I added an alternative for the newline character (\n) to the pattern, and $2 to the replacement (substitution) string, to keep the lines separate. The \n alternative and $2 are removed from the actual answer, where we are not dealing with multiline input.)
TRY:
// Syntax from a comment on the question by @Wiktor Stribiżew:
return this.replace("""(?:^[^\d+]*(\+))|\D""".toRegex(), "$1")
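Since the question mentions thousands of records, it may be worth compiling the pattern once rather than on every call. The following is only a sketch under that assumption: the name PHONE_CLEANUP is hypothetical, and I have not benchmarked it against the one-liner above. Note that trim() is no longer needed, because leading and trailing whitespace is already matched by the pattern.

```kotlin
// Hypothetical name (not from the question); compiled once and reused for every record.
private val PHONE_CLEANUP = Regex("""(?:^[^\d+]*(\+))|\D""")

// Keeps a '+' that precedes every digit, keeps all digits, drops everything else.
fun String.convertPhoneNumber(): String = PHONE_CLEANUP.replace(this, "$1")

fun main() {
    println(" (+1)570-555 4444".convertPhoneNumber()) // prints +15705554444
}
```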
TEST STRINGS:
+15705554444
+15705554444
+ 1 570-555 4444
(+1)57055554444
(+1)570555-54444
(1)57055554444
(1)5705555+4444
(1)+5705555+4444
(+1)(570)555 5+4444
1570+555+4444
15705554444+9
RETURNS:
+15705554444
+15705554444
+15705554444
+157055554444
+157055554444
157055554444
157055554444
157055554444
+157055554444
15705554444
157055544449
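To reproduce a few rows of the table above locally, something like the following sketch could be used. The helper name cleanPhone is hypothetical, and the input/expected pairs are copied from the TEST STRINGS and RETURNS lists.

```kotlin
// Hypothetical helper wrapping the pattern and replacement from this answer.
fun cleanPhone(raw: String): String =
    Regex("""(?:^[^\d+]*(\+))|\D""").replace(raw, "$1")

fun main() {
    // Input to expected output, taken from the lists above.
    val cases = listOf(
        "+ 1 570-555 4444" to "+15705554444",
        "(+1)57055554444" to "+157055554444",
        "(1)+5705555+4444" to "157055554444",
        "15705554444+9" to "157055544449"
    )
    for ((input, expected) in cases) {
        val actual = cleanPhone(input)
        check(actual == expected) { "'$input' gave '$actual', expected '$expected'" }
    }
    println("all cases match")
}
```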
REGEX NOTES:
(?:
Begin the first alternative. The non-capturing group (?:...) is used to explicitly delimit the alternative on the left side of the OR (|).
^
Match the beginning of the string.
[^\d+]*
Negated character class [^...]. Matches any character that is not a digit (\d) or a literal +, 0 or more times (*).
(\+)
First capture group (...). The character(s) in this group are retrieved with $1 in the replacement string. Matches a literal + (+ is a special character in regex, so it should be escaped outside a character class: \+).
)
Close the first alternative.
|
OR in the alternation ...|... The regex engine tries to match the alternatives from left to right.
\D
The second alternative. Matches any character that is NOT a digit (\d). (Note: \D matches a newline character. This is why in the regex demo (link) I added |\n to the pattern and $2 to the replacement string, to keep the lines separate for clarity.)
REPLACEMENT STRING:
$1
All other matched characters are replaced with nothing, i.e. deleted from the string. Digits (\d) are never matched by the pattern, so they are left untouched.
First alternative, $1: If there is a match for the first alternative, all matched non-digit, non-plus characters before the first + are replaced with nothing (i.e. they are effectively deleted from the string). Only the + is kept, because the character matched by the first capture group, (\+), is contained in group 1, $1. If the first alternative does not match, the first capture group $1 represents an empty string, "".
Second alternative, \D: If there is a match for \D, replace it with nothing. This effectively removes any non-digit characters from the string, except the first + located before any digits.
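To see the two alternatives at work, the following sketch lists every match together with the contents of group 1. The function name traceMatches and the sample input are made up for illustration. Kotlin's groupValues reports an empty string for a group that did not take part in the match, which is the same value $1 contributes to the replacement in that case.

```kotlin
// Hypothetical helper: returns (matched text, group 1 text) for each match.
fun traceMatches(input: String): List<Pair<String, String>> =
    Regex("""(?:^[^\d+]*(\+))|\D""")
        .findAll(input)
        .map { it.value to it.groupValues[1] }
        .toList()

fun main() {
    for ((matched, group1) in traceMatches("(+1)570-555")) {
        println("matched '$matched', group 1 = '$group1'")
    }
    // matched '(+', group 1 = '+'   (first alternative: replaced with "+")
    // matched ')', group 1 = ''     (second alternative: deleted)
    // matched '-', group 1 = ''     (second alternative: deleted)
}
```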