java regex string kotlin replace

Efficient way to remove unwanted characters from a string in Kotlin


I'm working on a Kotlin project that needs to process phone number data and convert it into a format suitable for a third-party API to consume. The requirements are:

  1. No non-digit characters except in the case of international numbers
  2. International numbers may start with a '+' character

It sounds simple, and in most cases it is. However, there is one edge case I'm not sure how to solve. If a phone number has one or more unwanted characters before the '+' character, how do I filter those out without losing the '+' as well? I could do the operation in two passes, but this is a time-sensitive function that will need to run on thousands of records. Is there a more efficient way to solve this?

My current function:

fun String.convertPhoneNumber():String
{
    //trim leading/trailing space
    return this.trim().filterIndexed{ index, value ->
        //trim any non-digit characters except for a leading '+'
        (index == 0 && value == '+') || value.isDigit()
    }
}

Test code:

@Test
fun `test convert phone number`()
{
        //international numbers
        val expectedInt = "+15705554444"
        val phone7 = "+15705554444"
        assertThat(phone7.convertPhoneNumber()).isEqualTo(expectedInt)

        val phone8 = " + 1 570-555 4444"
        assertThat(phone8.convertPhoneNumber()).isEqualTo(expectedInt)

        //this test will fail: the '+' is not at index 0, so it is filtered out
        val phone9 = " (+1)5705554444"
        assertThat(phone9.convertPhoneNumber()).isEqualTo(expectedInt)
}

Solution

  • This regex pattern keeps a leading + (one preceded only by characters that are neither digits nor +), retains all digits, and removes every other character, including any + that appears after a digit.

    REGEX PATTERN (Java 8 regex flavor, flags: g):

    (?:^[^\d+]*(\+))|\D
    

    REPLACEMENT STRING:

    $1
    

    Regex Demo: https://regex101.com/r/y7esly/10 (Note: the demo uses a multiline test string, so a newline alternative (|\n) was added to the pattern and $2 to the replacement (substitution) string to keep the lines separate. Both |\n and $2 are left out of the pattern shown here, since the actual input is not multiline.)

    TRY:

    // Syntax from a comment on the question by @Wiktor Stribiżew:
    return this.replace("""(?:^[^\d+]*(\+))|\D""".toRegex(), "$1")
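
Written out as a complete extension function, the replacement might look like the sketch below. It assumes the JVM target, where an unmatched group 1 is substituted as an empty string, and it hoists the compiled Regex into a top-level val so it is not rebuilt for every record. The name PHONE_CLEANUP is an assumption made for this example, not something from the question.

```kotlin
// Compiled once and reused across calls; PHONE_CLEANUP is a hypothetical name.
val PHONE_CLEANUP = Regex("""(?:^[^\d+]*(\+))|\D""")

fun String.convertPhoneNumber(): String =
    // First alternative: junk at the start followed by '+', replaced by the captured '+'.
    // Second alternative: any other non-digit, replaced by nothing because group 1 is unset.
    PHONE_CLEANUP.replace(this, "$1")
```

No separate trim() is needed here, because leading and trailing whitespace is removed by the same pattern.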
    

    TEST STRINGS:

    +15705554444
    +15705554444
    + 1 570-555 4444
    (+1)57055554444
    (+1)570555-54444
    (1)57055554444
    (1)5705555+4444
    (1)+5705555+4444
    (+1)(570)555 5+4444
    1570+555+4444
    15705554444+9
    

    RETURNS:

    +15705554444
    +15705554444
    +15705554444
    +157055554444
    +157055554444
    157055554444
    157055554444
    157055554444
    +157055554444
    15705554444
    157055544449
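
For comparison, the same rule can probably be expressed without a regex as a single pass over the characters. This is only a sketch under the assumption that "digit" means ASCII 0-9 (which is what \d matches by default in Java regex); the function name convertPhoneNumberSinglePass is made up for this example, and whether it is actually faster would need to be measured on real data.

```kotlin
// Hypothetical regex-free variant of the same rule; the name is an assumption.
fun String.convertPhoneNumberSinglePass(): String {
    val out = StringBuilder(length)
    for (c in this) {
        when {
            // Keep every ASCII digit.
            c in '0'..'9' -> out.append(c)
            // Keep a '+' only while nothing has been kept yet,
            // i.e. it precedes all digits and any other '+'.
            c == '+' && out.isEmpty() -> out.append(c)
            // Everything else is dropped.
        }
    }
    return out.toString()
}
```

Run against the test strings above, this should produce the same results as the regex, for example "(+1)(570)555 5+4444" becomes "+157055554444".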
    

    REGEX NOTES:

    REPLACEMENT STRING: