javaregex

Removing unquoted substrings from input string


Something got me wondering today, is there a regex that matches [ADD_CONDITION] in this string

DELETE FROM solution_vac.refuse_diseases WHERE refuse_id = 123 [ADD_CONDITION]

but doesn't match anything at all in this string

select regexp_replace('abc123', '[^[:digit:]]', '') from dual

? Now, let me describe my problem

We have SQL strings with placeholders. We may replace them with subqueries, extra fields, etc. Sometimes, we don't need to do that. In that case, we need to write pretty verbose code to remove those placeholders. I didn't like that so I wrote this neat method (javadoc's omitted):

    private static String removePlaceholders(String sql) {
        if (StringUtils.isEmpty(sql)) return null;
        String placeholderPattern = "\\[.*]";
        return sql.replaceAll(placeholderPattern, "");
    }

It landed me in something of an oopsy-daisy today. I thought I thought it all through. We don't plan to migrate to Microsoft's DBs where square brackets are supported, it's all good, I figured. However, my pattern matched quoted brackets as well. So essential parts of SQL queries were removed too (see the example above)

Is there a concise way to make sure only the placeholders are removed, never quoted square brackets? If so, how? I guess the key is the number of quotes before the bracket (if it's odd, leave the brackets alone). But I couldn't think of any regex that took that into account

You're free to write code too, if it cannot be avoided, not just regex

Java 8. Apache Commons is also available

<!-- this one -->
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

import java.sql.SQLException;

public class QueryTest {

    @Test
    void queryParserSquareBracketsAsPlaceholder() {
        String sql = "DELETE FROM solution_vac.refuse_diseases\n" +
                " WHERE refuse_id = 123\n" +
                "[ADD_CONDITION]";

        String expected = "DELETE FROM solution_vac.refuse_diseases\n" +
                " WHERE refuse_id = 123";

        assertQuery(expected, sql);
    }

    @Test
    void queryParserSquareBracketsInLiteral()  {
        String sql = "select regexp_replace('abc123', '[^[:digit:]]', '') from dual";

        String expected = sql;

        assertQuery(expected, sql);
    }

    @ParameterizedTest
    @CsvSource(value = {
            "sql01 [TEST1][TEST2], sql01",
            "sql02 [], sql02",
            "[A1][A2] sql, sql",
            "[A1] sql [A2] sql2 [A3], sql  sql2",
            "sql03 [ 'TEST''  ], sql03",
            "sql04 /*[TEST*/ sql2 [TEST2], sql04  sql2",
            "sql05 :ID '[TEST1]' [TEST2], sql05 NULL '[TEST1]'",
            "sql06 [TEST1 sql2, sql06 [TEST1 sql2",
            "sql07 'str [TEST1], sql07 'str"

    })
    void queryParserSquareBracketsSpecialCases(String input, String expectedOutput) {
        assertQuery(expectedOutput, input);
    }

    private void assertQuery(String sqlExpected, String sqlInput) {
            String sqlParsed = removePlaceholders(sqlInput);

            Assertions.assertEquals(sqlExpected, sqlParsed);
    }

    private static String removePlaceholders(String sqlString) {
        // your implementation's here...
    }
}

Solution

  • The common approach is to also match quoted strings, but reproduce those in the replace argument with a back reference ($1). In case the match was a bracketed expression, that back reference will be the empty string.

    Like this:

            String placeholderPattern = "('.*?')|\\[[^\\[]*?]";
            return sql.replaceAll(placeholderPattern, "$1");
    

    Note that you have an error in the first test case, where expected is missing a final \n -- that character is not part of a bracketed expression, so it should not be expected to be removed. Similarly, in other test cases you have white space differences, like test case "sql02 [], sql02" should probably better be defined as "sql02 [], sql02 " (with trailing space).

    Other requirements

    Some test cases seem to express requirements that are not the topic of your question:

    I see a test case where apparently you want /* */ sequences to be removed. You can extend the above regex to omit those as well:

            String placeholderPattern = "/\\*.*?\\*/|('.*?')|\\[[^\\[]*?]";
    

    There is also a test case where you remove the occurrence of :ID. That seems a task for binding variables, which should be done by whatever SQL API you use.