javaalgorithmsubstring

How to find the longest substring with equal amount of characters efficiently


I have a string that consists of characters A,B,C and D and I am trying to calculate the length of the longest substring that has an equal amount of each one of these characters in any order. For example ABCDB would return 4, ABCC 0 and ADDBCCBA 8.

My code currently:

    public int longestSubstring(String word) {
        HashMap<Integer, String> map = new HashMap<Integer, String>();

        for (int i = 0; i<word.length()-3; i++) {
            map.put(i, word.substring(i, i+4));
        }

        StringBuilder sb;   

        int longest = 0;

        for (int i = 0; i<map.size(); i++) {
            sb = new StringBuilder();
            sb.append(map.get(i));

            int a = 4;

            while (i<map.size()-a) {
                sb.append(map.get(i+a));
                a+= 4;
            }

            String substring = sb.toString();

            if (equalAmountOfCharacters(substring)) {
                int length = substring.length();
                if (length > longest)
                    longest = length;
            }

        }

    return longest;
}

This currently works pretty well if the string length is 10^4 but I'm trying to make it 10^5. Any tips or suggestions would be appreciated.


Solution

    1. Let's assume that cnt(c, i) is the number of occurrences of the character c in the prefix of length i.

    2. A substring (low, high] has an equal amount of two characters a and b iff cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low), or, put it another way, cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low). Thus, each position is described by a value of cnt(b, i) - cnt(a, i). Now we can generalize it for more that two characters: each position is described by a tuple (cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i)), where a_1 ... a_k is the alphabet.

    3. We can iterate over the given string and maintain the current tuple. At each step, we should update the answer by checking the value of i - first_occurrence(current_tuple), where first_occurrence is a hash table that stores the first occurrence of each tuple seen so far. Do not forget to put a tuple of zeros to the hash map before iteration(it corresponds to an empty prefix).