Tags: algorithm, data-structures, segment-tree, fenwick-tree

Is it possible to query the number of distinct integers in a range in O(lg N)?


I have read through some tutorials about two common data structures which can achieve range update and range query in O(lg N): the segment tree and the Binary Indexed Tree (BIT / Fenwick tree).

Most of the examples I have found are about associative and commutative operations such as "sum of the integers in a range", "XOR of the integers in a range", etc.

I wonder whether these two data structures (or any other data structure / algorithm, please propose) can answer the query below in O(lg N). (If not, how about O(sqrt N)?)

Given an array of integers A, query the number of distinct integers in a range [l,r].

PS: Assume the number of possible integer values is ~10^5, so approaches like used[color] = true or a bitmask are not feasible.

For example: A = [1,2,3,2,4,3,1], query([2,5]) = 3, where the range indices are 0-based.


Solution

  • Yes, this can be done in O(log n) per query, even if the queries must be answered online. However, it requires some rather complex techniques.

    First, let's solve the following problem: given an array, answer queries of the form "how many numbers <= x are there within indices [l, r]". This is done with a segment-tree-like structure which is sometimes called a Merge Sort Tree. It is basically a segment tree where each node stores a sorted copy of its subarray. This structure requires O(n log n) memory (because there are log n layers and each of them stores n numbers). It is built in O(n log n) time as well: you go bottom-up and, for each inner vertex, merge the sorted lists of its children.

    Here is an example. Let the original array be 1 5 2 6 8 4 7 1.

    |1 1 2 4 5 6 7 8|
    |1 2 5 6|1 4 7 8|
    |1 5|2 6|4 8|1 7|
    |1|5|2|6|8|4|7|1|
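
    The bottom-up construction above could be sketched in Python roughly as follows. The function name build_merge_sort_tree, the flat array layout (node i has children 2i and 2i+1) and the padding to a power of two are illustrative choices of this sketch, not something the description prescribes.

```python
from heapq import merge


def build_merge_sort_tree(a):
    """Return (tree, size); tree[node] is the sorted subarray covered by node."""
    n = len(a)
    size = 1
    while size < n:          # pad the leaf level to a power of two
        size *= 2
    tree = [[] for _ in range(2 * size)]
    for i, x in enumerate(a):
        tree[size + i] = [x]  # leaves hold single elements
    for node in range(size - 1, 0, -1):
        # merge the two sorted children in linear time
        tree[node] = list(merge(tree[2 * node], tree[2 * node + 1]))
    return tree, size


tree, size = build_merge_sort_tree([1, 5, 2, 6, 8, 4, 7, 1])
print(tree[1])           # root layer of the diagram
print(tree[2], tree[3])  # second layer
```

    For the example array, tree[1] is the root row of the diagram and tree[2], tree[3] are the second row.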
    

    Now you can answer such queries in O(log^2 n) time: make a regular segment tree query (visiting O(log n) nodes) and, in each visited node, run a binary search to count how many numbers <= x it contains (an additional O(log n) factor).

    This can be sped up to O(log n) using the Fractional Cascading technique, which basically lets you do the binary search only in the root instead of in every node. However, it is too complex to describe in this post.
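
    A minimal, self-contained sketch of the O(log^2 n) query, without fractional cascading. The names build_merge_sort_tree and count_le, the 0-based inclusive indices and the iterative bottom-up traversal are assumptions of this sketch; a recursive top-down query would work just as well.

```python
from bisect import bisect_right
from heapq import merge


def build_merge_sort_tree(a):
    """Flat segment tree; tree[node] is the sorted subarray covered by node."""
    size = 1
    while size < len(a):
        size *= 2
    tree = [[] for _ in range(2 * size)]
    for i, x in enumerate(a):
        tree[size + i] = [x]
    for node in range(size - 1, 0, -1):
        tree[node] = list(merge(tree[2 * node], tree[2 * node + 1]))
    return tree, size


def count_le(tree, size, l, r, x):
    """Count elements <= x among indices l..r (0-based, inclusive)."""
    result = 0
    lo, hi = l + size, r + size + 1  # half-open range over the leaves
    while lo < hi:
        if lo & 1:                   # lo is a right child: take it whole
            result += bisect_right(tree[lo], x)
            lo += 1
        if hi & 1:                   # hi - 1 is a left child: take it whole
            hi -= 1
            result += bisect_right(tree[hi], x)
        lo >>= 1
        hi >>= 1
    return result


tree, size = build_merge_sort_tree([1, 5, 2, 6, 8, 4, 7, 1])
print(count_le(tree, size, 1, 5, 5))  # 3: the values 5, 2 and 4
```

    Each query touches O(log n) nodes and does one binary search per node, which gives the O(log^2 n) bound.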

    Now we return to the original problem. Assume you have an array a_1, ..., a_n. Build another array b_1, ..., b_n, where b_i = index of the next occurrence of a_i in the array, or ∞ if it is the last occurrence.

    Example (1-indexed):

    a = 1 3 1 2 2 1 4 1
    b = 3 ∞ 6 5 ∞ 8 ∞ ∞
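
    One way to compute b in O(n) is a single right-to-left scan that remembers where each value was last seen. This is only a sketch: the name next_occurrence is made up here, and it uses 0-based indices, so its output is the example's b shifted down by one.

```python
import math


def next_occurrence(a):
    """b[i] = index of the next occurrence of a[i], or math.inf if none."""
    b = [math.inf] * len(a)
    seen = {}  # value -> smallest index to the right where it occurs
    for i in range(len(a) - 1, -1, -1):
        if a[i] in seen:
            b[i] = seen[a[i]]
        seen[a[i]] = i
    return b


b = next_occurrence([1, 3, 1, 2, 2, 1, 4, 1])
# Add 1 to the finite entries to compare with the 1-indexed example.
print([x + 1 for x in b])
```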
    

    Now let's count the distinct numbers in [l, r]. For each distinct number, we count only its last occurrence in the segment. With the b_i notation, the occurrence at index i in [l, r] is the last one in the segment if and only if b_i > r. So the problem boils down to "how many values b_i > r are there for i in [l, r]", which reduces trivially to what I described above: it is (r - l + 1) minus the number of b_i <= r in that range.
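
    Putting the pieces together, here is a hedged end-to-end sketch of the O(log^2 n) version (no fractional cascading). The class name DistinctCounter, the 0-based inclusive indices and the use of math.inf for ∞ are assumptions of this sketch.

```python
import math
from bisect import bisect_right
from heapq import merge


class DistinctCounter:
    """Merge sort tree over b, where b[i] is the next occurrence of a[i]."""

    def __init__(self, a):
        n = len(a)
        b = [math.inf] * n
        seen = {}
        for i in range(n - 1, -1, -1):       # right-to-left scan builds b
            b[i] = seen.get(a[i], math.inf)
            seen[a[i]] = i
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.tree = [[] for _ in range(2 * self.size)]
        for i, v in enumerate(b):
            self.tree[self.size + i] = [v]
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = list(
                merge(self.tree[2 * node], self.tree[2 * node + 1]))

    def query(self, l, r):
        """Number of distinct values among indices l..r (0-based, inclusive)."""
        total = 0
        lo, hi = l + self.size, r + self.size + 1
        while lo < hi:
            if lo & 1:
                node = self.tree[lo]
                total += len(node) - bisect_right(node, r)  # entries > r
                lo += 1
            if hi & 1:
                hi -= 1
                node = self.tree[hi]
                total += len(node) - bisect_right(node, r)
            lo >>= 1
            hi >>= 1
        return total


counter = DistinctCounter([1, 2, 3, 2, 4, 3, 1])
print(counter.query(2, 5))  # 3, matching the example in the question
```

    Counting entries greater than r directly in each node is equivalent to subtracting the "<= r" count from the segment length.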

    Hope it helps.