algorithmsetsubsetmax-size

Find max subset of a huge set of integers


I have a huge set (S) of long unsigned integers in a .txt file. How can I find the max subset (Pmax) of S with the following property:

P{X1,X2,X3,...,Xn) | X1>=(Xn/4)

More details:

  1. When I say max subset I mean the subset with the greatest number of elements(n->max).
  2. I can't load .txt into an array because of limited memory.
  3. My system memory is 200MB
  4. txt file has 10^6 integers. Each integer can be long unsigned 32bit.
  5. I need to find the biggest subset of S with the condition:

X1 < X2 < X3 < ... < Xn-1 < Xn such as X1 >= (XN/4)

For example if the txt file has the following: 15,14,13,4,2,2,3,10,1,2,2 then these are the possible subsets:

P1(4,10,13,14,15)

P2(3,4,10)

P3(1,2,2,2,2,3,4)

so Pmax(1,2,2,2,2,3,4) because it has more elements.

In fact I don't want to find exactly which is the Pmax. I just want to find the number of elements of the subset Pmax. So here it is 7.

The algorithm should be really fast.

I don't look for someone to do my work. I just need a corresponding problem so I can look for the efficient solution. Thanks in advance!!!


Solution

  • The easiest solution is:

    1. Sort the list first (Complexity O(nlogn)
    2. With a moving window, find the largest acceptable window. (Complexity O(n))

    Complexity: O(nlogn).

    More details about step2:

    Let low keep track of the lowest element and high the highest element.

    Initialization: Set low to the first element. Do a binary search for 4*x[low], and that is your high location. Set maxWindow=high-low+1.

    At every step: Increment high by 1, and increment low such that x[low]>=x[high]. Calculate number of elements = high-low+1, and update maxWindow accordingly.