javaarraylistsampling

How to consistently retain an exact percentage of elements from a List in Java?


I'm currently filtering a Java List to keep approximately a percentage of its elements using a random approach:

import java.util.List;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        List<String> list = List.of("1", "2", "3", "4", "5", "6", "7", "8", "9", "10");

        // Percentage of elements to retain (between 0 and 100)
        int percentageToRetain = 30; // Change this value to adjust the percentage

        Random random = new Random();
        List<String> result = list.stream()
                .filter(s -> random.nextInt(100) < percentageToRetain)
                .toList();

        System.out.println(result);
    }
}

While this works for approximating the percentage, each run produces different results, and the actual percentage retained can vary significantly from my target, especially with smaller lists.

For example: When targeting 30% of a 10-element list, I might get 2 elements (20%) on one run and 4 elements (40%) on another.

My question: How can I modify my code to consistently retain exactly the specified percentage of elements from a List? Ideally, I'd like a solution that:

How to handle this kind of "exact percentage sampling" problem?


Solution

  • Your Approach

    @Marce Puente correctly points outin the question comments that retaining an exact percentage of elements is impossible, at least for smaller list sizes.

    the idea of “exact percentage” is already wrong, imagine such a list: [ 10, 5, 1 ] and you want to keep it to *50%, what would you put, an item and a half, no, you have to set a tolerance, which will be determined by the number of item's.

    For larger input sizes the rounding error decreases.

    Another problem, as the busybee pointed out, is that discarding elements with a chance is not guranteed to return a stable amount of elements.

    To illustrate that, here are 10 sets of random numbers between 0 and 100 (I noted the number of elements that would be retained given percentageToRetain = 30 in brackets):

    0 64 93 11 77 13 82 87 8 68 (4 retains)
    77 60 39 97 64 46 33 42 87 79 (0 retains)
    26 91 0 32 49 31 15 7 2 7 (5 retains)
    95 90 38 10 70 37 32 17 82 98 (2 retains)
    11 43 96 58 10 40 30 4 40 54 (3 retains)
    2 66 86 87 17 88 92 64 10 79 (4 retains)
    3 86 75 51 30 37 2 96 49 89 (2 retains)
    9 10 23 84 52 28 32 14 12 33 (5 retains)
    53 91 54 66 28 61 67 35 45 70 (1 retains)
    3 77 79 24 37 60 6 80 8 39 (4 retains)
    

    My suggestion

    1. calculate the.number of elements you wish to retain: int numberOfElementsToRetain = (int) (source.size() * percentageToRetain / 100.0);
    2. make a copy of the input list to a known mutable type (you can skip this step if the input is guranteed to be mutable & you don't mind ruining it)
    3. shuffle it
    4. copy the first numberOfElementsToRetain elements to a new list & return that.

    Thus we get:

    public static <E> List<E> retainNumberOfElementsRandomly2(List<E> source, int percentageToRetain){
        int numberOfElementsToRetain = (int) (source.size() * percentageToRetain / 100.0);
        ArrayList<E> defcopy = new ArrayList<>(source);
        
        Random r = new Random();
        
        Collections.shuffle(defcopy, r);
    
        List<E> result = new ArrayList<>(numberOfElementsToRetain);
        
        for(int i = 0; i < numberOfElementsToRetain; i++){
            result.add(defcopy.get(i));
        }
        
        return result;
    }