stringrandomlanguage-agnosticsample

Random sampling of non-overlapping substrings of length k


Given a string of length n, how would I (pseudo)randomly sample m substrings of size k such that none of the sampled substrings overlap? Most of my scripting experience is in Perl, but an easy-to-run solution in any common language will suffice.


Solution

  • If there is a character that cannot occur in the input, e.g. X, just:

    my $size = 20;
    my $count = 20;
    my $mark = 'X';
    my $input = 'CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGGTTCAGTAGACTTTGTAACACAGTTGTGTCTCTCACAGATTGGCAGATGTTTGGTAAAGGATTGACTTTTCAGCCAACTCATGGGAAAGTGAAATAATGTAAAAAACAGGAAGAATACAGTTTTAGGCCTTTCAAGTGAGGCATGGCTTTCAGCTCTTGGCAAGAACAGGCAAGGAGATGCAAGTTTTAGGACTCTAAGAGGCTAGGCTTTTCAAAGTGCTTCTCTCCCCTTCACCCTCCTTCAGTTACAGCACCAAGCACCACCGAGGTGTTACCTGCAGCCTCACTCTCTACCTGGTTGTGGGATCCTGCCACTTCCTTAACCCACACTGAGTTCCTTGTGGTTCACAGGGTCACACAGAGGGCTGTAGAGATACAAAAGATATATGTGATTTTATATCACCTATCATATGAAGATATATTTATAAAATAGGAAACATATTAACCACTTATCATTTTATATATTTATGGTTTTATGTGTCAAAAATATATTGTTTCATGTATGTATTAAAGGATAAGTATGTATAAGAGGTTTTATAGATGTGTAAAATTATATATTTATACGTATCTTTACAAATTTAAGAATAAAGGAAGGAAAATTCTCAAAGAGGAATTCAGATATCAAGCAGTGCCCTTTGACCAAGAGCCTTGGTTACAACATACCTACAAAAGTGAACTATCATTGAAAGACCTATGGACACTGGATTTCTCTTTCCTTATTTAGAAGGGCAGTCTGTGTCTTGGAAAAGCATACAGTTTGTTGTATCTTGCTGGACAACAGGAGTCA';
    
    if (2*$size*$count-$size-$count >= length($input)) {
        die "selection may not complete; choose a shorter length or fewer substrings, or provide a longer input string\n";
    }
    
    my @substrings;
    while (@substrings < $count) {
        my $pos = int rand(length($input)-$size+1);
        push @substrings, substr($input, $pos, $size, $mark x $size)
            if substr($input, $pos, $size) !~ /\Q$mark/;
    }