regex loops unix bioinformatics

How to concatenate files that have the same beginning of a name?


I have a directory with a few hundred *.fasta files, such as:

Bonobo_sp._str01_ABC784267_CDE789456.fasta
Homo_sapiens_cc21_ABC897867_CDE456789.fasta
Homo_sapiens_cc21_ABC893673_CDE753672.fasta 
Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta
Gorilla_gorilla_ghjk6789_ABC117190_CDE661345.fasta

etc.

I want to concatenate files that belong to the same species, so in this case Homo_sapiens_cc21 and Gorilla_gorilla_ghjk6789.

Almost every species has a different number of files that I need to concatenate.

I know that I could use a simple loop in unix/linux like:

    for f in thesamename.fasta; do
        cat "$f" >> output.fasta
    done

But I don't know how to make the loop pick out only the files that share the same beginning. Doing that by hand makes no sense with hundreds of files.

Does anybody have an idea how I could do that?


Solution

  • I will assume that the naming logic is that the species ID consists of the first three fields separated by underscores. I will also assume that there is no whitespace in the filenames.
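
    Before merging anything, it may be worth checking that this assumption produces the groups you expect. The sketch below counts the files per three-field prefix; `count_by_species` is a hypothetical helper name, and it assumes bash and filenames without whitespace.

    ```shell
    # count_by_species [DIR]: hypothetical helper, not part of any tool.
    # Prints how many *.fasta files share each three-field prefix.
    # Assumes no whitespace in filenames.
    count_by_species() {
        local dir=${1:-.}
        # List bare names, keep fields 1-3, then count identical prefixes.
        ( cd "$dir" && printf '%s\n' *.fasta ) | cut -d_ -f1-3 | sort | uniq -c
    }
    ```

    Running `count_by_species .` in the directory from the question should list each species prefix next to its file count. A prefix that still ends in `.fasta` hints at a filename that does not follow the three-field pattern.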

    A possible strategy is to get a list of all the species, and then concatenate all the files with a given species prefix into a single file:

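    # Build the unique list of species prefixes (first three "_"-separated fields).
    for species in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
    do
        # The "_" after the prefix keeps one ID from matching a longer one.
        cat "$species"_*.fasta > "$species.fasta"
    done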
    

    In this code, you list all the fasta files, cut out the species ID, and generate a unique list of species. Then you loop over this list and, for every species, concatenate all the files that start with that species ID into a single file named after the species.
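
    One thing to watch with this approach: each merged file, for example `Homo_sapiens_cc21.fasta`, also matches `*.fasta`, so a second run in the same directory would pick up the merged files as if they were inputs. A minimal sketch that sidesteps this by writing into a separate directory follows. `merge_by_species` is a hypothetical helper name, and it assumes bash and filenames without whitespace or glob characters.

    ```shell
    # merge_by_species SRC_DIR OUT_DIR: hypothetical helper, not part of any tool.
    # Writes one merged file per species into OUT_DIR, which should differ from
    # SRC_DIR so that a rerun never sees its own output as input.
    merge_by_species() {
        local src=$1 out=$2 prefix prefixes
        mkdir -p "$out"
        # Unique three-field prefixes; assumes no whitespace in filenames.
        prefixes=$(cd "$src" && printf '%s\n' *.fasta | cut -d_ -f1-3 | sort -u)
        for prefix in $prefixes; do
            # The "_" after the prefix stops "x_y_z1" from matching "x_y_z10_...".
            cat "$src/$prefix"_*.fasta > "$out/$prefix.fasta"
        done
    }
    ```

    For example, `merge_by_species . merged` would leave the original files untouched and put `Homo_sapiens_cc21.fasta`, `Gorilla_gorilla_ghjk6789.fasta`, and so on under `merged/`.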

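    More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear. The version below needs bash for the process substitution, and the -z option of cut is a GNU extension: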

    # Read NUL-delimited species prefixes so unusual filenames cannot break the loop.
    while IFS= read -r -d '' species
    do
        cat "$species"_*.fasta > "$species.fasta"
    done < <(find . -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)
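
    A possible middle ground, sketched here under the same three-field naming assumption, is to loop over the glob directly and split each name with `read`. This avoids parsing `ls` and needs no `-z` options, though it still relies on bash for the here-string. `append_by_species` is a hypothetical helper name; because it appends, the output directory should be empty before it runs.

    ```shell
    # append_by_species SRC_DIR OUT_DIR: hypothetical helper, not part of any tool.
    # Appends every *.fasta file in SRC_DIR to OUT_DIR/<first three fields>.fasta.
    append_by_species() {
        local src=$1 out=$2 f a b c rest
        mkdir -p "$out"
        for f in "$src"/*.fasta; do
            [ -f "$f" ] || continue      # the glob matched nothing
            # Split the bare filename on "_"; everything after field 3 goes to $rest.
            IFS=_ read -r a b c rest <<< "${f##*/}"
            cat -- "$f" >> "$out/${a}_${b}_${c}.fasta"
        done
    }
    ```

    A call such as `append_by_species . merged` makes a single pass over the files, and the glob's sorted order determines the order in which files of one species are appended.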