I have a directory with a few hundred *.fasta files, such as:
Bonobo_sp._str01_ABC784267_CDE789456.fasta
Homo_sapiens_cc21_ABC897867_CDE456789.fasta
Homo_sapiens_cc21_ABC893673_CDE753672.fasta
Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta
Gorilla_gorilla_ghjk6789_ABC117190_CDE661345.fasta
etc.
I want to concatenate files that belong to the same species, so in this case Homo_sapiens_cc21 and Gorilla_gorilla_ghjk6789.
Almost every species has a different number of files that I need to concatenate.
I know that I could use a simple loop in unix/linux like:
for f in thesamename.fasta; do
    cat "$f" >> output.fasta
done
But I don't know how to make the loop pick up only the files that share the same beginning. Doing this manually makes no sense with hundreds of files.
Does anybody have an idea how I could do that?
I will assume that the naming logic is that the species identifier consists of the first three underscore-separated fields. I will also assume that there are no blank spaces in the filenames.
A possible strategy is to get a list of all the species prefixes, and then concatenate all the files sharing each prefix into a single file:
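# List only input files (at least three underscores), so earlier outputs are skipped on a rerun.
for specie in $(ls *_*_*_*.fasta | cut -f1-3 -d_ | sort -u)
do
    # The underscore after the prefix keeps the output file out of its own input list.
    cat "$specie"_*.fasta > "$specie.fasta"
done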
In this code, you list the fasta files, cut out the species prefix and generate a unique list of prefixes. Then you traverse this list and, for every species, concatenate all the files that start with that prefix into a single file named after the species.
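Before writing any merged files, it may help to preview how the files will be grouped. The following is a minimal sketch that reuses the same ls/cut pipeline but counts the files per prefix instead of concatenating them. The function name list_species_groups is a hypothetical helper, not something from the question, and it relies on the same assumptions (three-field prefix, no whitespace in filenames).

```shell
#!/usr/bin/env bash
# Dry run: count how many fasta files share each three-field prefix.
# list_species_groups is a hypothetical helper name used only for illustration.
list_species_groups() {
    local dir=${1:-.}
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$dir" && ls *.fasta | cut -f1-3 -d_ | sort | uniq -c )
}
```

Calling list_species_groups . in the data directory should print one line per species prefix, preceded by the number of files that carry it, which makes an unexpected prefix easy to spot before anything is written.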
More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:
while IFS= read -r -d '' specie
do
    cat "$specie"_*.fasta > "$specie.fasta"
done < <(find . -maxdepth 1 -name "*_*_*_*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)
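A middle ground between the two versions above is to let the shell glob the files and extract the prefix with parameter expansion, which avoids parsing ls output without needing the GNU-specific -z options. This is only a sketch under the same naming assumption (species = first three underscore-separated fields). The function name concat_by_species and the merged output subdirectory are hypothetical choices made for this example, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: concatenate fasta files per three-field prefix using only globs.
# concat_by_species and the "merged" subdirectory are hypothetical names.
concat_by_species() {
    local dir=${1:-.} file base rest prefix prev=
    mkdir -p "$dir/merged"
    for file in "$dir"/*_*_*_*.fasta; do
        [ -e "$file" ] || continue        # the glob matched nothing
        base=${file##*/}                  # strip the directory part
        rest=${base#*_*_*_}               # text after the third underscore
        prefix=${base%"_$rest"}           # the first three fields
        # Skip if the previous file had the same prefix (already written).
        [ "$prefix" = "$prev" ] && continue
        # Overwrite, not append, so running the function twice gives the same result.
        cat "$dir/$prefix"_*.fasta > "$dir/merged/$prefix.fasta"
        prev=$prefix
    done
}
```

It would be called as concat_by_species /path/to/fasta_dir. Writing the results into a separate subdirectory means the merged files can never be picked up as input, and the prev check is only an optimization: if two files of one species are not adjacent in glob order, the same output is simply rewritten with identical content.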