linuxbashloopsbioinformaticsblast

Iterate through files in a directory, create output files, linux


I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.

(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:

Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)

Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)

Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that is has the same name as the initial file, but with ".txt" at the end.

Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.

for sequence in ./sequences/{.,}*;
    do
            echo "$sequence";
            blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
            cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
    done

When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".

If I can provide any further information to clarify anything, please let me know.


Solution

  • If you're only interested in *.fa files I would limit your input to only those matching files like this:

    for sequence in sequences/*.fa; do