I have 100 FASTA files containing protein sequences stored in a single directory. I need to add each file's name to every FASTA header (the character strings starting with ">") contained within it and then merge all the files into a single .faa file.
I got the merging part going with the following PowerShell commands:
#Change extensions from .faa to .txt
gci -File | Rename-Item -NewName { $_.Name -replace '\.faa$', '.txt' }
#Actual merging
Get-ChildItem $directory -include *.txt -rec | ForEach-Object {gc $_; ""} | out-file $directory
#Change encoding so I can process the file further in R
Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt
After that I just change the extension back to .faa.
Each file stores multiple sequences of proteins. Each header should look like this:
>some_sequence -> >some_sequence file_name
This is my first contact with PowerShell. How can I do this? Best regards!
I assume you're looking for something like the following, which uses a switch
statement to process the individual files and modifies their headers:
Get-ChildItem $directory -Filter *.faa -Recurse |
  ForEach-Object {
    $file = $_
    switch -Regex -File $file.FullName { # Process the file at hand.
      '^>'    { $_ + ' ' + $file.Name } # header line -> append file name
      default { $_ }                    # pass through
    }
    '' # Empty line between the content from the indiv. files.
  } |
  Set-Content -Encoding utf8 test-utf8.txt
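Note:
There is no need to rename the .faa files first.
The UTF-8 encoding can be applied directly to the merged output, via the final Set-Content call.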
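If you want to try the approach on throwaway data before running it on the real files, here is a minimal sketch that wraps the same pipeline in a function and runs it against two tiny sample files in a temporary folder. The function name Merge-FastaWithFileName, the file names sampleA.faa, sampleB.faa and merged.txt, and the sequences are all made up for illustration. The sketch also swaps $file.Name for $file.BaseName, on the assumption that you want the file name without its extension, as in your ">some_sequence file_name" example; switch back to $file.Name to keep the extension.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Merge-FastaWithFileName {
    param(
        [Parameter(Mandatory)] [string] $Directory,  # folder holding the .faa files
        [Parameter(Mandatory)] [string] $OutFile     # path of the merged output file
    )
    Get-ChildItem -LiteralPath $Directory -Filter *.faa -Recurse -File |
        ForEach-Object {
            $file = $_
            switch -Regex -File $file.FullName {
                '^>'    { $_ + ' ' + $file.BaseName }  # header line: append file name without extension
                default { $_ }                         # sequence line: pass through unchanged
            }
            ''  # empty line between the content of the individual files
        } |
        Set-Content -Encoding utf8 -LiteralPath $OutFile
}

# Demo on made-up sample data in a fresh temporary folder.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())
New-Item -ItemType Directory -Path $tmp | Out-Null
Set-Content -LiteralPath (Join-Path $tmp 'sampleA.faa') -Value '>seq1', 'MKV', '>seq2', 'MLA'
Set-Content -LiteralPath (Join-Path $tmp 'sampleB.faa') -Value '>seq3', 'MTT'

# The output deliberately does not end in .faa, so the *.faa filter cannot pick it up.
$merged = Join-Path $tmp 'merged.txt'
Merge-FastaWithFileName -Directory $tmp -OutFile $merged
Get-Content -LiteralPath $merged
```

Every header line in merged.txt should now end with the base name of the file it came from, while the sequence lines are unchanged. Two things to keep in mind: write the merged file outside the input directory, or give it a different extension and rename it afterwards, so that the *.faa filter does not pick it up as input; and note that in Windows PowerShell 5.1, -Encoding utf8 writes a BOM at the start of the file, whereas PowerShell 7 and later write UTF-8 without a BOM by default.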