
bash script - Use patterns list in sed to remove substrings


I have this file (adapters.txt) with a list of patterns:

cactctttccctacacgacgctcttccg
cactctttccctacacgacgctcttccgaatcta
cactctttccctacacgacgctcttccgaatctaatt
cactctttccctacacgacgctcttccgaatctaatta
cactctttccctacacgacgctcttccgaatctag
cactctttccctacacgacgctcttccgaatctagc
cactctttccctacacgacgctcttccgacctcattcc
cactctttccctacacgacgctcttccgacctcattcccaccctcttccg
cactctttccctacacgacgctcttccgatc
cactctttccctacacgacgctcttccgatccaatt
cactctttccctacacgacgctcttccgatttagc
cactctttccctacacgacgctcttccgatttagct
cactctttccctacacgacgctcttccgatttcattc
cactctttccctacacgacgctcttccgatttcattcttcccc
cactctttccctacacgacgctcttccgattttatttc
cactctttccctacacgacgctcttccggatcta
cactctttccctacacgacgctcttccggatctaatt
cactctttccctacacgacgctcttccggatctaattc
cactctttccctacacgacgctcttccggatctaattca
cactctttccctacacgacgctcttccggatctagctt
cactctttccctacacgacgctcttccggttcta
cactctttccctacacgacgctttccgatcta
cactctttccctacacgacgctttccgatctaattc
cactctttccctacacgacgtcttccgatctaattctggaccatagtgcaatgt
cactctttccctacacgcgctcttccgatcta
cactctttccctacacgcgctcttccgatctaattcg
cactctttccctacacgcgctcttccgatctaattcgg
cactctttccctacacgcgctcttccgatctaattcggcgg
cactctttccctacacgcgctcttccgatctagct
cactctttccctaccgacgctcttccgatcta
cactctttccctacacgacg

I need to find and remove these patterns from the "sequences.fasta" file:

>seq01
cactctttccctacacgacgctcttccgWANTEDSEQUENCE
>seq01
cactctttccctacacgacgctcttccgaatctaWANTEDSEQUENCE
>seq03
cactctttccctacacgacgctcttccgaatctaattWANTEDSEQUENCE
>seq04
cactctttccctacacgacgctcttccgaatctaattaWANTEDSEQUENCE
>seq05
cactctttccctacacgcgctcttccgatctaattcggWANTEDSEQUENCE
>seq06
cactctttccctacacgcgctcttccgatctaattcggcggWANTEDSEQUENCE
>seq07
cactctttccctacacgcgctcttccgatctagctWANTEDSEQUENCE
>seq08
cactctttccctaccgacgctcttccgatctaWANTEDSEQUENCE

So the wanted output should be:

>seq01
WANTEDSEQUENCE
>seq01
WANTEDSEQUENCE
>seq03
WANTEDSEQUENCE
>seq04
WANTEDSEQUENCE
>seq05
WANTEDSEQUENCE
>seq06
WANTEDSEQUENCE
>seq07
WANTEDSEQUENCE
>seq08
WANTEDSEQUENCE

(Just for the sake of the example I've used "WANTEDSEQUENCE" instead of the real sequences)

I've tried the following (and some variations; I've also tried a while read loop):

ADAPS=($(cat adapters.txt))
FASTA="sequences.fasta"


for ADAP in "${ADAPS[@]}";
do
    sed "s/${ADAP}//g" "${FASTA}" > output.fasta
done

But I got this:

>seq01
ctcttccgWANTEDSEQUENCE
>seq01
ctcttccgaatctaWANTEDSEQUENCE
>seq03
ctcttccgaatctaattWANTEDSEQUENCE
>seq04
ctcttccgaatctaattaWANTEDSEQUENCE
>seq05
cactctttccctacacgcgctcttccgatctaattcggWANTEDSEQUENCE
>seq06
cactctttccctacacgcgctcttccgatctaattcggcggWANTEDSEQUENCE
>seq07
cactctttccctacacgcgctcttccgatctagctWANTEDSEQUENCE
>seq08
cactctttccctaccgacgctcttccgatctaWANTEDSEQUENCE

How can I solve this?


Solution

  • Sort adapters.txt by line length in descending order, turn the sorted lines into a sed script, and pass that script to a second sed through bash's process substitution <(...) so that it is applied to sequences.fasta:

    sed -f <(awk '{ print length, $0 }' adapters.txt | sort -rn | cut -d" " -f2- | sed -E 's/(.*)/s|&||/') sequences.fasta
    

    Output:

    >seq01
    WANTEDSEQUENCE
    >seq01
    WANTEDSEQUENCE
    >seq03
    WANTEDSEQUENCE
    >seq04
    WANTEDSEQUENCE
    >seq05
    WANTEDSEQUENCE
    >seq06
    WANTEDSEQUENCE
    >seq07
    WANTEDSEQUENCE
    >seq08
    WANTEDSEQUENCE
    

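    The sorting of adapters.txt is necessary because some adapters are prefixes of others. If a shorter adapter were removed first, the tail of a longer one would be left behind in the sequence, so the longest adapters have to be tried first.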
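
    A quick way to see the effect of the order is to apply two adapters from the list to a single line, once shortest first and once longest first. This is only a small sketch; the variable names are made up for the illustration:

    ```shell
    # Two adapters from adapters.txt; the first is a prefix of the second.
    short="cactctttccctacacgacgctcttccg"
    long="cactctttccctacacgacgctcttccgaatcta"
    line="${long}WANTEDSEQUENCE"

    # Shortest first: only the shared prefix is removed, "aatcta" stays behind.
    short_first=$(printf '%s\n' "$line" | sed -e "s|$short||" -e "s|$long||")

    # Longest first: the whole adapter is removed.
    long_first=$(printf '%s\n' "$line" | sed -e "s|$long||" -e "s|$short||")

    printf '%s\n' "$short_first"   # aatctaWANTEDSEQUENCE
    printf '%s\n' "$long_first"    # WANTEDSEQUENCE
    ```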

    The same code split into separate steps, with intermediate files:

    awk '{ print length, $0 }' adapters.txt | sort -rn | cut -d" " -f2- > adapters_sorted.txt
    sed -E 's/(.*)/s|&||/' adapters_sorted.txt > sed.script
    sed -f sed.script sequences.fasta
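
For comparison, the loop from the question can probably be repaired along the same lines. Apart from the ordering, its problem is that every iteration reads the original sequences.fasta and overwrites output.fasta, so only the last adapter in the list ends up being applied. The sketch below copies the input once and then edits that copy in place, longest adapter first. It assumes GNU sed (for -i), and the scratch directory and the shortened sample data are there only to make the example self-contained:

```shell
# Work in a scratch directory so no real files are touched (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample of the data from the question.
printf '%s\n' \
  cactctttccctacacgacgctcttccg \
  cactctttccctacacgacgctcttccgaatcta \
  cactctttccctacacgacg > adapters.txt
printf '%s\n' \
  '>seq01' cactctttccctacacgacgctcttccgWANTEDSEQUENCE \
  '>seq02' cactctttccctacacgacgctcttccgaatctaWANTEDSEQUENCE > sequences.fasta

# Start from a copy of the input; each pass then edits the previous result.
cp sequences.fasta output.fasta

# Feed the adapters longest first and remove them one at a time.
awk '{ print length, $0 }' adapters.txt | sort -rn | cut -d" " -f2- |
while IFS= read -r adap; do
    sed -i "s|${adap}||" output.fasta   # -i edits the file in place (GNU sed)
done

result=$(cat output.fasta)
printf '%s\n' "$result"
```

On this sample the result should contain only the two header lines, each followed by WANTEDSEQUENCE. Running sed once per adapter rereads the whole file on every pass, so for large FASTA files the single sed -f call above is likely the better choice.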