linux bash awk sed text-manipulation

How can I select a string by reading a field name and use it to place it in a different field in bash?


I have a huge text file (thousands of lines) containing chunks of previously filtered information, separated by empty lines. Here is an example with a few of them:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

I would like to read the value of the "Name" field and add a new field named "Compound ID" right after "Formula". Expected output:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

So far I've tried:

# the script needs to receive the name of the file to be formatted
original_file=$1

# negative match to retain only the fields of interest, written to a tmp file
grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' "$original_file" > formatted.txt

#getting Name index
index=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))

#getting the name 
name=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))

#reading and adding the new field
for i in ${index[*]}; do
    for x in ${name[$i]}; do
        p=4
        pos=`expr $i + $p`
        sed -i "$pos i Compound ID: ${x}" formatted.txt
    done
done

The inner for loop is giving me problems: the names contain spaces, so they get split on whitespace when stored in the array, and my strategy of matching the line index with the index of the names does not work.

I don't know whether there's a way to do it in bash or awk.


Solution

  • If sed is an option, you can try this:

    $ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
    Name: CAR 8:1
    Precursor type: [M]+
    Formula: C15H28NO4
    Compound ID: CAR 8:1
    Num Peaks: 2
    85.02841 800
    286.2013 999
    
    Name: AHexCer (O-28:1)18:1;2O/17:0;O
    Precursor type: [M+H]+
    Formula: C69H131NO10
    Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
    Num Peaks: 10
    239.2375 150
    252.2691 50
    264.2691 200
    282.2797 100
    
    Name: AHexC (O-8:1)18:1;2O/17:0;O
    Precursor type: [M+H]+
    Formula: C69H131NO10
    Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
    Num Peaks: 10
    239.2375 150
    252.2691 50
    264.2691 200
    282.2797 100
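
Since the question also asks about awk, here is a minimal sketch of the same idea. The sed command above keeps a rewritten copy of the Name line in the hold space and appends it after the Formula line; the awk version below stores the name in a variable instead. The function name add_compound_id and the small sample file are only illustrative, and the sketch assumes every block has a "Name:" line before its "Formula:" line, as in the example data.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name chosen for this example): prints the file given
# as $1 with a "Compound ID" line added after every "Formula:" line.
add_compound_id() {
    awk '
        /^Name:/ {
            # keep everything after "Name: ", spaces included
            id = $0
            sub(/^Name:[[:space:]]*/, "", id)
        }
        { print }                         # print every input line unchanged
        /^Formula:/ {
            print "Compound ID: " id      # emit the new field after Formula
        }
    ' "$1"
}

# Small sample taken from the question, written to a temporary file
sample=$(mktemp)
printf '%s\n' \
    'Name: CAR 8:1' \
    'Precursor type: [M]+' \
    'Formula: C15H28NO4' \
    'Num Peaks: 2' \
    '85.02841 800' \
    '286.2013 999' > "$sample"

add_compound_id "$sample"
rm -f "$sample"
```

Because the whole line is handled as one string, names with spaces such as "AHexCer (O-28:1)18:1;2O/17:0;O" are copied intact and no line-number bookkeeping is needed. To keep the result, redirect the output to a new file rather than to the input file itself.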