unix awk text

Awk matching patterns and removing adjacent lines


I have volumetric data from different brain regions and I'm trying to reorganize it to make the analysis easier. To give an idea, this is part of what I have:

LT_Putamen 5075 5075.000000
LT_Temporal 84593 84593.000000
LT_Thalamus 7720 7720.000000
RT_Accumbens 623 623.000000
RT_Accumbens overlaps 64.000000 10.2700
RT_Amygdala 2252 2252.000000
RT_Amygdala overlaps 2133.000000 94.7100

I want to modify it so that the output is:

LT_Putamen 5075 5075.000000
LT_Putamen overlaps 0 0
LT_Temporal 84593 84593.000000
LT_Temporal overlaps 0 0
LT_Thalamus 7720 7720.000000
LT_Thalamus overlaps 0 0
RT_Accumbens 623 623.000000
RT_Accumbens overlaps 64.000000 10.2700
RT_Amygdala 2252 2252.000000
RT_Amygdala overlaps 2133.000000 94.7100

I just want every region to have an "overlaps" line.

I'm fairly new to programming, but I came up with something like this:

awk '{
    if (NR == 1) {
        # Initialize the first region (using the first word in a line)
        region = $1
        print $0
    } else {
        if ($1 != region) {
            # Finalize the old region - printing "overlaps" line with 0 0
            printf("%s overlaps 0 0\n", region)
            # Start the new region
            region = $1
        }
        # Print the current line (for the current region)
        print $0

    }
}
END {
    # For the last region
    if (region) {
        printf("%s 0 0\n", region)
    }
}'

The outcome is close to what I want:

LT_Putamen 5075 5075.000000
LT_Putamen overlaps 0 0
LT_Temporal 84593 84593.000000
LT_Temporal overlaps 0 0
LT_Thalamus 7720 7720.000000
LT_Thalamus overlaps 0 0
RT_Accumbens 623 623.000000
RT_Accumbens overlaps 0 0
RT_Accumbens overlaps 64.000000 10.2700
RT_Amygdala 2252 2252.000000
RT_Amygdala overlaps 0 0
RT_Amygdala overlaps 2133.000000 94.7100

But I get extra "overlaps" lines for regions that already had one. Could you please help me? What should I change to make it work? I'd be very grateful for any help. Thanks!

Marcin


Solution

  • Assumptions/Understandings:

    One awk idea:

    awk '
        { if ($1 != prev && NR > 1 && ! overlaps)       # if $1 changed and the previous line was not an "overlaps" line then ...
             print prev,"overlaps",0,0                  # print the missing "overlaps" line for the previous region
          overlaps = ($2 == "overlaps" ? 1 : 0)         # set flag: is the current line an "overlaps" line?
          prev = $1                                     # save current $1
        }
        1                                               # print current line
        END { if (NR && ! overlaps)                     # if there was input and the last line was not an "overlaps" line then ...
                 print prev,"overlaps",0,0              # print the missing "overlaps" line for the last region
            }
    ' volume.dat
    

    This generates:

    LT_Putamen 5075 5075.000000
    LT_Putamen overlaps 0 0
    LT_Temporal 84593 84593.000000
    LT_Temporal overlaps 0 0
    LT_Thalamus 7720 7720.000000
    LT_Thalamus overlaps 0 0
    RT_Accumbens 623 623.000000
    RT_Accumbens overlaps 64.000000 10.2700
    RT_Amygdala 2252 2252.000000
    RT_Amygdala overlaps 2133.000000 94.7100
    

    To demonstrate correct processing where the last line is not an "overlaps" line:

    Setup:

    $ cat volume.dat
    LT_Putamen 5075 5075.000000
    LT_Temporal 84593 84593.000000
    LT_Thalamus 7720 7720.000000
    RT_Accumbens 623 623.000000
    RT_Accumbens overlaps 64.000000 10.2700
    RT_Amygdala 2252 2252.000000
    RT_Amygdala overlaps 2133.000000 94.7100
    XX_Last_Line 1234 6789.00000
    

    The same code generates:

    LT_Putamen 5075 5075.000000
    LT_Putamen overlaps 0 0
    LT_Temporal 84593 84593.000000
    LT_Temporal overlaps 0 0
    LT_Thalamus 7720 7720.000000
    LT_Thalamus overlaps 0 0
    RT_Accumbens 623 623.000000
    RT_Accumbens overlaps 64.000000 10.2700
    RT_Amygdala 2252 2252.000000
    RT_Amygdala overlaps 2133.000000 94.7100
    XX_Last_Line 1234 6789.00000
    XX_Last_Line overlaps 0 0
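
As a variation on the same idea, here is a minimal sketch that remembers which region is still waiting for its "overlaps" line instead of keeping a flag. The names add_overlaps, pending and emit_pending are made up for illustration and are not part of the code above. It assumes, as in the sample data, that fields are separated by whitespace and that an "overlaps" line, when present, directly follows its region line.

```shell
# add_overlaps: hypothetical helper name, chosen for this sketch only.
# Reads region lines from stdin (or from files given as arguments) and
# prints them back, adding "<region> overlaps 0 0" wherever it is missing.
add_overlaps() {
    awk '
        # Print the placeholder for a region that is still waiting for
        # an "overlaps" line, then forget that region.
        function emit_pending() {
            if (pending != "") print pending, "overlaps", 0, 0
            pending = ""
        }

        # An existing "overlaps" line: keep it as it is.
        $2 == "overlaps" {
            if ($1 == pending) { pending = "" } else { emit_pending() }
            print
            next
        }

        # Any other line starts a new region.
        {
            emit_pending()      # finish the previous region first
            print
            pending = $1        # this region still needs an "overlaps" line
        }

        # Finish the last region; prints nothing for empty input.
        END { emit_pending() }
    ' "$@"
}
```

It could be called as "add_overlaps volume.dat" or with the data on standard input, for example "add_overlaps < volume.dat > volume_fixed.dat" (the output file name is only an example). Wrapping the awk program in a shell function is a convenience, not a requirement; the awk part works on its own as well.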