
Using sed -i within a loop


I'm reformatting a big file with sample metadata. I have a file (let's call it File2) listing the group each sample belongs to, with one id and pop per line. My idea was to loop over that file with while read and use sed -i to update each sample's info. The issue is that sed is not updating the file.

The input file is a .fam file from plink, in this fashion:

pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9

Right now pop and id are the same, so I want to replace pop with the values from File2, but the sed code I normally use for this doesn't seem to work:

while read -r id pop; do sed -i 's/^$id/$pop/' File1.fam; done < File2.txt

I have tried only the sed command without iteration and it works fine. But I have 700 samples and I would dread having to do this one by one.

Why is it not working?


Solution
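
As for why the original loop does nothing: the sed script is wrapped in single quotes, so the shell never expands $id and $pop, and sed searches for the literal text $id at the start of each line. Below is a minimal sketch of a quoting fix. It assumes GNU sed (for -i without a backup suffix) and IDs that contain no regex metacharacters or slashes; the sample data and temp-directory paths are made up for illustration. The trailing space in the pattern keeps id1 from also matching id10.

```shell
# Hypothetical sample data in a temp dir (pop and id start out identical).
workdir=$(mktemp -d)
fam="$workdir/File1.fam"
map="$workdir/File2.txt"
printf '%s\n' 'id1 id1 0 0 0 -9' 'id2 id2 0 0 0 -9' 'id10 id10 0 0 0 -9' > "$fam"
printf '%s\n' 'id1 POP001' 'id2 POP002' 'id10 POP010' > "$map"

# Double quotes let the shell expand $id and $pop before sed sees the script.
# Assumes GNU sed; BSD/macOS sed would need: sed -i '' "..."
while read -r id pop; do
    sed -i "s/^$id /$pop /" "$fam"
done < "$map"

cat "$fam"
```

This still rewrites the whole .fam file once per line of File2.txt, so a single-pass approach is likely to scale better with hundreds of samples.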

  • Assuming that your files are formatted as follows:

    $ cat file1.fam
    pop id1 0 0 0 -9
    pop id2 0 0 0 -9
    pop id3 0 0 0 -9
    
    $ cat file2.txt
    id3   POP003
    id2   POP002
    id1   POP001
    

    If your goal is to replace the 1st column in file1.fam with the values from the 2nd column of file2.txt, using the id* values for matching, you can:

    1. Read file2.txt into a map: map[id] = pop.
    2. Iterate over file1.fam and replace the 1st field with map[id], where id is taken from the 2nd field.

    E.g.,

    awk 'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
        file2.txt OFS=' ' file1.fam
    

    In the command above, awk reads the two files sequentially: file2.txt, then file1.fam. While it reads the first file, the overall record number NR equals the per-file record number FNR. Once awk moves on to the second file, FNR restarts at 1 while NR keeps counting. The following example (with the files given in the opposite order) shows this:

    $ awk '{print FNR, NR, $0}' file1.fam file2.txt
    1 1 pop id1 0 0 0 -9
    2 2 pop id2 0 0 0 -9
    3 3 pop id3 0 0 0 -9
    1 4 id3   POP003
    2 5 id2   POP002
    3 6 id1   POP001
    

    The NR==FNR block fills the map with keys from the first column (IDs) and values from the second column (pop values), and next skips the remaining code for those lines. For the lines of the second file, the first column (pop) is replaced with the matching value from the map, if there is one.

    The result is printed to the standard output. You can redirect it to a file if you wish:

    awk ... > output.txt
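
If you want the result to end up back in file1.fam, do not redirect straight to it: the shell truncates the output file before awk gets to read it. A minimal sketch of one common workaround is to write to a temporary file and then move it over the original. The temp-directory paths and the .tmp name are assumptions for illustration.

```shell
# Hypothetical sample data matching the layout shown above.
workdir=$(mktemp -d)
fam="$workdir/file1.fam"
map="$workdir/file2.txt"
printf '%s\n' 'pop id1 0 0 0 -9' 'pop id2 0 0 0 -9' 'pop id3 0 0 0 -9' > "$fam"
printf '%s\n' 'id3   POP003' 'id2   POP002' 'id1   POP001' > "$map"

# Write to a temporary file, then replace the original only if awk succeeded.
tmp="$workdir/file1.fam.tmp"
awk 'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
    "$map" "$fam" > "$tmp" && mv "$tmp" "$fam"

cat "$fam"
```

Keeping a copy of the original until you have checked the output is a sensible precaution with real data.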
    

    Note that by default awk splits fields on whitespace (spaces and tabs). If the values in your files may contain spaces, you will need to set the field separator (FS) accordingly or consider other tools (e.g., Perl). The idea remains the same.
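
As an illustration of adjusting the separator, here is a sketch that assumes both files are tab-separated and that values may contain spaces. The file names and sample values are hypothetical; a real plink .fam file would not normally look like this. -F sets the input separator and OFS the output separator, so the tabs survive when awk rebuilds the modified line.

```shell
# Hypothetical tab-separated inputs whose values contain spaces.
workdir=$(mktemp -d)
fam="$workdir/file1.tsv"
map="$workdir/file2.tsv"
printf 'pop\tsample 1\t0\t0\t0\t-9\n' > "$fam"
printf 'pop\tsample 2\t0\t0\t0\t-9\n' >> "$fam"
printf 'sample 2\tPOP B\nsample 1\tPOP A\n' > "$map"

# Split and join on tabs only, so embedded spaces stay inside their fields.
result=$(awk -F'\t' -v OFS='\t' '
    NR==FNR { map[$1]=$2; next }
    { if ($2 in map) $1 = map[$2]; print }
' "$map" "$fam")

printf '%s\n' "$result"
```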