I'm reformatting a big file with sample metadata. I have a file (let's call it File2) listing the group each sample belongs to, with one id and pop per line. My idea was to loop over that file with while read and use sed -i to update each sample's info. The issue is that sed is not updating the file.
The input file is a .fam file from plink, in this fashion:
pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9
Right now pop and id are the same, so I want to update the file with File2, but the sed code I normally use for this doesn't seem to work:
while read -r id pop; do sed -i 's/^$id/$pop/' File1.fam; done < File2.txt
I have tried only the sed command without iteration and it works fine. But I have 700 samples and I would dread having to do this one by one.
Why is it not working?
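The loop runs without errors but changes nothing because the sed script is in single quotes. The shell does not expand variables inside single quotes, so sed never sees the actual id and pop values and nothing matches. A minimal sketch of the loop with double quotes follows; it assumes GNU sed (for -i without a suffix), IDs that contain no regex metacharacters or slashes, and made-up demo files in a temporary directory:

```shell
# Sketch only: the demo data below is hypothetical and mimics the layout
# described in the question (first column currently equal to the id).
dir=$(mktemp -d)
printf '%s\n' 'id1 id1 0 0 0 -9' 'id2 id2 0 0 0 -9' > "$dir/File1.fam"
printf '%s\n' 'id1 POP001' 'id2 POP002' > "$dir/File2.txt"

# Double quotes let the shell expand $id and $pop before sed runs.
# The trailing space keeps id1 from also matching id10.
while read -r id pop; do
    sed -i "s/^$id /$pop /" "$dir/File1.fam"
done < "$dir/File2.txt"

cat "$dir/File1.fam"
```

Note that this rewrites the whole file once per sample, which is fine for 700 samples but scales poorly.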
Assuming that your files are formatted as follows:
$ cat file1.fam
pop id1 0 0 0 -9
pop id2 0 0 0 -9
pop id3 0 0 0 -9
$ cat file2.txt
id3 POP003
id2 POP002
id1 POP001
If your goal is to replace the 1st column in file1.fam with the values from the 2nd column of file2.txt, using the id* values for matching, you can:
1. Read file2.txt into a map: map[id] = pop.
2. Read file1.fam and replace the 1st field with map[id], where id is taken from the 2nd field.
E.g.,
awk 'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
file2.txt OFS=' ' file1.fam
In the command above, awk reads the two files sequentially: file2.txt, then file1.fam. While it reads file2.txt, the overall record number NR is equal to the record number within the current file, FNR. Once awk moves on to the second file, FNR restarts at 1 while NR keeps counting, so NR==FNR is true only for the first file. Look at the following example for better understanding:
awk '{print FNR, NR, $0}' file1.fam file2.txt
1 1 pop id1 0 0 0 -9
2 2 pop id2 0 0 0 -9
3 3 pop id3 0 0 0 -9
1 4 id3 POP003
2 5 id2 POP002
3 6 id1 POP001
The NR==FNR block fills the map with keys from the first column (IDs) and values from the second one (pop values). For the rest of the lines, i.e. those of file1.fam, the first column (pop) is replaced with the matching value from the map (if any).
The result is printed to the standard output. You can redirect it to a file if you wish:
awk ... > output.txt
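Since the question used sed -i to edit the file in place, keep in mind that redirecting the output straight back to file1.fam would truncate that file before awk reads it. A minimal sketch of one common workaround is to write to a temporary file and then move it over the original; the demo files and the name file1.fam.tmp are assumptions for illustration:

```shell
# Sketch only: demo data matches the sample files shown in this answer.
dir=$(mktemp -d)
printf '%s\n' 'pop id1 0 0 0 -9' 'pop id2 0 0 0 -9' 'pop id3 0 0 0 -9' > "$dir/file1.fam"
printf '%s\n' 'id3 POP003' 'id2 POP002' 'id1 POP001' > "$dir/file2.txt"

# Write the result to a temporary file, then replace the original
# only if awk exited successfully.
awk 'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
    "$dir/file2.txt" "$dir/file1.fam" > "$dir/file1.fam.tmp" &&
    mv "$dir/file1.fam.tmp" "$dir/file1.fam"

cat "$dir/file1.fam"
```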
Note that awk splits fields on whitespace by default. If the values in your files may contain spaces, you might need to adjust the field separator (FS) or consider using other tools (e.g., Perl). But the idea will remain the same.
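For instance, if both files were tab-separated (an assumption; the question does not say so), setting both FS and OFS to a tab would keep values containing spaces intact. A minimal sketch with made-up demo data:

```shell
# Sketch only: hypothetical tab-separated inputs where a pop name contains a space.
dir=$(mktemp -d)
printf 'pop\tid1\t0\t0\t0\t-9\n' > "$dir/file1.fam"
printf 'id1\tPOP 001\n' > "$dir/file2.txt"

# -F sets the input field separator; OFS is used when awk rebuilds
# the record after $1 is reassigned.
awk -F'\t' -v OFS='\t' \
    'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
    "$dir/file2.txt" "$dir/file1.fam" > "$dir/out.fam"

cat "$dir/out.fam"
```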