I am preparing a phenotype file for a GWAS. I found this conversation helpful but it is not quite what I need to do. I have a large 44k participant txt file (containing all cohort participants). Column1=FID, Column=IID, Coumn3=pseudoID I want to create a 4th column with my phenotype of interest (1=case, 0=control, NA=all other participants). I have 2 separate text files that contain just a column with the pseudoID for my controls and antother txt file for my cases.
(1) How do i create a header for the 4th column?
(2) How do i join the pseudoID from the separate control and case txt file to create a 0 or 1 as required in the 4th column.
(3) How do the remaining empty rows in the 4th column become NA?
I will be using Regenie for the GWAS. I am more familiar with linux less so R. Any help would be appreciated. Thank you.
44k participant file txt
ppl <- data.frame(FID = 1,
IID = c(150023532, 150023457, 150075826,
150065943, 150034923),
Pseudo_ID = c("E78GJHI", "E96GH25", "E56HFT7",
"EH87HN7", "ENM8H53"))
ppl
# FID IID Pseudo_ID
# 1 1 150023532 E78GJHI
# 2 1 150023457 E96GH25
# 3 1 150075826 E56HFT7
# 4 1 150065943 EH87HN7
# 5 1 150034923 ENM8H53
Case txt
case <- c("E78GJHI", "ENM8H53")
Control txt
ctrl <- c("E96GH25", "EH87HN7")
The expected output
Phenotype File result
FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1
Is this what you are trying to do? It might not be the most efficient one but you could do the following.
case file (add header and create $2
with status ($2=1
for cases))
awk 'BEGIN{print "Pseudo_ID","ICD_10"}; { print $1,$2=1}' OFS=" " case.txt > case_1.txt
control file (do not add header but create $2
with status ($2=0
for controls))
awk '{ print $1,$2=0}' OFS=" " control.txt > control_1.txt
Merge the two files together
cat case_1.txt control_1.txt > case_control.txt
Match case_control.txt
with the phenotype file to get the desired output
awk 'BEGIN {FS=OFS=" "} NR==FNR {a[$1]=$2;next}{print $0, ($3 in a ? a[$3]:"NA")}' case_control.txt phenotype.txt
FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1