Here is an example of the df I have:
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  age = c(30, 45, 50, 60, 35),
  sis_num = c(0, 1, 2, 3, 4),
  sis1_bca_status = c(0, 1, 0, 1, 0),
  sis2_bca_status = c(0, 0, 1, 0, 1),
  sis1_bca_age = c(0, 40, 50, 60, 70),
  sis2_bca_age = c(0, 45, 55, 65, 75),
  menopause_status = c(0, 1, 2, 0, 1),
  meno_age = c(55, 60, 50, 55, 52),
  hrt_status = c(0, 1, 2, 1, 0)
)
I need to go through each row of this df and produce a new df. It copies all the values for each id from this df and then, based on the value of sis_num (number of sisters), adds seven values per sister to the output for that id: if sis_num = 0, no sister values; if sis_num = 1, 7 extra values; if sis_num = 2, 14 extra values; and so on. So in the output, each row will have a different number of columns. (I need this output in txt format to feed into another program that calculates something else; that is the only input it will accept.) This is what I have done so far:
# Function to modify each row based on sis_num
create_row <- function(row) {
  # Extract values from the row using the correct indexing
  id <- row[1]
  age <- row[2]
  sis_num <- row[3]
  sis1_bca_status <- row[4]
  sis2_bca_status <- row[5]
  sis1_bca_age <- row[6]
  sis2_bca_age <- row[7]
  menopause_status <- row[8]
  meno_age <- row[9]
  hrt_status <- row[10]
  # Initialize the new row with id, age, sis_num
  new_row <- c(id, age, sis_num)
  # Add dynamic values for the first two sisters
  if (sis_num == 0) {
    # If sis_num is 0, return the row as it is
    new_row <- c(new_row)
  } else if (sis_num == 1) {
    new_row <- c(new_row, sis1_bca_status, 0, 0, sis1_bca_age, -99, -99, 0)
  } else if (sis_num == 2) {
    new_row <- c(new_row, sis1_bca_status, 0, 0, sis1_bca_age, -99, -99, 0,
                 sis2_bca_status, 0, 0, sis2_bca_age, -99, -99, 0)
  } else if (sis_num > 2) {
    new_row <- c(new_row, sis1_bca_status, 0, 0, sis1_bca_age, -99, -99, 0,
                 sis2_bca_status, 0, 0, sis2_bca_age, -99, -99, 0)
    # Add static values for remaining sisters (sis3, sis4, etc.)
    for (i in 3:sis_num) {
      new_row <- c(new_row, 0, 0, 0, -99, -99, -99, 0)
    }
  }
  # Add the remaining columns (menopause_status, meno_age, hrt_status)
  new_row <- c(new_row, menopause_status, meno_age, hrt_status)
  return(new_row)
}
# Apply the function to each row of the dataframe
output_df <- t(apply(df, 1, create_row))
output_df2 <- do.call(rbind, output_df)
# Convert the output to a dataframe
output_df2 <- as.data.frame(output_df2)
print(output_df2)
The output I get for output_df2 is:
V1
V1 1, 30, 0, 1, 55, 0
V2 2, 45, 1, 1, 0, 0, 40, -99, -99, 0, 1, 60, 1
V3 3, 50, 2, 0, 0, 0, 50, -99, -99, 0, 1, 0, 0, 55, -99, -99, 0, 2, 50, 2
V4 4, 60, 3, 1, 0, 0, 60, -99, -99, 0, 0, 0, 0, 65, -99, -99, 0, 0, 0, 0, -99, -99, -99, 0, 0, 55, 1
V5 5, 35, 4, 0, 0, 0, 70, -99, -99, 0, 1, 0, 0, 75, -99, -99, 0, 0, 0, 0, -99, -99, -99, 0, 0, 0, 0, -99, -99, -99, 0, 1, 52, 0
> write.table(output_df2, file = "output_df.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Error in write.table(output_df2, file = "output_df.txt", sep = "\t", row.names = FALSE, :
unimplemented type 'list' in 'EncodeElement'
I want this output in a txt file without the V1, V2, V3 labels. Each ID gets its own row of values, and the number of columns in each row (i.e. per ID) varies with sis_num. To fix the list issue, I tried the solutions previously shared here:
> output_df2 <- apply(output_df,2,as.character)
write.csv(output_df2,file = "output_df.csv")
The CSV output this produces is not what I want. How do I fix this?
In real life, I have 80 variables in the input file. Is there any way I can copy them into the new data frame faster, rather than indexing each value?
I do not fully understand your function. Here is what I have gathered from it:
f = \(r) {
  i = r[c('id', 'age', 'sis_num')]
  j = r[c('menopause_status', 'meno_age', 'hrt_status')]
  if (r['sis_num'] == 0) {
    c(i, j)
  } else if (r['sis_num'] == 1) {
    c(i, r['sis1_bca_status'], 0, 0, r['sis1_bca_age'], -99, -99, 0, j)
  } else {
    # sis_num >= 2: only the first two sisters are handled here
    c(i, r['sis1_bca_status'], 0, 0, r['sis1_bca_age'], -99, -99, 0,
      r['sis2_bca_status'], 0, 0, r['sis2_bca_age'], -99, -99, 0, j)
  }
}
Please add back in what is missing, e.g. the extra values for sis_num > 2. Applying f row-wise:
l = apply(df, 1, \(x) toString(f(x)))
l is a character vector, which looks like
> l
[1] "1, 30, 0, 0, 55, 0"
[2] "2, 45, 1, 1, 0, 0, 40, -99, -99, 0, 1, 60, 1"
[3] "3, 50, 2, 0, 0, 0, 50, -99, -99, 0, 1, 0, 0, 55, -99, -99, 0, 2, 50, 2"
[4] "4, 60, 3, 1, 0, 0, 60, -99, -99, 0, 0, 0, 0, 65, -99, -99, 0, 0, 55, 1"
[5] "5, 35, 4, 0, 0, 0, 70, -99, -99, 0, 1, 0, 0, 75, -99, -99, 0, 1, 52, 0"
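One way the missing sis_num > 2 case might be added back is sketched below. It is only an illustration: the names sister_block, f_full and l_full are made up here, and it assumes, as in the question, that only the first two sisters have their own columns and that every further sister gets the fixed placeholder 0, 0, 0, -99, -99, -99, 0.

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  age = c(30, 45, 50, 60, 35),
  sis_num = c(0, 1, 2, 3, 4),
  sis1_bca_status = c(0, 1, 0, 1, 0),
  sis2_bca_status = c(0, 0, 1, 0, 1),
  sis1_bca_age = c(0, 40, 50, 60, 70),
  sis2_bca_age = c(0, 45, 55, 65, 75),
  menopause_status = c(0, 1, 2, 0, 1),
  meno_age = c(55, 60, 50, 55, 52),
  hrt_status = c(0, 1, 2, 1, 0)
)

# Hypothetical helper: the seven values for sister k of row r
sister_block <- function(r, k) {
  if (k <= 2) {
    # Sisters 1 and 2 have real status/age columns
    c(r[[paste0("sis", k, "_bca_status")]], 0, 0,
      r[[paste0("sis", k, "_bca_age")]], -99, -99, 0)
  } else {
    # Assumed placeholder for sisters without their own columns
    c(0, 0, 0, -99, -99, -99, 0)
  }
}

# Hypothetical variant of f that loops over however many sisters there are
f_full <- function(r) {
  n <- r[["sis_num"]]
  # seq_len(0) is empty, so sis_num == 0 adds nothing
  sisters <- unlist(lapply(seq_len(n), function(k) sister_block(r, k)))
  unname(c(r[c("id", "age", "sis_num")], sisters,
           r[c("menopause_status", "meno_age", "hrt_status")]))
}

l_full <- apply(df, 1, function(x) toString(f_full(x)))
```

Each element of l_full should then hold 6 + 7 * sis_num values for its id.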
We can use cat() to write l to disk as a csv file called test.csv.
cat(l, sep='\n', file='test.csv')
It has been stored in the working directory. Type getwd() to see where.
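Since the question asks for a tab-separated txt file and mentions around 80 input variables, here is a rough sketch of how both might be handled. It rests on assumptions: that all columns are numeric, that the sister columns can be recognised by the pattern ^sis[0-9]+_, and that every column which is neither a leading column nor a sister column should be copied at the end. The names lead_cols, sis_cols, tail_cols, rows and out_lines are illustrative, and the data is cut down to one sister to keep the example short.

```r
# Cut-down stand-in for the real data (assumed all numeric)
df <- data.frame(
  id = c(1, 2), age = c(30, 45), sis_num = c(0, 1),
  sis1_bca_status = c(0, 1), sis1_bca_age = c(0, 40),
  menopause_status = c(0, 1), meno_age = c(55, 60), hrt_status = c(0, 1)
)

# Select column groups by name instead of indexing each value
lead_cols <- c("id", "age", "sis_num")
sis_cols  <- grep("^sis[0-9]+_", names(df), value = TRUE)  # does not match sis_num
tail_cols <- setdiff(names(df), c(lead_cols, sis_cols))    # everything else

rows <- lapply(seq_len(nrow(df)), function(i) {
  r <- unlist(df[i, ])  # named numeric vector for row i
  # NULL when there is no sister, so nothing is added
  sisters <- if (r[["sis_num"]] >= 1) {
    c(r[["sis1_bca_status"]], 0, 0, r[["sis1_bca_age"]], -99, -99, 0)
  }
  unname(c(r[lead_cols], sisters, r[tail_cols]))
})

# One tab-separated line per id; lines may differ in length
out_lines <- vapply(rows, paste, character(1), collapse = "\t")
writeLines(out_lines, "output_df.txt")
```

Because each row is collapsed into a single string before writing, the differing row lengths never have to fit into a rectangular data frame, which is what write.table() was complaining about.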