
R: How to adapt CreateTableOne() for repeated observations per subject and a wrong index?


I'm trying to create a table from the following dataset, of which I'm reporting the first fifty observations here.

[link to the dataset]

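There are some typos in the age and gender variables, which I suggest fixing as follows:

# Give the 8th column a short name
colnames(d)[8] <- 'COND'
# Values starting with "k" (kobieta, Polish for woman) become "F", everything else "M"
d$gender <- ifelse(tolower(substr(d$gender, 1, 1)) == "k", "F", "M")
library(libr)
# Replace missing ages with 21
d <- datastep(d, {
  if (is.na(age)) {
    age <- 21
  }
})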

I'm trying to create a summary table by using the following code:

CreateTableOne(
  vars = c('TASK', 'COND', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
  strata = c('ID'),
  factorVars = c('gender'), 
  argsApprox = list(correct = FALSE), 
  smd = TRUE, 
  addOverall = TRUE, 
  test = TRUE) %>% 
  na.omit() %>% 
  kableone()

obtaining this table:

[image: the resulting summary table]

However, as you can see from this output, the number of females and males is incorrect, because I have many observations for the same subject and there are really just 54 IDs.

length(unique(d$ID)) 
[1] 54

Does anyone know how to fix it? Furthermore, since 'age' and 'T1.ACC' are not normally distributed, does anyone know how I could report them with the median, Q1 and Q3 instead of the mean and SD?


Solution

  • I would like to help you. However, the data you provided has the following problems:

    1. The variable COND is missing
    2. Only one unique value of the TASK variable (the CreateTableOne function does not accept variables with one unique value).
    3. Only one unique value for the variable age.
    4. The variable ID is repeated several times.

    However, even without changing your data, you can see what your problem is. With data in this form, you cannot use CreateTableOne directly. It counts every occurrence of the value m and every occurrence of the value k, and since you have multiple entries per person, each of those entries is counted separately.
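
    To illustrate the point on made-up data (the toy data frame and its values are my assumptions, not your dataset): counting rows gives the number of trials per gender, whereas counting distinct IDs gives the number of people.

    ```r
    library(dplyr)

    # Hypothetical toy data: 3 subjects, several trials each
    toy <- tibble(
      ID     = c("P1", "P1", "P1", "P2", "P2", "P3"),
      gender = c("k",  "k",  "k",  "m",  "m",  "k")
    )

    # Counting rows: this is what a table built on trial-level data reports
    by_row <- toy %>% count(gender)

    # Counting people: keep one row per subject first, then count
    by_subject <- toy %>% distinct(ID, gender) %>% count(gender)

    by_row      # k = 4, m = 2 (trials)
    by_subject  # k = 2, m = 1 (subjects)
    ```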

    Please take a look at the solution I proposed in "How to describe unique values of grouped observations for several vars?".

    Update 1

    OK. Let's take a look at your data. You have 54 patients with different IDs.

    data_Confidence_in_Action %>% distinct(ID) %>% nrow()
    #[1] 54
    

    However, note that one ID appears to be incorrect.

    data_Confidence_in_Action %>% distinct(ID) %>%
      mutate(lenID = str_length(ID)) %>% filter(lenID!=5)
    #  A tibble: 1 x 2
    #  ID         lenID
    #  <chr>      <int>
    #1 P1419 dots    10
    

    We can leave it as it is; correct it yourself if you need to. Note, however, that the gender column contains as many as 8 different values.

    data_Confidence_in_Action %>% distinct(gender)
    #  A tibble: 8 x 1
    #  gender     
    #  <chr>      
    #1 k          
    #2 kobieta    
    #3 M          
    #4 K          
    #5 mężczyzna  
    #6 21         
    #7 m          
    #8 Mężczyzna  
    

    This needs to be fixed. Unfortunately, for patient P1440 the age was entered in the gender column. So what is the gender of P1440?

    data_Confidence_in_Action %>% filter(gender==21) %>% distinct(ID, gender, age)
    #  A tibble: 1 x 3
    #  ID    gender   age
    #  <chr> <chr>  <dbl>
    #1 P1440 21        NA
    
    data_Confidence_in_Action %>% distinct(ID, gender) %>% 
      group_by(gender) %>% summarise(n = n())
    #  A tibble: 8 x 2
    #  gender          n
    #  <chr>       <int>
    #1 21              1
    #2 k              36
    #3 K               3
    #4 kobieta         9
    #5 m               1
    #6 M               1
    #7 mężczyzna       2
    #8 Mężczyzna       1
    

    As you can see, most of your participants are women (k, K, kobieta), so let's assume P1440 is a woman too. Is that OK?

    Finally, notice that two variables have inconvenient names: "Condition (whether a person responded)" and "Go/Nogo (whether a person should respond)".

    Let's fix it all in one go.

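    data_Confidence_in_Action = data_Confidence_in_Action %>% 
      mutate(
        # "[k,K,21]" is a character class: any value containing k, K, ",", 2 or 1 becomes "k"
        gender = ifelse(str_detect(gender, "[k,K,21]"), "k", "m"),
        # missing ages are set to 21 (for P1440 that value sat in the gender column)
        age = ifelse(is.na(age), 21, age)
      ) %>% 
      # give the two long column names short, syntactic names
      rename(Condition = `Condition (whether a person responded)`, 
             Go.Nogo = `Go/Nogo (whether a person should respond)`)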
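
    A note on the pattern used above: "[k,K,21]" is a character class, so it matches any single one of the listed characters anywhere in the value. That happens to be enough for the eight gender values shown earlier. The sketch below checks this on those values and compares it with a stricter, anchored pattern; the stricter pattern is only my suggestion, not something your data requires.

    ```r
    library(stringr)

    # The distinct gender values listed earlier (kobieta = woman, mężczyzna = man)
    vals <- c("k", "kobieta", "M", "K", "m\u0119\u017cczyzna", "21", "m",
              "M\u0119\u017cczyzna")

    # Character class: TRUE if the value contains k, K, a comma, 2 or 1
    loose <- str_detect(vals, "[k,K,21]")

    # Hypothetical stricter alternative: starts with k or K, or is exactly "21"
    strict <- str_detect(vals, "^[kK]|^21$")

    loose
    identical(loose, strict)  # TRUE on these eight values
    ```

    The anchored version would not silently classify, say, a mistyped "m1" as a woman, which the character class would.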
    

    Finally, let's convert some of the variables from chr to factor without changing the values themselves; only the order of the levels is set. I hope I chose it sensibly.

    data_Confidence_in_Action = data_Confidence_in_Action %>% 
      mutate(
        ID = ID %>% fct_inorder(),
        gender = gender %>% fct_infreq(),
        t1.key = t1.key %>% fct_infreq(),
        Condition = Condition %>% fct_infreq(),
        CR.key = CR.key %>% fct_infreq(),
        TASK = TASK %>% fct_infreq(),
        Go.Nogo = Go.Nogo %>% fct_infreq(),
        difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
      )
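
    In case the two forcats helpers are unfamiliar, here is a small sketch on made-up values (the vector x is hypothetical). The level order matters because, as far as I know, CreateTableOne prints only the second level of a two-level factor, as in "TASK = right (%)".

    ```r
    library(forcats)

    x <- c("right", "left", "left")  # hypothetical values

    lv_inorder <- levels(fct_inorder(x))  # order of first appearance
    lv_infreq  <- levels(fct_infreq(x))   # most frequent level first
    lv_manual  <- levels(factor(x, levels = c("left", "right")))  # explicit order

    lv_inorder  # "right" "left"
    lv_infreq   # "left"  "right"
    lv_manual   # "left"  "right"
    ```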
    

    With the data organized this way, let's get to the heart of the problem: what do you really want to analyze? Note that for variables such as TASK, Condition, and t1.key, both possible values occur for every participant.

    data_Confidence_in_Action %>% group_by(ID) %>% summarise(
      nunique.TASK = length(unique(TASK)),
      nunique.Condition = length(unique(Condition)),
      nunique.t1.key = length(unique(t1.key))
    ) %>% distinct(nunique.TASK, nunique.Condition, nunique.t1.key)
    #  A tibble: 1 x 3
    #  nunique.TASK nunique.Condition nunique.t1.key
    #         <int>             <int>          <int>
    #1            2                 2              2
    

    However, if we look at the ratio between the two values of each of these variables, it differs from patient to patient.

    data_Confidence_in_Action %>% group_by(ID) %>% summarise(
      prop.TASK = sum(TASK=="left")/sum(TASK=="right")) %>% 
      distinct()
    
    data_Confidence_in_Action %>% group_by(ID) %>% summarise(
      prop.Condition = sum(Condition=="NR")/sum(Condition=="R"))%>% 
      distinct()
    
    data_Confidence_in_Action %>% group_by(ID) %>% summarise(
      prop.t1.key = sum(t1.key=="None")/sum(t1.key=="space"))%>% 
      distinct()
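
    If a per-patient number is what you are after, a share between 0 and 1 may be easier to read than a ratio, and it cannot divide by zero. A minimal sketch on made-up data (the toy data frame and the column names ratio.left.right and share.right are my own):

    ```r
    library(dplyr)

    # Hypothetical toy data: 2 subjects, 4 trials each
    toy <- tibble(
      ID   = rep(c("P1", "P2"), each = 4),
      TASK = c("left", "right", "left", "right",
               "left", "left",  "left", "right")
    )

    shares <- toy %>%
      group_by(ID) %>%
      summarise(
        # ratio of counts, as in the code above
        ratio.left.right = sum(TASK == "left") / sum(TASK == "right"),
        # share of "right" trials: the mean of a logical vector
        share.right = mean(TASK == "right")
      )

    shares  # P1: ratio 1, share 0.50; P2: ratio 3, share 0.25
    ```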
    

    So please state clearly what you want to summarize and how, because it is not clear to me what result you expect.

    Update 2

    OK. I can see that things are becoming clearer to you. Still, I don't know what you want to summarize. Look below. First, let's collect all the code that prepares the data:

    library(tidyverse)
    library(readxl)
    library(tableone)
    data_Confidence_in_Action <- read_excel("data_Confidence in Action.xlsx")
    
    data_Confidence_in_Action = data_Confidence_in_Action %>%
      mutate(
        gender = ifelse(str_detect(gender, "[k,K,21]"),"k","m"),
        age = ifelse(is.na(age), 21, age)
      ) %>% rename(Condition=`Condition (whether a person responded)`,
                   Go.Nogo = `Go/Nogo (whether a person should respond)`)
    
    data_Confidence_in_Action = data_Confidence_in_Action %>%
      mutate(
        ID = ID %>% fct_inorder(),
        gender = gender %>% fct_infreq(),
        t1.key = t1.key %>% fct_infreq(),
        Condition = Condition %>% fct_infreq(),
        CR.key = CR.key %>% fct_infreq(),
        TASK = TASK %>% fct_infreq(),
        Go.Nogo = Go.Nogo %>% fct_infreq(),
        difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
      )
    

    And now the summary. If we do this:

    CreateTableOne(
      data = data_Confidence_in_Action,
      vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
      strata = 'gender',
      factorVars = c('TASK', 'Condition', 't1.key'), 
      argsApprox = list(correct = FALSE), 
      smd = TRUE, 
      addOverall = TRUE, 
      test = TRUE) %>% 
      kableone()
    

    output

    |                        |Overall      |k            |m            |p      |test |
    |:-----------------------|:------------|:------------|:------------|:------|:----|
    |n                       |41713        |37823        |3890         |       |     |
    |TASK = right (%)        |20832 (49.9) |18889 (49.9) |1943 (49.9)  |0.992  |     |
    |Condition = R (%)       |20033 (48.0) |18130 (47.9) |1903 (48.9)  |0.241  |     |
    |t1.key = space (%)      |20033 (48.0) |18130 (47.9) |1903 (48.9)  |0.241  |     |
    |T1.response (mean (SD)) |0.48 (0.50)  |0.48 (0.50)  |0.49 (0.50)  |0.241  |     |
    |age (mean (SD))         |20.74 (2.67) |20.75 (2.70) |20.60 (2.33) |0.001  |     |
    |T1.ACC (mean (SD))      |0.70 (0.46)  |0.70 (0.46)  |0.73 (0.45)  |<0.001 |     |
    

    we get a summary over all observations, that is, n == 41713. Since there are many observations for each patient, such a summary is of little use, at least in my opinion. However, we can summarize a few selected patients.

    CreateTableOne(
      data = data_Confidence_in_Action %>% 
        filter(ID %in% c('P1323', 'P1403', 'P1404')) %>% 
        mutate(ID = ID %>% fct_drop()),
      vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
      strata = c('ID'),
      factorVars = c('TASK', 'Condition', 't1.key'), 
      argsApprox = list(correct = FALSE), 
      smd = TRUE, 
      addOverall = TRUE, 
      test = TRUE) %>% 
      kableone()
    

    output

    |                        |Overall      |P1323        |P1403        |P1404        |p      |test |
    |:-----------------------|:------------|:------------|:------------|:------------|:------|:----|
    |n                       |2323         |775          |776          |772          |       |     |
    |TASK = right (%)        |1164 (50.1)  |390 (50.3)   |386 (49.7)   |388 (50.3)   |0.969  |     |
    |Condition = R (%)       |1168 (50.3)  |385 (49.7)   |435 (56.1)   |348 (45.1)   |<0.001 |     |
    |t1.key = space (%)      |1168 (50.3)  |385 (49.7)   |435 (56.1)   |348 (45.1)   |<0.001 |     |
    |T1.response (mean (SD)) |0.50 (0.50)  |0.50 (0.50)  |0.56 (0.50)  |0.45 (0.50)  |<0.001 |     |
    |age (mean (SD))         |19.66 (0.94) |19.00 (0.00) |19.00 (0.00) |21.00 (0.00) |<0.001 |     |
    |T1.ACC (mean (SD))      |0.70 (0.46)  |0.67 (0.47)  |0.77 (0.42)  |0.65 (0.48)  |<0.001 |     |
    

    This makes more sense, but it is reported separately for each patient.
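
    Coming back to the two things asked in the question (counting each person once, and medians with quartiles), here is a minimal sketch on made-up data. The toy values and the choice of mean accuracy per subject are my assumptions. The idea is to collapse the data to one row per ID first, and then let print() treat age and T1.ACC as non-normal through its nonnormal argument, which reports median [IQR] instead of mean (SD).

    ```r
    library(dplyr)
    library(tableone)

    # Hypothetical toy data: 6 subjects with 3 trials each
    toy <- tibble(
      ID     = rep(paste0("P", 1:6), each = 3),
      gender = rep(c("k", "k", "k", "k", "m", "m"), each = 3),
      age    = rep(c(19, 20, 21, 32, 19, 25), each = 3),
      T1.ACC = c(1, 0, 1,  1, 1, 1,  0, 0, 1,  1, 1, 0,  1, 0, 0,  1, 1, 1)
    )

    # One row per subject: age is constant within ID, accuracy is averaged
    per_subject <- toy %>%
      group_by(ID, gender) %>%
      summarise(age = first(age), T1.ACC = mean(T1.ACC), .groups = "drop") %>%
      as.data.frame()

    tab <- CreateTableOne(
      data = per_subject,
      vars = c("gender", "age", "T1.ACC"),
      factorVars = "gender"
    )

    # n is now the number of subjects; age and T1.ACC print as median [IQR]
    print(tab, nonnormal = c("age", "T1.ACC"))
    ```

    On the real data, per_subject would be built from data_Confidence_in_Action in the same way, and strata = 'gender' could be added to CreateTableOne. I believe the nonnormal argument can also be passed through kableone(), since it forwards its arguments to print(), but please check that on your installation.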

    Alternatively, you can do this summary without using CreateTableOne, e.g. like this:

    data_Confidence_in_Action %>% group_by(gender, ID) %>% 
      # one row per patient, so that each person is counted once
      summarise(
        age = min(age)) %>% group_by(gender) %>% 
      summarise(
        n = n(),
        Min = min(age),
        Q1 = quantile(age, 1/4, na.rm = TRUE),  # default quantile type 7
        mean = mean(age),
        median = median(age),
        Q3 = quantile(age, 3/4, na.rm = TRUE),
        Max = max(age),
        IQR = IQR(age),
        Kurt = e1071::kurtosis(age),
        skew = e1071::skewness(age),
        SD = sd(age))
    

    output

    # A tibble: 2 x 12
      gender     n   Min    Q1  mean median    Q3   Max   IQR  Kurt  skew    SD
      <fct>  <int> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 k         49    19    19  20.8     20    21    32     2  7.47 2.79   2.73
    2 m          5    19    19  20.6     19    21    25     2 -1.29 0.823  2.61
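
    A side note on quantile(): its third positional argument is na.rm, not type, so a different quantile definition has to be requested by name. A small check on made-up ages (the vector x is hypothetical):

    ```r
    x <- c(19, 19, 19, 21, 25)  # hypothetical ages

    q_default <- quantile(x, 3/4)            # type 7, R's default
    q_pos     <- quantile(x, 3/4, 8)         # 8 is matched to na.rm, so still type 7
    q_type8   <- quantile(x, 3/4, type = 8)  # type 8 requested by name

    unname(c(q_default, q_pos, q_type8))  # 21.00000 21.00000 22.33333
    ```

    With only a handful of subjects per group, the choice of type can visibly change Q1 and Q3, so it is worth stating which one you report.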
    

    Think carefully and write down what you really expect, provided, of course, that this topic is still of interest to you.