I'm trying to create a summary table for the following dataset, of which I'm reporting the first fifty observations here.
There are some typos in the age and gender variables, which I suggest fixing as follows:
colnames(d)[8] <- 'COND'
d$gender = ifelse(tolower(substr(d$gender,1,1)) == "k", "F", "M")
library(libr)
d <- datastep(d, {
if (is.na(age)) {
age <- 21
}}
)
I'm trying to create a summary table by using the following code:
CreateTableOne(
vars = c('TASK', 'COND', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = c('ID'),
factorVars = c('gender'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
na.omit() %>%
kableone()
obtaining this table
However, as you can see from this output, every row is counted. Because I have many observations for the same subject and there are only 54 IDs, the numbers of females and males are incorrect.
length(unique(d$ID))
[1] 54
Does anyone know how to fix it? Furthermore, as 'age' and 'T1.ACC' are not normally distributed, does anyone know how I could report them as median with Q1 and Q3, for example?
I would like to help you. However, there are the following problems with the data you provide:
- COND is missing.
- TASK variable (the CreateTableOne function does not accept variables with one unique value).
- age.
- ID is repeated several times.
However, even without changing your data, you can see what your problem is. If you have data in this form, you cannot use CreateTableOne! This is because it counts every occurrence of the value m and every occurrence of the value k. And since you have multiple entries for one person, the CreateTableOne function will count each occurrence separately.
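To make the double counting concrete, here is a minimal base R sketch. The data frame `toy` and its values are hypothetical and only stand in for the real data; the point is that counting rows counts trials, while keeping one row per ID first counts people.

```r
# Hypothetical toy data: three participants, several trials each.
toy <- data.frame(
  ID     = c("P1", "P1", "P1", "P2", "P2", "P3"),
  gender = c("k",  "k",  "k",  "m",  "m",  "k")
)

# Counting rows counts trials, not people.
per_trial <- table(toy$gender)

# Keep only the first row of every ID, then count.
one_per_id <- toy[!duplicated(toy$ID), ]
per_person <- table(one_per_id$gender)

print(per_trial)
print(per_person)
```

Any table built on the trial-level rows inherits the first kind of count, which is why the gender totals look wrong.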
Please take a look at the solution I have proposed here How to describe unique values of grouped observations for several vars?.
Update 1
OKAY. Let's try to face your data. You have 54 patients with different IDs.
data_Confidence_in_Action %>% distinct(ID) %>% nrow()
#[1] 54
However, note that one ID appears to be incorrect.
data_Confidence_in_Action %>% distinct(ID) %>%
mutate(lenID = str_length(ID)) %>% filter(lenID!=5)
# A tibble: 1 x 2
# ID lenID
# <chr> <int>
#1 P1419 dots 10
However, we can leave it as it is. Correct it yourself if you have to. However, remember that you have as many as 8 different genders. Be careful because in our country the gender ideology is not well received ;-)
data_Confidence_in_Action %>% distinct(gender)
# A tibble: 8 x 1
# gender
# <chr>
#1 k
#2 kobieta
#3 M
#4 K
#5 m¦Ö+-czyzna
#6 21
#7 m
#8 M¦Ö+-czyzna
This, unfortunately, needs to be fixed. In addition, patient P1440 has their age recorded in the gender column. So what is the gender of P1440?
data_Confidence_in_Action %>% filter(gender==21) %>% distinct(ID, gender, age)
# A tibble: 1 x 3
# ID gender age
# <chr> <chr> <dbl>
#1 P1440 21 NA
data_Confidence_in_Action %>% distinct(ID, gender) %>%
group_by(gender) %>% summarise(n = n())
# A tibble: 8 x 2
# gender n
# <chr> <int>
#1 21 1
#2 k 36
#3 K 3
#4 kobieta 9
#5 m 1
#6 M 1
#7 m¦Ö+-czyzna 2
#8 M¦Ö+-czyzna 1
As you can see, you have more women. So let P1440 be a woman. Is that OK?
Finally, notice that two variables have inconvenient names: Condition (whether a person responded) and Go/Nogo (whether a person should respond).
Let's fix it all in one go.
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
gender = ifelse(str_detect(gender, "^([kK]|21$)"), "k", "m"),  # k, K, kobieta and the stray "21" become "k"
age = ifelse(is.na(age), 21, age)
) %>% rename(Condition=`Condition (whether a person responded)`,
Go.Nogo = `Go/Nogo (whether a person should respond)`)
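When recoding with a regular expression, it is worth checking how each raw label is classified. The sketch below compares a character class, "[k,K,21]", with an anchored pattern, "^([kK]|21$)". The vector `raw` is a hypothetical, ASCII-only stand-in for the labels listed above, and base R `grepl()` is used in place of `str_detect()` so the example needs no packages.

```r
# Hypothetical raw labels, loosely modelled on the values listed above.
raw <- c("k", "kobieta", "K", "21", "m", "M", "mezczyzna", "Mezczyzna")

# A character class matches any ONE of the characters inside it: k , K 2 1
loose <- grepl("[k,K,21]", raw)

# Anchored alternative: the label starts with k or K, or is exactly "21".
strict <- grepl("^([kK]|21$)", raw)

print(data.frame(raw, loose, strict))

# The two agree on these labels but differ on a label that merely
# contains one of the characters.
grepl("[k,K,21]", "m1")      # TRUE
grepl("^([kK]|21$)", "m1")   # FALSE
```

On labels like these both patterns give the same recoding; the anchored form is simply less likely to misfire if new typos appear in the data.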
Finally, let's convert some of the variables from chr to factor, taking care to keep sensible levels. I hope I chose them wisely.
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
ID = ID %>% fct_inorder(),
gender = gender %>% fct_infreq(),
t1.key = t1.key %>% fct_infreq(),
Condition = Condition %>% fct_infreq(),
CR.key = CR.key %>% fct_infreq(),
TASK = TASK %>% fct_infreq(),
Go.Nogo = Go.Nogo %>% fct_infreq(),
difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
)
With the data organized in such a way, let's get to the heart of the problem. What do you really want to analyze? Note that for variables such as TASK, Condition, and t1.key, both valid values occur for every patient.
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
nunique.TASK = length(unique(TASK)),
nunique.Condition = length(unique(Condition)),
nunique.t1.key = length(unique(t1.key))
) %>% distinct(nunique.TASK, nunique.Condition, nunique.t1.key)
# A tibble: 1 x 3
# nunique.TASK nunique.Condition nunique.t1.key
# <int> <int> <int>
#1 2 2 2
However, if we look at the proportions of the occurrence of different values in these variables, they are different in each patient.
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.TASK = sum(TASK=="left")/sum(TASK=="right")) %>%
distinct()
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.Condition = sum(Condition=="NR")/sum(Condition=="R"))%>%
distinct()
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.t1.key = sum(t1.key=="None")/sum(t1.key=="space"))%>%
distinct()
So write clearly what and how you want to summarize because it is not clear to me what you want to get.
Update 2
OKAY. I can see that you are beginning to understand something. Still, I don't know what you want to summarize. Look below. First, let's collect all the code to prepare the data:
library(tidyverse)
library(readxl)
library(tableone)
data_Confidence_in_Action <- read_excel("data_Confidence in Action.xlsx")
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
gender = ifelse(str_detect(gender, "^([kK]|21$)"), "k", "m"),  # k, K, kobieta and the stray "21" become "k"
age = ifelse(is.na(age), 21, age)
) %>% rename(Condition=`Condition (whether a person responded)`,
Go.Nogo = `Go/Nogo (whether a person should respond)`)
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
ID = ID %>% fct_inorder(),
gender = gender %>% fct_infreq(),
t1.key = t1.key %>% fct_infreq(),
Condition = Condition %>% fct_infreq(),
CR.key = CR.key %>% fct_infreq(),
TASK = TASK %>% fct_infreq(),
Go.Nogo = Go.Nogo %>% fct_infreq(),
difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
)
And now the summary. If we do this:
CreateTableOne(
data = data_Confidence_in_Action,
vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = 'gender',
factorVars = c('TASK', 'Condition', 't1.key'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
kableone()
output
| |Overall |k |m |p |test |
|:-----------------------|:------------|:------------|:------------|:------|:----|
|n |41713 |37823 |3890 | | |
|TASK = right (%) |20832 (49.9) |18889 (49.9) |1943 (49.9) |0.992 | |
|Condition = R (%) |20033 (48.0) |18130 (47.9) |1903 (48.9) |0.241 | |
|t1.key = space (%) |20033 (48.0) |18130 (47.9) |1903 (48.9) |0.241 | |
|T1.response (mean (SD)) |0.48 (0.50) |0.48 (0.50) |0.49 (0.50) |0.241 | |
|age (mean (SD)) |20.74 (2.67) |20.75 (2.70) |20.60 (2.33) |0.001 | |
|T1.ACC (mean (SD)) |0.70 (0.46) |0.70 (0.46) |0.73 (0.45) |<0.001 | |
we get a summary for all observations, that is, n == 41713. And since there are many observations for each patient, such a summary is of little use. At least I think so.
However, we can summarize for a few selected patients.
CreateTableOne(
data = data_Confidence_in_Action %>%
filter(ID %in% c('P1323', 'P1403', 'P1404')) %>%
mutate(ID = ID %>% fct_drop()),
vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = c('ID'),
factorVars = c('TASK', 'Condition', 't1.key'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
kableone()
output
| |Overall |P1323 |P1403 |P1404 |p |test |
|:-----------------------|:------------|:------------|:------------|:------------|:------|:----|
|n |2323 |775 |776 |772 | | |
|TASK = right (%) |1164 (50.1) |390 (50.3) |386 (49.7) |388 (50.3) |0.969 | |
|Condition = R (%) |1168 (50.3) |385 (49.7) |435 (56.1) |348 (45.1) |<0.001 | |
|t1.key = space (%) |1168 (50.3) |385 (49.7) |435 (56.1) |348 (45.1) |<0.001 | |
|T1.response (mean (SD)) |0.50 (0.50) |0.50 (0.50) |0.56 (0.50) |0.45 (0.50) |<0.001 | |
|age (mean (SD)) |19.66 (0.94) |19.00 (0.00) |19.00 (0.00) |21.00 (0.00) |<0.001 | |
|T1.ACC (mean (SD)) |0.70 (0.46) |0.67 (0.47) |0.77 (0.42) |0.65 (0.48) |<0.001 | |
This makes more sense now, but is separate for each patient.
Alternatively, you can do this summary without using CreateTableOne, e.g. like this:
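data_Confidence_in_Action %>%
  # one row per patient: age is constant within ID, so min() just picks it
  group_by(gender, ID) %>%
  summarise(age = min(age)) %>%
  # then summarise the per-patient ages within each gender
  group_by(gender) %>%
  summarise(
    n = n(),
    Min = min(age),
    # a third positional argument to quantile() is na.rm, not type,
    # so the default type 7 is used here (the same as IQR())
    Q1 = quantile(age, 1/4),
    mean = mean(age),
    median = median(age),
    Q3 = quantile(age, 3/4),
    Max = max(age),
    IQR = IQR(age),
    Kurt = e1071::kurtosis(age),
    skew = e1071::skewness(age),
    SD = sd(age))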
output
# A tibble: 2 x 12
gender n Min Q1 mean median Q3 Max IQR Kurt skew SD
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 k 49 19 19 20.8 20 21 32 2 7.47 2.79 2.73
2 m 5 19 19 20.6 19 21 25 2 -1.29 0.823 2.61
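The same one-row-per-patient idea can also be fed back into tableone. This is only a sketch under stated assumptions: the data frame `toy` is hypothetical, and only age is shown. The `nonnormal` argument of the print method for TableOne objects asks for median [IQR] instead of mean (SD). A trial-level variable such as T1.ACC would first have to be aggregated per patient, for example as a per-patient mean, and how to do that depends on what you want to report.

```r
library(tableone)

# Hypothetical toy data: age and gender are constant within ID,
# and every ID has three trials.
toy <- data.frame(
  ID     = rep(c("P1", "P2", "P3", "P4", "P5", "P6"), each = 3),
  gender = rep(c("k", "k", "k", "m", "m", "m"), each = 3),
  age    = rep(c(19, 20, 32, 19, 21, 25), each = 3)
)

# One row per patient, so that n counts people rather than trials.
per_id <- toy[!duplicated(toy$ID), ]

tab <- CreateTableOne(
  vars   = "age",
  strata = "gender",
  data   = per_id,
  test   = FALSE
)

# nonnormal switches the listed variables to median [IQR].
print(tab, nonnormal = "age")
```

With the real data, `per_id` would be built from data_Confidence_in_Action in the same way, and n in the resulting table should then add up to 54.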
Think carefully and write down what you really expect. Unless, of course, this topic is still interesting for you.