I am trying to use DataExplorer to help with quick EDA. I like how it shows univariate distributions. Here is a reproducible example.
library(DataExplorer)
library(dplyr)  # provides the %>% pipe

A <- c(rep(c(1, 2, 3, 4, 5), 200))
A <- factor(A)
B <- c(x = rnorm(1000))
C <- c(x = rnorm(1000, mean = 100, sd = 2))
D <- c(x = rnorm(1000, 2, 2))
df <- data.frame(A, B, C, D)

df %>%
  create_report(
    output_file = "trial",
    y = "A",  # to get barplots, QQ plots and scatterplots by grouping variable "A"
    report_title = "trial_EDA",
    config = configure_report(
      add_plot_density = TRUE  # to add density plots to the report
    )
  )
I want to visualize density by grouping variable, "A", as shown in the picture attached.
But I don't know how to use plot density args properly to do this. Also, please suggest other packages to easily navigate through large datasets as a preliminary analysis. Thanks!
You have not specified which variable (B, C or D) the density graph should apply to.
If there is only one, e.g. B, then do it like this:
library(tidyverse)
library(ggpubr)
A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)
df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)
You can also draw a separate plot for each variable and arrange them all in one figure.
pB = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, fill=A)) +
geom_density(alpha=0.2)
pC = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(C, fill=A)) +
geom_density(alpha=0.2)
pD = df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(D, fill=A)) +
geom_density(alpha=0.2)
ggarrange(pB, pC, pD,
labels = c("B", "C", "D"))
And if you don't like the filled areas, you can map A to the line color instead:
df %>% mutate(A = A %>% fct_inorder()) %>%
ggplot(aes(B, color=A)) +
geom_density()
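The three-panel figure built with ggarrange above can also be sketched with facets. This is only an alternative under the same assumptions as the example data: the data is first reshaped to long format, and the names `long` and `p_facets` as well as the seed are made up for illustration.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# Long format: one row per (A, variable, value) combination
long <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val")

# One panel per variable; free scales because B, C and D cover different ranges
p_facets <- ggplot(long, aes(val, fill = A)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~ var, scales = "free")

p_facets
```

With facets there is no need for a separate plot object per column, so the same code should keep working if more numeric columns are added to the `B:D` range.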
Update 1
It is possible to create charts for any number of columns. I will show it to you in the example below. First, we'll do it in a very simple, even trivial way.
library(tidyverse)
df = tibble(
A = rep(c(1,2,3,4,5), 200) %>% factor(),
B = rnorm(1000),
C = rnorm(1000, mean = 100, sd=2),
D = rnorm(1000, 2, 2)
)
fPlot = function(x, group) tibble(x=x, group=group) %>%
ggplot(aes(x, color=group)) +
geom_density()
# select() with a column range replaces the superseded select_at(vars(...))
df %>% select(B:D) %>%
  map(~fPlot(.x, df$A))
As you can see, we created three plots, one each for variables B, C and D.
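The plots produced this way carry no titles, so it is hard to tell which one belongs to which column. A minimal sketch of one way around that, assuming purrr's `imap()` (which hands each column to the function together with its name), is shown here; the name `plots` and the seed are only illustrative.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# Same helper as above: density of x, one line per level of group
fPlot <- function(x, group) tibble(x = x, group = group) %>%
  ggplot(aes(x, color = group)) +
  geom_density()

# imap() passes each column as .x and its name as .y,
# so the column name can be used as the plot title
plots <- df %>%
  select(B:D) %>%
  imap(~ fPlot(.x, df$A) + ggtitle(.y))

plots$C  # show a single plot, picked by column name
```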
The second way is a bit harder to follow, but it gives you some extras, such as a title on each plot and an easy route to per-group statistics.
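# group_map() calls this with the one-row nested tibble (df)
# and the group key (group), a one-row tibble holding var
fPlot2 = function(df, group) df$data[[1]] %>%
  ggplot(aes(val, color=A)) +
  geom_density() +
  ggtitle(group$var)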
df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_map(fPlot2)
Note that your tibble, after df %>% pivot_longer(B:D, names_to = "var", values_to = "val"), looks like this:
# A tibble: 3,000 x 3
A var val
<fct> <chr> <dbl>
1 1 B 1.06
2 1 C 100.
3 1 D 3.54
4 2 B -0.652
5 2 C 100.
6 2 D 1.12
7 3 B 0.652
8 3 C 97.3
9 3 D 3.57
10 4 B -0.0972
# ... with 2,990 more rows
After df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest(), it looks like this:
# A tibble: 3 x 2
# Groups: var [3]
var data
<chr> <list>
1 B <tibble [1,000 x 2]>
2 C <tibble [1,000 x 2]>
3 D <tibble [1,000 x 2]>
As you can see, the data has been collapsed into three nested tibbles, stored in the list column data.
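To see what is inside that list column, one of the nested tibbles can be pulled out and the nesting can be reversed again. This is a small sketch under the same example data; `nested`, `inner_C` and `restored` are illustrative names, not part of the original code.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

nested <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  nest()

# Pull out the inner tibble that belongs to variable "C"
inner_C <- nested %>%
  filter(var == "C") %>%
  pull(data) %>%
  pluck(1)
head(inner_C)  # columns A and val only; var lives in the outer tibble

# unnest() reverses nest() and brings back the long table
restored <- nested %>% unnest(data)
```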
This approach will allow you to easily calculate all statistics for each column separately. Look at this.
fStat = function(df) df$data[[1]] %>%
group_by(A) %>%
summarise(
n = n(),
min = min(val),
mean = mean(val),
max = max(val),
median = median(val),
sd = sd(val),
sw.stat = stats::shapiro.test(val)$statistic,
sw.p = stats::shapiro.test(val)$p.value,
)
df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>%
group_by(var) %>%
nest() %>%
group_modify(~fStat(.x))
Output:
# A tibble: 15 x 10
# Groups: var [3]
var A n min mean max median sd sw.stat sw.p
<chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 B 1 200 -2.14 0.139 3.16 0.153 0.960 0.994 0.561
2 B 2 200 -2.00 0.0185 2.61 0.0162 0.923 0.992 0.373
3 B 3 200 -3.15 0.0245 2.42 0.0718 1.02 0.992 0.347
4 B 4 200 -2.75 0.00112 2.73 -0.00691 1.02 0.993 0.496
5 B 5 200 -3.32 -0.00758 3.23 -0.000105 0.993 0.991 0.250
6 C 1 200 94.6 99.8 104. 99.8 1.97 0.992 0.365
7 C 2 200 94.8 100. 104. 100. 1.85 0.991 0.290
8 C 3 200 94.5 100. 106. 100. 1.94 0.996 0.877
9 C 4 200 94.4 99.9 107. 99.9 1.97 0.993 0.463
10 C 5 200 94.3 99.8 106. 99.8 2.07 0.996 0.887
11 D 1 200 -4.89 1.81 8.11 1.90 2.09 0.995 0.750
12 D 2 200 -5.42 2.15 7.57 2.18 2.14 0.995 0.726
13 D 3 200 -4.38 2.09 7.10 2.02 1.97 0.989 0.111
14 D 4 200 -4.73 2.13 8.98 1.93 1.99 0.989 0.138
15 D 5 200 -2.19 2.24 7.79 2.25 1.87 0.996 0.867
Isn't that cool?
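If the nested tibbles are not needed for anything else, a similar table can probably be had without nest() at all, by grouping the long data on both var and A. This is a simplified sketch, not the approach used above: it computes only a few of the statistics, and `stats_long` and the seed are illustrative.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# One row per (var, A) pair, computed directly on the long table
stats_long <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val") %>%
  group_by(var, A) %>%
  summarise(
    n = n(),
    mean = mean(val),
    sd = sd(val),
    sw.p = shapiro.test(val)$p.value,  # Shapiro-Wilk p-value per group
    .groups = "drop"
  )

stats_long
```

The nested version is still handy when the same per-variable tibble feeds both the plots and the statistics.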