DataExplorer, customize univariate distribution


I am trying to use DataExplorer to help with quick EDA. I like how it shows univariate distributions. Here is a reproducible example.

library(DataExplorer)
library(dplyr)  # provides the %>% pipe

A <- factor(rep(c(1, 2, 3, 4, 5), 200))  # grouping variable with 5 levels
B <- rnorm(1000)
C <- rnorm(1000, mean = 100, sd = 2)
D <- rnorm(1000, mean = 2, sd = 2)
df <- data.frame(A, B, C, D)
df %>%
  create_report(
    output_file = "trial",
    y= "A", #to get barplots, QQ plots and scatterplots by grouping variable "A"
    report_title = "trial_EDA",
    config = configure_report(
      add_plot_density = TRUE  #To add density plots to report
    )
  )

I want to visualize density by grouping variable, "A", as shown in the picture attached.

But I don't know how to use plot_density_args properly to do this. Also, please suggest other packages for quickly exploring large datasets as a preliminary analysis. Thanks!


Solution

  • You have not specified which of the variables B, C or D the density plot should show. If it is only one, e.g. B, do it like this:

    library(tidyverse)
    library(ggpubr)
    
    A <- c(rep(c(1,2,3,4,5), 200))
    A<- factor(A)
    B <- c(x=rnorm(1000))
    C <- c(x= rnorm(1000, mean = 100, sd=2))
    D <- c(x= rnorm(1000, 2, 2))
    df<- data.frame(A, B, C, D)
    
    df %>% mutate(A = A %>% fct_inorder()) %>% 
      ggplot(aes(B, fill=A)) +
      geom_density(alpha=0.2)
    
    

    You can also plot each variable separately and arrange the panels in a single figure.

    pB = df %>% mutate(A = A %>% fct_inorder()) %>% 
      ggplot(aes(B, fill=A)) +
      geom_density(alpha=0.2)
    pC = df %>% mutate(A = A %>% fct_inorder()) %>% 
      ggplot(aes(C, fill=A)) +
      geom_density(alpha=0.2)
    
    pD = df %>% mutate(A = A %>% fct_inorder()) %>% 
      ggplot(aes(D, fill=A)) +
      geom_density(alpha=0.2)
    
    ggarrange(pB, pC, pD, 
              labels = c("B", "C", "D"))
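
    As an alternative to building three plots and arranging them, the same figure can be sketched with facets. This is a minimal sketch, not part of the original answer: it reshapes the data with pivot_longer() and lets facet_wrap() draw one panel per variable. The names df_long and p_facets and the fixed seed are assumptions made for illustration.

    ```r
    library(tidyverse)

    set.seed(1)  # assumption: fixed seed so the sketch is reproducible
    df <- tibble(
      A = factor(rep(1:5, 200)),
      B = rnorm(1000),
      C = rnorm(1000, mean = 100, sd = 2),
      D = rnorm(1000, mean = 2, sd = 2)
    )

    # Reshape to long format: one row per (A, variable, value) triple
    df_long <- df %>%
      pivot_longer(B:D, names_to = "var", values_to = "val")

    # One panel per variable; free scales because B, C and D cover different ranges
    p_facets <- ggplot(df_long, aes(val, fill = A)) +
      geom_density(alpha = 0.2) +
      facet_wrap(~ var, scales = "free")

    p_facets
    ```

    Facets avoid the extra ggpubr dependency and keep a single shared legend, at the cost of less control over each individual panel.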
    

    And if you prefer outlines to filled areas, map A to color instead of fill:

    df %>% mutate(A = A %>% fct_inorder()) %>% 
      ggplot(aes(B, color=A)) +
      geom_density()
    

    Update 1

    You can create plots for any number of columns, as the example below shows. First, we'll do it in a very simple, even trivial way.

    library(tidyverse)
    df = tibble(
      A = rep(c(1,2,3,4,5), 200) %>% factor(),
      B = rnorm(1000),
      C = rnorm(1000, mean = 100, sd=2),
      D = rnorm(1000, 2, 2)
    )
    
    fPlot = function(x, group) tibble(x=x, group=group) %>% 
      ggplot(aes(x, color=group)) +
        geom_density()
    
    # select_at() is superseded in current dplyr; select() does the same here
    df %>% select(B:D) %>%
        map(~fPlot(., df$A))
    

    As you can see, we created three plots for variables B, C and D.
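
    The plots produced this way carry no titles, so it can be hard to tell which variable each one shows. A possible refinement, sketched under the assumption that a titled variant of fPlot is acceptable, is to use purrr::imap(), which passes both the column and its name. The helper fPlotNamed and the object plots are hypothetical names introduced only for this sketch.

    ```r
    library(tidyverse)

    set.seed(1)  # assumption: fixed seed so the sketch is reproducible
    df <- tibble(
      A = factor(rep(1:5, 200)),
      B = rnorm(1000),
      C = rnorm(1000, mean = 100, sd = 2),
      D = rnorm(1000, mean = 2, sd = 2)
    )

    # Hypothetical helper: like fPlot, but also takes the column name for the title
    fPlotNamed <- function(x, name, group) {
      tibble(x = x, group = group) %>%
        ggplot(aes(x, color = group)) +
        geom_density() +
        ggtitle(name)
    }

    # imap() hands each column to the function as .x and its name as .y
    plots <- df %>%
      select(B:D) %>%
      imap(~ fPlotNamed(.x, .y, df$A))

    names(plots)  # the list keeps the column names
    plots$C       # print a single plot by name
    ```

    Because the result is a named list, individual plots can be printed, saved or passed on to an arranging function later.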

    The second way is a bit harder to understand, but it brings some extra benefits.

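    # group_map() passes the rows of one group as df and a one-row tibble with the key as group
    fPlot2 = function(df, group) df$data[[1]] %>%
      ggplot(aes(val, color=A)) +
      geom_density() +
      ggtitle(group$var)  # extract the key as a string for the title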
    
    df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
      group_by(var) %>% 
      nest() %>% 
      group_map(fPlot2)
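
    The two arguments of fPlot2 are filled in by group_map(): the first receives the rows of the current group without the grouping column, the second a one-row tibble holding the group key. The sketch below only inspects what is passed in; nested and args_seen are hypothetical names, and the seed is an assumption for reproducibility.

    ```r
    library(tidyverse)

    set.seed(1)  # assumption: fixed seed so the sketch is reproducible
    df <- tibble(
      A = factor(rep(1:5, 200)),
      B = rnorm(1000),
      C = rnorm(1000, mean = 100, sd = 2),
      D = rnorm(1000, mean = 2, sd = 2)
    )

    nested <- df %>%
      pivot_longer(B:D, names_to = "var", values_to = "val") %>%
      group_by(var) %>%
      nest()

    # .x: the group's rows (here a single row with the list column data)
    # .y: a one-row tibble with the grouping key var
    args_seen <- nested %>%
      group_map(~ list(x_cols = names(.x), x_rows = nrow(.x), key = .y$var))

    args_seen[[1]]
    ```

    This is why fPlot2 reaches the actual values through df$data[[1]], and why the key has to be pulled out of the second argument before it is used as a title.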
    

    Note that your tibble after df %>% pivot_longer(B:D, names_to = "var", values_to = "val") looks like this.

    # A tibble: 3,000 x 3
       A     var        val
       <fct> <chr>    <dbl>
     1 1     B       1.06  
     2 1     C     100.    
     3 1     D       3.54  
     4 2     B      -0.652 
     5 2     C     100.    
     6 2     D       1.12  
     7 3     B       0.652 
     8 3     C      97.3   
     9 3     D       3.57  
    10 4     B      -0.0972
    # ... with 2,990 more rows
    

    After df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest(), it looks like this:

    # A tibble: 3 x 2
    # Groups:   var [3]
      var   data                
      <chr> <list>              
    1 B     <tibble [1,000 x 2]>
    2 C     <tibble [1,000 x 2]>
    3 D     <tibble [1,000 x 2]>
    

    As you can see, the data has been collapsed into three inner tibbles stored in the list column data. This approach lets you easily calculate summary statistics for each column separately. Look at this.

    fStat = function(df) df$data[[1]] %>% 
      group_by(A) %>% 
      summarise(
        n = n(),
        min = min(val),
        mean = mean(val),
        max = max(val),
        median = median(val),
        sd = sd(val),
        sw.stat = stats::shapiro.test(val)$statistic,
        sw.p = stats::shapiro.test(val)$p.value,
      )
    
    df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
      group_by(var) %>% 
      nest() %>% 
      group_modify(~fStat(.x))
    

    Output:

    # A tibble: 15 x 10
    # Groups:   var [3]
       var   A         n   min      mean    max     median    sd sw.stat  sw.p
       <chr> <fct> <int> <dbl>     <dbl>  <dbl>      <dbl> <dbl>   <dbl> <dbl>
     1 B     1       200 -2.14   0.139     3.16   0.153    0.960   0.994 0.561
     2 B     2       200 -2.00   0.0185    2.61   0.0162   0.923   0.992 0.373
     3 B     3       200 -3.15   0.0245    2.42   0.0718   1.02    0.992 0.347
     4 B     4       200 -2.75   0.00112   2.73  -0.00691  1.02    0.993 0.496
     5 B     5       200 -3.32  -0.00758   3.23  -0.000105 0.993   0.991 0.250
     6 C     1       200 94.6   99.8     104.    99.8      1.97    0.992 0.365
     7 C     2       200 94.8  100.      104.   100.       1.85    0.991 0.290
     8 C     3       200 94.5  100.      106.   100.       1.94    0.996 0.877
     9 C     4       200 94.4   99.9     107.    99.9      1.97    0.993 0.463
    10 C     5       200 94.3   99.8     106.    99.8      2.07    0.996 0.887
    11 D     1       200 -4.89   1.81      8.11   1.90     2.09    0.995 0.750
    12 D     2       200 -5.42   2.15      7.57   2.18     2.14    0.995 0.726
    13 D     3       200 -4.38   2.09      7.10   2.02     1.97    0.989 0.111
    14 D     4       200 -4.73   2.13      8.98   1.93     1.99    0.989 0.138
    15 D     5       200 -2.19   2.24      7.79   2.25     1.87    0.996 0.867
    

    Isn't that cool?
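
    If only the summary table is needed, nesting is optional. As a simpler variant (a different technique from the nest() and group_modify() approach above, shown only as a sketch), you can group the long data by both var and A and summarise directly. The name stats_flat and the fixed seed are assumptions for illustration, and only a subset of the statistics is computed.

    ```r
    library(tidyverse)

    set.seed(1)  # assumption: fixed seed so the sketch is reproducible
    df <- tibble(
      A = factor(rep(1:5, 200)),
      B = rnorm(1000),
      C = rnorm(1000, mean = 100, sd = 2),
      D = rnorm(1000, mean = 2, sd = 2)
    )

    stats_flat <- df %>%
      pivot_longer(B:D, names_to = "var", values_to = "val") %>%
      group_by(var, A) %>%          # one group per variable and level of A
      summarise(
        n = n(),
        mean = mean(val),
        sd = sd(val),
        sw.p = stats::shapiro.test(val)$p.value,
        .groups = "drop"            # return an ungrouped tibble
      )

    stats_flat
    ```

    The nested version remains useful when each variable needs a more complex per-group computation, such as fitting a model or drawing a plot.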