r dataframe function recursion data-quality

Recursive method to calculate the percentage of repeated values for each column of my df in R


I need to use lapply/sapply or another recursive method on my real df to calculate how many repeated values each column/variable has.

Here is a small example that reproduces my case:

library(dplyr)

df <- data.frame(
var1 = c(1,2,3,4,5,6,7,8,9,10 ),
var2 = c(1,1,2,3,4,5,6,7,9,10 ),
var3 = c(1,1,1,2,3,4,5,6,7,8 ),
var4 = c(2,2,1,1,2,1,1,2,1,2 ),
var5 = c(1,1,1,1,1,4,5,5,6,7 ),
var6 = c(4,4,4,5,5,5,5,5,5,5 )   
)

My dataset has nrow(df) rows, and I need to obtain the percentage of repeated values for each column. My real df has a lot of columns, so I need to do this recursively. I tried to use lapply/sapply, but it didn't work:

# create function that is used in lapply
perc_repeated <- function(variables){
  
  paste(round((sum(table(df$variables)-1) / nrow(df))*100,2),"%")
  
}

perce_repeated_values <- lapply(df, perc_repeated) 
perce_repeated_values

How can I do this efficiently if my data frame grows to something like 700 columns, applying a function to each column and collecting the results in a data frame ordered from largest to smallest (i.e. from a variable with 100% repeated values down to one with 0%)? Something like:

df_repeated

variable      perc_repeated_values
var6                    80%
var4                    80%
var5                    50%
var3                    20%
var2                    10%
var1                     0%


Solution

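  • This can easily be done with dplyr::summarise() and across(). For each column, the percentage of repeated values is 100 * (1 - n_distinct(x) / n()), i.e. the share of rows that are not the first occurrence of their value: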

    library(tidyverse)
    
    df <- data.frame(
      var1 = c(1,2,3,4,5,6,7,8,9,10 ),
      var2 = c(1,1,2,3,4,5,6,7,9,10 ),
      var3 = c(1,1,1,2,3,4,5,6,7,8 ),
      var4 = c(2,2,1,1,2,1,1,2,1,2 ),
      var5 = c(1,1,1,1,1,4,5,5,6,7 ),
      var6 = c(4,4,4,5,5,5,5,5,5,5 )   
    )
    
    df %>% 
      summarise(across(everything(),
                       ~100 * (1 - n_distinct(.)/n()))) %>% 
      pivot_longer(everything(), 
                   names_to = "var", 
                   values_to = "percent_repeated") %>% 
      arrange(desc(percent_repeated))
    #> # A tibble: 6 x 2
    #>   var   percent_repeated
    #>   <chr>            <dbl>
    #> 1 var4                80
    #> 2 var6                80
    #> 3 var5                50
    #> 4 var3                20
    #> 5 var2                10
    #> 6 var1                 0
    

    Created on 2022-01-09 by the reprex package (v2.0.1)
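
For comparison, here is a minimal base R sketch of the sapply approach attempted in the question. It is an illustration, not part of the original answer. The attempt in the question does not work because lapply() passes each column to the function as a vector, while df$variables looks up a column literally named "variables", which does not exist, so the column that was passed in is never used. The version below works on the vector directly. It assumes that "repeated" means "not the first occurrence of a value", which is the same definition as the n_distinct() formula above. The names perc and df_repeated are only illustrative.

```r
# Same example data as in the question
df <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  var2 = c(1, 1, 2, 3, 4, 5, 6, 7, 9, 10),
  var3 = c(1, 1, 1, 2, 3, 4, 5, 6, 7, 8),
  var4 = c(2, 2, 1, 1, 2, 1, 1, 2, 1, 2),
  var5 = c(1, 1, 1, 1, 1, 4, 5, 5, 6, 7),
  var6 = c(4, 4, 4, 5, 5, 5, 5, 5, 5, 5)
)

# Takes a column vector (not a column name) and returns the percentage of
# values that repeat an earlier value in the same vector
perc_repeated <- function(x) {
  100 * sum(duplicated(x)) / length(x)
}

# sapply() calls the function once per column and returns a named numeric vector
perc <- sapply(df, perc_repeated)

# Collect the results in a data frame, one row per variable
df_repeated <- data.frame(
  variable = names(perc),
  perc_repeated_values = unname(perc)
)

# Sort from the most repeated to the least repeated variable
df_repeated <- df_repeated[order(df_repeated$perc_repeated_values,
                                 decreasing = TRUE), ]
rownames(df_repeated) <- NULL
df_repeated
```

Keeping perc_repeated_values numeric until after sorting matters: pasting "%" onto the values first would turn them into strings, and the ordering would then be alphabetical rather than numeric. If a formatted column is wanted, it can be added at the end, for example with paste0(round(df_repeated$perc_repeated_values, 2), "%").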