I need to use lapply/sapply, or a similar approach that applies a function to every column, on my real df to calculate how many repeated values each column/variable has.
Here is a small example that reproduces my case:
library(dplyr)
df <- data.frame(
var1 = c(1,2,3,4,5,6,7,8,9,10 ),
var2 = c(1,1,2,3,4,5,6,7,9,10 ),
var3 = c(1,1,1,2,3,4,5,6,7,8 ),
var4 = c(2,2,1,1,2,1,1,2,1,2 ),
var5 = c(1,1,1,1,1,4,5,5,6,7 ),
var6 = c(4,4,4,5,5,5,5,5,5,5 )
)
I have nrow(df) rows in my dataset, and now I need to obtain the percentage of repeated values for each column. My real df has a lot of columns, so I need to do this for every column automatically. I tried to use lapply/sapply, but it didn't work:
# create function that is used in lapply
perc_repeated <- function(variables){
paste(round((sum(table(df$variables)-1) / nrow(df))*100,2),"%")
}
perce_repeated_values <- lapply(df, perc_repeated)
perce_repeated_values
How can I do this efficiently if my data frame grows to something like 700 columns, applying a function to each column and collecting the results in a data frame ordered from largest to smallest (e.g. from a variable with 100% repeated values down to one with 0%)? Something like:
df_repeated
variable perc_repeated_values
var6 80%
var4 80%
var5 50%
var3 20%
var2 20%
var1 0%
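This can easily be done with dplyr::summarise() and across(): for each column, the percentage of repeated values is 100 times (1 minus the number of distinct values divided by the number of rows).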
library(tidyverse)
df <- data.frame(
var1 = c(1,2,3,4,5,6,7,8,9,10 ),
var2 = c(1,1,2,3,4,5,6,7,9,10 ),
var3 = c(1,1,1,2,3,4,5,6,7,8 ),
var4 = c(2,2,1,1,2,1,1,2,1,2 ),
var5 = c(1,1,1,1,1,4,5,5,6,7 ),
var6 = c(4,4,4,5,5,5,5,5,5,5 )
)
df %>%
  summarise(across(everything(),
                   ~ 100 * (1 - n_distinct(.) / n()))) %>%
  pivot_longer(everything(),
               names_to = "var",
               values_to = "percent_repeated") %>%
  arrange(desc(percent_repeated))
#> # A tibble: 6 x 2
#> var percent_repeated
#> <chr> <dbl>
#> 1 var4 80
#> 2 var6 80
#> 3 var5 50
#> 4 var3 20
#> 5 var2 10
#> 6 var1 0
Created on 2022-01-09 by the reprex package (v2.0.1)
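Since the question asks specifically about lapply/sapply, here is a minimal base R sketch of the same calculation. The attempt in the question fails because `df$variables` looks for a column literally named "variables", which does not exist and so yields NULL; `sapply()` already passes each column to the function as a vector, so the argument can be used directly. The helper name `perc_repeated` and the output column names are taken from the question. The rounding to two decimals and the use of `unique()` are my assumptions about what is wanted, not the only way to do it.

```r
# Example data copied from the question
df <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  var2 = c(1, 1, 2, 3, 4, 5, 6, 7, 9, 10),
  var3 = c(1, 1, 1, 2, 3, 4, 5, 6, 7, 8),
  var4 = c(2, 2, 1, 1, 2, 1, 1, 2, 1, 2),
  var5 = c(1, 1, 1, 1, 1, 4, 5, 5, 6, 7),
  var6 = c(4, 4, 4, 5, 5, 5, 5, 5, 5, 5)
)

# x is the column itself (a vector), not a column name.
# Percentage of values that duplicate another value in the same column,
# rounded to two decimals (assumption, mirroring the question's round(..., 2)).
perc_repeated <- function(x) {
  round(100 * (1 - length(unique(x)) / length(x)), 2)
}

# sapply() returns a named numeric vector, one element per column
perc <- sapply(df, perc_repeated)

# Collect the results in a data frame and sort from largest to smallest
df_repeated <- data.frame(
  variable = names(perc),
  perc_repeated_values = unname(perc)
)
df_repeated <- df_repeated[order(-df_repeated$perc_repeated_values), ]
rownames(df_repeated) <- NULL

df_repeated
```

This keeps the percentages numeric so they can be sorted; if a "%" label is needed, it can be pasted on after sorting.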