rdataframeduplicatesdata.table

R identify columns having the same value for the entire dataframe


I am trying to identify the columns in data tables where all of the entries in each column are the same. The challenge is that the value may be different classes within and across the different tables I have. For example the column may have all NA's or all 0's (or a different numeric value) or all the same character "E" or the same character string, or the same logic value. At this point I do not need to perform an operation on the columns with repeated values, I just identify the column names.

Below I have included code to produce a toy data.table (does not need to be a data.table solution) and methods I've come up with to identify the columns with all NA, all 0's, and all 1's. I'm wondering if there is a more generic solution to this rather than having to have code specific to each duplication type.

library(data.table)
library(random)
AlphaNum=randomStrings(n=10, len=3, digits=T, upperalpha=T, loweralpha=TRUE, unique=F, check=T)
tat <- data.table(Nums=1:10, blankr1=rep(NA,10), Leters=LETTERS[1:10], niner=rep(999,10), blankr2=rep('',10), Er=LETTERS[5],
AlphaNum=AlphaNum,AlphaNumr=rep('9e4UU',10),zeror=rep(0,10), oner=rep(1,10))

# finds where all are NA's
names(tat[,.SD,.SDcols=colSums(is.na(tat))==tat[,.N]])

# finds where all are 0's
numcols <- tat[,lapply(tat, is.numeric)==T, with=F]
names(numcols[,which(tat[,lapply(.SD, function(x) sum(x,na.rm=T)), .SDcols=names(numcols)]==0),with=F])

# finds where all are 1's
numcols <- tat[,lapply(tat, is.numeric)==T, with=F]
names(numcols[,which(tat[,lapply(.SD, function(x) sum(x,na.rm=T)), .SDcols=names(numcols)]==tat[,.N]),with=F])

Solution

  • Instead of checking against particular values, we can just look at the number of unique values. This will work for (almost?) any kind of data. We can easily check the number of unique values against 1, across all columns like so:

    sapply(tat, \(x) length(unique(x)) == 1)
    

    To get the names of the columns, we can therefore do:

    names(tat)[sapply(tat, \(x) length(unique(x)) == 1)]
    

    There are many similar ways of doing the same thing, like e.g. this tidyverse version:

    names(tat)[map_int(tat, n_distinct) == 1]