I am trying to load into r and wrangle data from the CPS (Current Population Survey) which can be downloaded at this link. There is an ostensible codebook for the information on variables and the ranges of each variable as coded in files such as those at this link and this link (Both examples are for the 2004 dataset). However, when I try to load the data file into r, I encounter issues where the variables seem to match up but there are gaps. If I open the 2004 file with the following code:
library(data.table)
cps_raw <- read.delim("cpsmay04.dat", header = FALSE, sep = "")
This gives me the following output:
which differs from when I change the separator to be a full space:
library(data.table)
cps_raw <- read.delim("cpsmay04.dat", header = FALSE, sep = " ")
I am inclined to think the first one is correct, though I'm not sure it gets all the columns correct, since the spaces between values are different. The codebook does not seem to indicate how the data should be loaded in, and I am only used to working with .csv, .dta, .xlsx or .sav files. If the first option is correct, how do I know which variable is which? Do I simply index by the codebook? If so, is there a more efficient way to rename the variables throughout the dataset when loading it into r? Am I loading it incorrectly such that the variable names are not loaded, or must this be done manually? If it must be done manually, is there a way to only load part of the dataframe (certain columns) to not have to deal with many variables that are not necessary, since there are so many?
The following code was tested with the data in the question link, no need for SPSS , SAS or STATA files.
The main problem is to parse the .ddf file, which is not that difficult considering that the data width lines always begin with the character "D"
. Filter those lines, split by spaces and coerce the two last columns to integer. Then use column "SIZE"
as column widths.
process_ddf <- function(file) {
ddf <- readLines(file)
i <- grep("^D ", ddf)
mat <- ddf[i] |>
strsplit(" +") |>
sapply(`[`, -1L) |>
t()
out <- mat |>
as.data.frame() |>
setNames(c("DATA", "SIZE", "BEGIN"))
out[-1L] <- lapply(out[-1L], as.integer)
if(anyNA(out[-1L])) {
mat[, -1L] <- apply(mat[, -1L], 2L, \(x) gsub("\\D", "", x))
out[, -1L] <- apply(mat[, -1L], 2L, as.integer)
}
out
}
data_file <- "cpsmay04.dat"
ddf_link <- "http://data.nber.org/cps/cpsmay04.ddf"
d <- process_ddf(ddf_link)
# some columns are fillers, need to make unique col names
filler <- grep("FILLER", d$DATA)
padding <- grep("PADDING", d$DATA)
d$DATA[filler] <- paste(d$DATA, filler, sep = "_")
d$DATA[padding] <- paste(d$DATA, padding, sep = "_")
# now read in the data
# read.fwf is painfully slow, use readr::read_fwf
library(readr)
cpsmay04 <- read_fwf(data_file, fwf_widths(d$SIZE, d$DATA))
# optional
spc <- spec(cpsmay04)
spc