rimportcdcfixed-width

How do I import a file into r with extension .DUSMCPUB?


I’m trying to import the Mortality Multiple Cause Files from the National Center for Health Statistics, located at this link:

https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Downloadable

link to image of where to find file on NCHS website

The files have an extension .DUSMCPUB (e.g., the file for 2020 is called "VS20MORT.DUSMCPUB_r20220105”). How do I import such a file? I’m not familiar with the extension.

I have tried to import with the following code, but it causes my R program to terminate. Can you please provide me with a suggestion on how to import these types of files?

VS20MORT <- read_delim("VS20MORT.DUSMCPUB_r20220105")

Solution

  • Thanks @Mel G for sharing this approach. When I tried to run it, I realized that the mortality file includes a few new variables as of 2020 (namely decedent’s occupation and industry). Here’s a slight variation that includes the new variables.

    # Install and load necessary packages
    # install.packages("sqldf") # Used to read in DUSMCPUB file
    # install.packages("dplyr") # Used for tidy data management
    library(sqldf)
    library(dplyr)
    
    #Increase memory limit to make space for large file
    # memory.limit()
    memory.limit(size=20000)
    
    # Create dataframe containing variables for column width, name, and end position
    columns <- data.frame(widths=c(19,1,40,2,1,1,2,2,1,4,1,2,2,2,2,1,1,1,16,4,1,1,1,
                                   1,34,1,1,4,3,1,3,3,2,1,2,7,7,7,7,7,7,7,7,7,7,7,7,
                                   7,7,7,7,7,7,7,7,36,2,1,5,5,5,5,5,5,5,5,5,5,5,5,5,
                                   5,5,5,5,5,5,5,1,2,1,1,1,1,33,3,1,1,2,315,4,2,4,2))
    columns$names <- c("blank1", # tape locations 1-19
                       "Resident_Status_US",  # tape location 20
                       "blank2", 
                       "Education_1989",
                       "Education_2003",
                       "Education_flag", 
                       "Month_of_Death", 
                       "blank3",
                       "Sex", 
                       "DetailAge",
                       "Age_Substitution_Flag", 
                       "Age_Recode_52",
                       "Age_Recode_27", 
                       "Age_Recode_12",
                       "Infant_Age_Recode_22",
                       "Place_of_Death_and_Status",
                       "Marital_Status",
                       "Day_of_Week_of_Death",
                       "blank4",
                       "Current_Data_Year",
                       "Injury_at_Work",
                       "Manner_of_Death",
                       "Method_of_Disposition",
                       "Autopsy",
                       "blank5",
                       "Activity_Code",
                       "Place_of_Injury",
                       "ICD_Code_10",
                       "Cause_Recode_358",
                       "blank6",
                       "Cause_Recode_113",
                       "Infant_Cause_Recode_130",
                       "Cause_Recode_39",
                       "blank7",
                       "Number_Entity_Axis_Conditions",
                       "Condition_1EA", "Condition_2EA", "Condition_3EA", "Condition_4EA", "Condition_5EA",
                       "Condition_6EA", "Condition_7EA", "Condition_8EA", "Condition_9EA", "Condition_10EA",
                       "Condition_11EA", "Condition_12EA", "Condition_13EA", "Condition_14EA", "Condition_15EA",
                       "Condition_16EA", "Condition_17EA", "Condition_18EA", "Condition_19EA", "Condition_20EA",
                       "blank8",
                       "Number_Record_Axis_Conditions",
                       "blank9",
                       "Condition_1RA", "Condition_2RA", "Condition_3RA", "Condition_4RA", "Condition_5RA",
                       "Condition_6RA", "Condition_7RA", "Condition_8RA", "Condition_9RA", "Condition_10RA",
                       "Condition_11RA", "Condition_12RA", "Condition_13RA", "Condition_14RA", "Condition_15RA",
                       "Condition_16RA", "Condition_17RA", "Condition_18RA", "Condition_19RA", "Condition_20RA",
                       "blank10",
                       "Race",
                       "Bridged_Race_Flag",
                       "Race_Imputation_Flag",
                       "Race_Recode_3",
                       "Race_Recode_5",
                       "blank11",
                       "Hispanic_Origin",
                       "blank12",
                       "Hispanic_Origin_9_Race_Recode",
                       "Race_Recode_40",
                       "blank13",
                       "CensusOcc",
                       "Occ_26",
                       "CensusInd",
                       "Ind_23")
    
    # Read in file using parameters from 'columns' dataframe
    mort2020<- read.fwf("VS20MORT.DUSMCPUB_r20220105", widths=columns$widths, stringsAsFactors=F)
    # Attach column names to variables
    colnames(mort2020) <- columns$names
    
    # Remove blank variables
    mort2020x <- mort2020 %>% dplyr::select(-starts_with("blank"))
    

    Alternatively, it looks like the files are published for most years in a CSV format here: https://www.nber.org/research/data/mortality-data-vital-statistics-nchs-multiple-cause-death-data. 2020 isn’t up yet, but for other years, it can be much faster to read a CSV into R than to use read.fwf.