rimportfixed-widthvariable-length

Importing Variable length text files per individual into R


I would like to import a very large text file as a dataframe into R. The file is produced by the "National Institute for Health and Disability Insurance" in Belgium. It contains individual data on health professionals in Belgium.

It's a fixed width text file, and I usualy import that kind of data with "read_fwf". The problem is that the structure of the file is a bit tricky :

Here's a fictive exemple of the structure, very simplified. The "prefixed part" is the 0000xxx part. The 5 first characters indicates the individual. The 2 last characters of the "prefixed part" specifies the "type" of the row. Each row contains data specific to that "type", divided into variables (the width structure of the variables is indicated in comment) :

0000111abaabacd
0000112abbbba
0000113ccaxyzzghj5
0000211acatbacz
0000212zbabba
0000311tyyyuacd
0000312xbbiop
0000411pkggbacz

#type 11 has 3 variables. Width for each : 2,3,3
#type 12 has 2 variables. Width for each : 3,3
#type 13 has 4 variables. Width for each : 3,3,1,4

So, what i'd like to do is to import into a dataframe in R each individual into one unique row, with all the variables of each type into separated columns. If the variables are missing (because there is no type/row), i'd like indicate empty data (NA) for each variable.

id     var1  var2  var3  var4  var5  var6  var7  var8  var9
00001    ab   aab   acd   abb   bba   cca   xyz     z  ghj5
00002    ac   atb   acz   zba   bba    NA    NA    NA    NA  
00003    ty   yyu   acd   xbb   iop    NA    NA    NA    NA
00004    pk   ggb   acz    NA    NA    NA    NA    NA    NA

I'm a bit lost, I don't know where to start, because the structure is really not usual for me. Do you have ideas or advices ?

Thank you very much !


Solution

  • Assuming the input shown reproducibly in the Note at the end read it as 3 fields and then reshape to wide form with one column per type. Finally separate the fields in each such column.

    library(dplyr)
    library(readr)
    library(tidyr)
    
    cols <- fwf_cols(id = 5, type = 2, misc = NA)
    
    # DF <- read_fwf("myfile", cols)
    DF <- read_fwf(Lines, cols)
    
    DF %>%
      pivot_wider(names_from = "type", values_from = "misc") %>%
      separate(`11`, c("var1", "var2", "var3"), cumsum(c(2, 3, 2))) %>%
      separate(`12`, c("var4", "var5"), cumsum(c(3, 3))) %>%
      separate(`13`, c("var6", "var7", "var8", "var9"), cumsum(c(3, 3, 1, 4)))
    

    giving:

    # A tibble: 4 x 10
      id    var1  var2  var3  var4  var5  var6  var7  var8  var9 
      <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    1 00001 ab    aab   ac    abb   bba   cca   xyz   z     ghj5 
    2 00002 ac    atb   ac    zba   bba   <NA>  <NA>  <NA>  <NA> 
    3 00003 ty    yyu   ac    xbb   iop   <NA>  <NA>  <NA>  <NA> 
    4 00004 pk    ggb   ac    <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
    

    Note

    Lines <- "0000111abaabacd
    0000112abbbba
    0000113ccaxyzzghj5
    0000211acatbacz
    0000212zbabba
    0000311tyyyuacd
    0000312xbbiop
    0000411pkggbacz"