rdataframedata-manipulation

Separate one column into multiple in R


Starting in row 2, column 2, I need the numerical values separated out into the remaining columns (2:7)

This is the data

structure(list(State = c("State", "Alabama ", "Alaska ", "Arizona ", 
"Arkansas ", "California "), `Number of farms 2022` = c("number", 
"37,400    37,100    8,600     8,600      230       232", "1,200     1,200      870       870      725       725", 
"16,700    16,400   25,500    25,000    1,527     1,524", "37,800    37,400   13,700    13,700      362       366", 
"63,100    62,900   24,200    23,800      384       378"), `Number of farms 2023` = c("number", 
NA, NA, NA, NA, NA), `Land in farms 2022` = c("1,000 acres", 
NA, NA, NA, NA, NA), `Land in farms 2023` = c("1,000 acres", 
NA, NA, NA, NA, NA), `Average farm size 2022` = c("acres", NA, 
NA, NA, NA, NA), `Average farm size 2023` = c("acres", NA, NA, 
NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")

I want it to look something like this:

State | Number of farms 2022 | Number of farms 2023 | Land in farms 2022 | Land in farms 2023 | Average farm size 2022 | Average farm size 2023
State | number | number | 1,000 acres | 1,000 acres | acres | acres
Alabama | 37,400  |  37,100  |  8,600   |  8,600   |   230   |    232

Since I have multiple files for different years, I would need this code to be easily applicable to all of them.

I have seen codes like this

library(dplyr)
library(tidyr)
df %>% separate(player, c('First', 'Last'))

but I'm not sure how to make it applicable to my data frame since the columns are already named.

Not sure if this helps, but the original data comes from the USDA website dated Feb 16, 2024: https://usda.library.cornell.edu/concern/publications/5712m6524

Then I load it in like this

library(tidyr)
library(dplyr)
library(splitstackshape)

df = read.delim('file.txt', header = F, stringsAsFactors = F)

df=df[106:167,]
df=data.frame(df)

df = cSplit(df, "df",":")

And do a ton of processing from there. I need the text file because I'm creating a time-series and USDA's ZIP files only go back to 2002.


Solution

  • Since this is about data cleaning/preparing read_fwf might be the right tool for the job

    library(readr)
    library(dplyr)
    
    read_fwf(file = "fnlo0224.txt", 
             col_positions = fwf_positions(c(1,  21, 30, 40, 49, 59, 68), 
                                           c(20, 29, 39, 48, 58, 67, 77)),
             skip_empty_rows = TRUE)[109:164,] %>% 
      rename_with(~ c("state", "nfarms2022", "nfarms2023", 
                      "landinfarms2022", "landinfarms2023", "avgfarmsize2022", 
                      "avgfarmsize2023")) %>% 
      mutate(state = trimws(sub("\\..*:", "", state)), 
             across(-state, ~ as.numeric(gsub(",", "", .x)))) %>% 
      filter(state != ":")
    
    # A tibble: 50 × 7
       state   nfarms2022 nfarms2023 landinfarms2022 landinfarms2023 avgfarmsize2022
       <chr>        <dbl>      <dbl>           <dbl>           <dbl>           <dbl>
     1 Alabama      37400      37100            8600            8600             230
     2 Alaska        1200       1200             870             870             725
     3 Arizona      16700      16400           25500           25000            1527
     4 Arkans…      37800      37400           13700           13700             362
     5 Califo…      63100      62900           24200           23800             384
     6 Colora…      36100      35900           30200           30000             837
     7 Connec…       5100       5100             370             370              73
     8 Delawa…       2150       2150             520             520             242
     9 Florida      44700      44400            9700            9700             217
    10 Georgia      39200      39000           10000           10000             255
    # ℹ 40 more rows
    # ℹ 1 more variable: avgfarmsize2023 <dbl>
    # ℹ Use `print(n = ...)` to see more rows
    

    If you have multiple files you can simply cat them together in a shell, import with read_fwf and filter by row number.