Starting in row 2, column 2, I need the numerical values separated out into the remaining columns (2:7).
This is the data:
structure(list(State = c("State", "Alabama ", "Alaska ", "Arizona ",
"Arkansas ", "California "), `Number of farms 2022` = c("number",
"37,400 37,100 8,600 8,600 230 232", "1,200 1,200 870 870 725 725",
"16,700 16,400 25,500 25,000 1,527 1,524", "37,800 37,400 13,700 13,700 362 366",
"63,100 62,900 24,200 23,800 384 378"), `Number of farms 2023` = c("number",
NA, NA, NA, NA, NA), `Land in farms 2022` = c("1,000 acres",
NA, NA, NA, NA, NA), `Land in farms 2023` = c("1,000 acres",
NA, NA, NA, NA, NA), `Average farm size 2022` = c("acres", NA,
NA, NA, NA, NA), `Average farm size 2023` = c("acres", NA, NA,
NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
I want it to look something like this:
State | Number of farms 2022 | Number of farms 2023 | Land in farms 2022 | Land in farms 2023 | Average farm size 2022 | Average farm size 2023
State | number | number | 1,000 acres | 1,000 acres | acres | acres
Alabama | 37,400 | 37,100 | 8,600 | 8,600 | 230 | 232
Since I have multiple files for different years, I would need this code to be easily applicable to all of them.
I have seen code like this:
library(dplyr)
library(tidyr)
df %>% separate(player, c('First', 'Last'))
but I'm not sure how to make it applicable to my data frame since the columns are already named.
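Would something like this work on my data? (Just a sketch, not tested; I dropped the other columns first since they only hold units/NA and their names clash with the ones separate() would create, and the units row would end up padded with NAs.)
library(dplyr)
library(tidyr)

df %>%
  select(State, values = `Number of farms 2022`) %>%   # other columns hold only units/NA
  separate(values,
           into = c("Number of farms 2022", "Number of farms 2023",
                    "Land in farms 2022", "Land in farms 2023",
                    "Average farm size 2022", "Average farm size 2023"),
           sep = " ", fill = "right")                   # units row has one value, so pad with NA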
Not sure if this helps, but the original data comes from the USDA website dated Feb 16, 2024: https://usda.library.cornell.edu/concern/publications/5712m6524
Then I load it in like this:
library(tidyr)
library(dplyr)
library(splitstackshape)
df <- read.delim("file.txt", header = FALSE, stringsAsFactors = FALSE)
df <- df[106:167, ]       # keep only the block of rows with the state table
df <- data.frame(df)      # subsetting a single column drops to a vector, so rebuild the data frame
df <- cSplit(df, "df", ":")   # split each line on the colons
And then I do a ton of processing from there. I need the text files because I'm creating a time series and USDA's ZIP files only go back to 2002.
Since this is about cleaning/preparing fixed-width data, read_fwf might be the right tool for the job:
library(readr)
library(dplyr)

read_fwf(file = "fnlo0224.txt",
         col_positions = fwf_positions(c(1, 21, 30, 40, 49, 59, 68),
                                       c(20, 29, 39, 48, 58, 67, 77)),
         skip_empty_rows = TRUE)[109:164, ] %>%   # rows 109-164 hold the per-state table
  rename_with(~ c("state", "nfarms2022", "nfarms2023",
                  "landinfarms2022", "landinfarms2023",
                  "avgfarmsize2022", "avgfarmsize2023")) %>%
  mutate(state = trimws(sub("\\..*:", "", state)),             # strip the dot leaders and colon from the state names
         across(-state, ~ as.numeric(gsub(",", "", .x)))) %>%  # drop thousands separators, convert to numeric
  filter(state != ":")                                         # drop separator lines where the field is just a colon
# A tibble: 50 × 7
state nfarms2022 nfarms2023 landinfarms2022 landinfarms2023 avgfarmsize2022
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama 37400 37100 8600 8600 230
2 Alaska 1200 1200 870 870 725
3 Arizona 16700 16400 25500 25000 1527
4 Arkans… 37800 37400 13700 13700 362
5 Califo… 63100 62900 24200 23800 384
6 Colora… 36100 35900 30200 30000 837
7 Connec… 5100 5100 370 370 73
8 Delawa… 2150 2150 520 520 242
9 Florida 44700 44400 9700 9700 217
10 Georgia 39200 39000 10000 10000 255
# ℹ 40 more rows
# ℹ 1 more variable: avgfarmsize2023 <dbl>
# ℹ Use `print(n = ...)` to see more rows
If you have multiple files, you can simply cat them together in a shell, import the result with read_fwf, and filter by row number.
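Alternatively, to stay in R, a sketch along these lines (untested; it assumes every yearly file shares this fixed-width layout and row range, the year-specific column names are replaced by generic suffixes since the years differ per file, and the file pattern is just a placeholder):
library(readr)
library(dplyr)

# hypothetical helper: import and clean one yearly release
read_farm_file <- function(path) {
  read_fwf(file = path,
           col_positions = fwf_positions(c(1, 21, 30, 40, 49, 59, 68),
                                         c(20, 29, 39, 48, 58, 67, 77)),
           skip_empty_rows = TRUE)[109:164, ] %>%
    rename_with(~ c("state", "nfarms_y1", "nfarms_y2",
                    "landinfarms_y1", "landinfarms_y2",
                    "avgfarmsize_y1", "avgfarmsize_y2")) %>%
    mutate(state = trimws(sub("\\..*:", "", state)),
           across(-state, ~ as.numeric(gsub(",", "", .x))),
           source_file = basename(path)) %>%   # keep track of which release each row came from
    filter(state != ":")
}

files <- list.files(pattern = "^fnlo.*\\.txt$")   # placeholder pattern for the yearly text files
all_years <- bind_rows(lapply(files, read_farm_file))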