I have downloaded NOOA archive of hurricanes, the original data was a text file that I imported and saved as csv. Now I am trying to reorder the data, unfortunately observation names are mixed (start with 2 characters , i.e AL011851) with dates (with no hyphen). I have tried different things to no avail. How to proceed from there? I would really appreciate any help or ideas.
Here is a dput of the data head:
structure(list(date_index = c("AL011851", "18510625", "18510625",
"18510625", "18510625", "18510625"), time = c(" UNNAMED",
" 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c(" 14",
" ", " ", " ", " ", " L"), type = c("", " HU", " HU", " HU",
" HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), newname = c(TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE)), row.names = c(NA, 6L), class = "data.frame")
I'm going to assume that you are using the Atlantic hurricane database (HURDAT2) 1851-2021 or something very like it.
The issue is that this file is not a very good CSV file. It contains some rows that identify the hurricane, and other rows that contain the data for that hurricane. For example:
AL021851, UNNAMED, 1,
18510705, 1200, , HU, 22.2N, 97.6W, 80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999
It looks like the first type of row contains the hurricane identifier, name and number of observations. It would be better if those 3 pieces of information were in their own columns, rather than in a row - this would be tidy data.
So we will have to tidy the data after reading it into R. Here is one approach:
read_csv
without a header rowHere's a tidyverse approach to achieve the above steps:
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
hurdat2_1851_2021_041922 <- read_csv("https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2021-041922.txt",
col_names = FALSE)
hurdat2 <- hurdat2_1851_2021_041922 %>%
mutate(ID = ifelse(str_detect(X1, "^[A-Z]"), X1, NA),
Name = ifelse(str_detect(X2, "^[A-Z]"), X2, NA),
N_obs = ifelse(str_detect(X3, "^\\d+$"), X3, NA)) %>%
fill(ID, Name, N_obs) %>%
filter(!str_detect(X1, "^[A-Z]")) %>%
separate(X4, sep = ",\\s+", into = paste0("X", 4:21))
And here is the result (first few lines):
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 ID Name N_obs
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 18510625 0000 NA HU 28.0N 94.8W 80 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
2 18510625 0600 NA HU 28.0N 95.4W 80 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
3 18510625 1200 NA HU 28.0N 96.0W 80 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
4 18510625 1800 NA HU 28.1N 96.5W 80 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
5 18510625 2100 L HU 28.2N 96.8W 80 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
6 18510626 0000 NA HU 28.2N 97.0W 70 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
7 18510626 0600 NA TS 28.3N 97.6W 60 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
8 18510626 1200 NA TS 28.4N 98.3W 60 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
9 18510626 1800 NA TS 28.6N 98.9W 50 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
10 18510627 0000 NA TS 29.0N 99.4W 50 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 AL011851 UNNAMED 14
Finally, you will want to rename the columns and perhaps convert some of them to the correct type e.g. Date for X1. I'm not familiar with the data structure so I'll leave that for you.