rmultiple-columnsrowsmixed

Reading a csv dataset in R where lines and columns are mixed


I have downloaded NOOA archive of hurricanes, the original data was a text file that I imported and saved as csv. Now I am trying to reorder the data, unfortunately observation names are mixed (start with 2 characters , i.e AL011851) with dates (with no hyphen). I have tried different things to no avail. How to proceed from there? I would really appreciate any help or ideas.

Here is a dput of the data head:

structure(list(date_index = c("AL011851", "18510625", "18510625", 
"18510625", "18510625", "18510625"), time = c("            UNNAMED", 
" 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c("     14", 
"  ", "  ", "  ", "  ", " L"), type = c("", " HU", " HU", " HU", 
" HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), newname = c(TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE)), row.names = c(NA, 6L), class = "data.frame")

Solution

  • I'm going to assume that you are using the Atlantic hurricane database (HURDAT2) 1851-2021 or something very like it.

    The issue is that this file is not a very good CSV file. It contains some rows that identify the hurricane, and other rows that contain the data for that hurricane. For example:

    AL021851,            UNNAMED,      1,
    18510705, 1200,  , HU, 22.2N,  97.6W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999
    

    It looks like the first type of row contains the hurricane identifier, name and number of observations. It would be better if those 3 pieces of information were in their own columns, rather than in a row - this would be tidy data.

    So we will have to tidy the data after reading it into R. Here is one approach:

    1. Read the data into R using read_csv without a header row
    2. Add a column for hurricane ID based on column X1
    3. Add a column for hurricane Name based on column X2
    4. Add a column for hurricane number of observations based on column X3
    5. Fill in the empty values for the 3 new columns
    6. Filter out the rows where X1 is not a date
    7. Separate the comma-delimited values in column X4 into new columns

    Here's a tidyverse approach to achieve the above steps:

       library(dplyr)
       library(tidyr)
       library(stringr)
       library(readr)
    
       hurdat2_1851_2021_041922 <- read_csv("https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2021-041922.txt", 
                                            col_names = FALSE)
    
       hurdat2 <- hurdat2_1851_2021_041922 %>% 
         mutate(ID = ifelse(str_detect(X1, "^[A-Z]"), X1, NA), 
                Name = ifelse(str_detect(X2, "^[A-Z]"), X2, NA), 
                N_obs = ifelse(str_detect(X3, "^\\d+$"), X3, NA)) %>% 
         fill(ID, Name, N_obs) %>% 
         filter(!str_detect(X1, "^[A-Z]")) %>% 
         separate(X4, sep = ",\\s+", into = paste0("X", 4:21))
    

    And here is the result (first few lines):

       X1       X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13   X14   X15   X16   X17   X18   X19   X20   X21   ID       Name    N_obs
       <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>    <chr>   <chr>
     1 18510625 0000  NA    HU    28.0N 94.8W 80    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     2 18510625 0600  NA    HU    28.0N 95.4W 80    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     3 18510625 1200  NA    HU    28.0N 96.0W 80    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     4 18510625 1800  NA    HU    28.1N 96.5W 80    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     5 18510625 2100  L     HU    28.2N 96.8W 80    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     6 18510626 0000  NA    HU    28.2N 97.0W 70    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     7 18510626 0600  NA    TS    28.3N 97.6W 60    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     8 18510626 1200  NA    TS    28.4N 98.3W 60    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
     9 18510626 1800  NA    TS    28.6N 98.9W 50    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14   
    10 18510627 0000  NA    TS    29.0N 99.4W 50    -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  -999  AL011851 UNNAMED 14
    

    Finally, you will want to rename the columns and perhaps convert some of them to the correct type e.g. Date for X1. I'm not familiar with the data structure so I'll leave that for you.