read.table

How to read a poorly formatted data text file in R?


I have a data text file (17 columns) that I want to read into R. I'm using the read.table() function:

read.table(file="data.txt", header = TRUE, sep = "\t", quote = "",comment.char="")

The problem is that some of the rows span multiple lines (example below):

10  Macron serait-il plus pro-salafiste que Hamon?!
t.co/g29oOgqih1

#Presidentielle2017 FALSE   0   NA  2017-03-02 13:45:08 FALSE   NA  837297724378726400  NA  <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> Trader496   0   FALSE   FALSE   NA  NA

Is there any way to read this kind of record as a single row, or do I have to use fill=TRUE?

Data File: https://pastebin.com/b90VHvSt


Solution

  • The melt_*() functions — available as readr::melt_*() and, since they were superseded in readr, as meltr::melt_*() — are useful for misformatted data. Cleaning a file like this can be a very tedious task, so I'll demonstrate some of the functionality and workflow without completely cleaning the data.

    This looks like it's tab-separated, so we'll start with melt_tsv():

    library(readr)
    library(dplyr)
    library(tidyr)
    
    data_raw <- melt_tsv("https://pastebin.com/raw/b90VHvSt")
    
    data_raw
    
    #> # A tibble: 281 × 4
    #>      row   col data_type value        
    #>    <dbl> <dbl> <chr>     <chr>        
    #>  1     1     1 character no  text     
    #>  2     1     2 character favorited    
    #>  3     1     3 character favoriteCount
    #>  4     1     4 character replyToSN    
    #>  5     1     5 character created      
    #>  6     1     6 character truncated    
    #>  7     1     7 character replyToSID   
    #>  8     1     8 character id           
    #>  9     1     9 character replyToUID   
    #> 10     1    10 character statusSource 
    #> # … with 271 more rows
    

    This reads in the data one token at a time, with information on location and data type. For starters, it looks like the first two column names are separated by spaces instead of a tab, so were read in as one token. We can fix this, then merge in the corrected headers to the rest of the data.

    headers_fixed <- data_raw %>% 
      filter(row == 1, col != 1) %>% 
      mutate(col = col + 1) %>% 
      select(col, col_name = value) %>% 
      add_row(col = c(1, 2), col_name = c("no", "text"), .before = 1)
    
    data_raw <- data_raw %>% 
      filter(row != 1) %>% 
      left_join(headers_fixed, by = "col") %>% 
      add_count(row, name = "row_cols")
    

    I also added a count variable showing the number of columns per row. Each row should have 17 columns, so we can use this to filter and pivot the "good" rows.

    data_ok <- data_raw %>% 
      filter(row_cols == 17) %>% 
      select(row, col_name, value) %>% 
      pivot_wider(names_from = col_name) %>% 
      type_convert()
    
    data_ok
    
    #> # A tibble: 12 × 18
    #>      row    no text  favor…¹ favor…² reply…³ created             trunc…⁴ reply…⁵
    #>    <dbl> <dbl> <chr> <lgl>     <dbl> <lgl>   <dttm>              <lgl>   <lgl>  
    #>  1     2     1 "RT … FALSE         0 NA      2017-03-02 13:45:34 FALSE   NA     
    #>  2     3     2 "Ne … FALSE         0 NA      2017-03-02 13:45:32 FALSE   NA     
    #>  3     4     3 "Il … FALSE         0 NA      2017-03-02 13:45:29 FALSE   NA     
    #>  4     5     4 "RT … FALSE         0 NA      2017-03-02 13:45:26 FALSE   NA     
    #>  5     6     5 "RT … FALSE         0 NA      2017-03-02 13:45:26 FALSE   NA     
    #>  6     7     6 "RT … FALSE         0 NA      2017-03-02 13:45:25 FALSE   NA     
    #>  7     8     7 "RT … FALSE         0 NA      2017-03-02 13:45:13 FALSE   NA     
    #>  8     9     8 "#Pr… FALSE         0 NA      2017-03-02 13:45:10 FALSE   NA     
    #>  9    10     9 "#Pr… FALSE         0 NA      2017-03-02 13:45:10 FALSE   NA     
    #> 10    16    11 "RT … FALSE         0 NA      2017-03-02 13:44:58 FALSE   NA     
    #> 11    21    13 "RT … FALSE         0 NA      2017-03-02 13:44:46 FALSE   NA     
    #> 12    26    15 "Dim… FALSE         0 NA      2017-03-02 13:44:41 FALSE   NA     
    #> # … with 9 more variables: id <dbl>, replyToUID <lgl>, statusSource <chr>,
    #> #   screenName <chr>, retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>,
    #> #   longitude <lgl>, latitude <lgl>, and abbreviated variable names ¹​favorited,
    #> #   ²​favoriteCount, ³​replyToSN, ⁴​truncated, ⁵​replyToSID
    

    This leaves us with 61 values in 13 "bad" rows. Diagnosing and fixing these will take more work, which is left as an exercise for the reader.

    data_bad <- data_raw %>% 
      filter(row_cols != 17) 
    
    data_bad
    
    #> # A tibble: 61 × 6
    #>      row   col data_type value                                   col_n…¹ row_c…²
    #>    <dbl> <dbl> <chr>     <chr>                                   <chr>     <int>
    #>  1    11     1 integer   10                                      no            2
    #>  2    11     2 character Macron serait-il plus pro-salafiste qu… text          2
    #>  3    12     1 character <url shortener rmvd>                    no            1
    #>  4    13     1 missing   <NA>                                    no            1
    #>  5    14     1 missing   <NA>                                    no            1
    #>  6    15     1 character #Presidentielle2017                     no           16
    #>  7    15     2 logical   FALSE                                   text         16
    #>  8    15     3 integer   0                                       favori…      16
    #>  9    15     4 missing   <NA>                                    favori…      16
    #> 10    15     5 datetime  2017-03-02 13:45:08                     replyT…      16
    #> # … with 51 more rows, and abbreviated variable names ¹​col_name, ²​row_cols
    
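    As a starting point for that exercise, here is one possible approach — a sketch only, run on a hypothetical miniature stand-in rather than the real file. It assumes that a new logical record begins wherever column 1 holds an integer (the `no` field), treats any surplus tokens as fragments of the split `text` field, and pastes those fragments back together:

    ```r
    library(dplyr)

    # Reassemble records that were split across physical lines.
    # Assumptions (hypothetical, based on the rows shown above): a new
    # record starts wherever column 1 holds an integer (the `no` field),
    # and any surplus tokens are fragments of the split `text` field.
    reassemble <- function(tokens, n_fields) {
      tokens %>%
        mutate(record = cumsum(col == 1 & data_type == "integer")) %>%
        group_by(record) %>%
        mutate(
          token = row_number(),
          col = case_when(
            token == 1 ~ 1L,                      # the `no` field
            token <= n() - (n_fields - 2L) ~ 2L,  # fragments of `text`
            TRUE ~ token - (n() - n_fields)       # the trailing fields
          )
        ) %>%
        # paste the text fragments back into a single value per column
        group_by(record, col) %>%
        summarise(value = paste(value, collapse = " "), .groups = "drop")
    }

    # Miniature stand-in for data_bad (pretend records have 4 fields)
    toks <- tibble::tribble(
      ~row, ~col, ~data_type,  ~value,
        11,    1, "integer",   "10",
        11,    2, "character", "Macron serait-il",
        12,    1, "character", "t.co/xyz",
        13,    1, "character", "#Tag",
        13,    2, "logical",   "FALSE",
        13,    3, "integer",   "0"
    )

    reassemble(toks, n_fields = 4)
    ```

    On the real data_bad this would need n_fields = 17 and a more careful rule for where records begin, since not every bad record necessarily starts with an integer token.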

    Created on 2022-11-09 with reprex v2.0.2