rdatabasetextmining

Extract information from a structured text file to create a dataframe in R


I need to organize the information from a long (and old) text file containing thousands of items into a dataframe. The information in the text file follows the same structure in all the items. My goal is to arrange each item in a different row of the dataframe.

Structure of the text file:

Title (number of books) Country
Date time (author) Page number CODES letter,letter...
Notes

An example of the content, showing the first 3 items:

Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.

Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by the E.L.

Wide Sargasso Sea  (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent to the N.U.C.

I need to extract the first 6 elements of each item (title, number, country, date, time, author) and ignore the rest. The desired dataframe would be:

Title NoBooks Country Date time Author
Pride and Prejudice 5 United Kingdom 1981 10:23 Jane Austen
Brave New World 2 United Kingdom 1977 09:14 JAldous Huxley
Wide Sargasso Sea 1 Jamaica 1989 16:51 Jean Rhys

I have just found two similar posts (converting multiple lines of text into a data frame and Converting text file into dataframe in R) but my database doesn't have key characters to be used as separators.

Is there a way to separate my elemets? I've found a solution using Python libraries, but I would like to do it with R. Any suggestions?


Solution

  • Hope this could help you.

    p.d. some column data types could be cast to numeric of date since these are all text.

    data<-"Pride and Prejudice (5) United Kingdom
    1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
    Deposited by the G.M.W.
    
    Brave New World (2) United Kingdom
    1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
    Deposited by the E.L.
    
    Wide Sargasso Sea  (1) Jamaica
    1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
    Sent to the N.U.C."
    
    con <- textConnection(data, "r") # replace with:  con <- file("yourfile.txt") 
    data <- readLines(con)
    close(con)
    
    l1 <- data[seq(1,length(data), 4)]
    l2 <- data[seq(2,length(data), 4)]
    
    d1 <- regmatches(l1, regexec("^(.*) \\((\\d+)\\) (.*)", l1 ))
    d2 <- regmatches(l2, regexec("^(\\d{4}) (\\d{2}:\\d{2}) h \\((.*)\\)", l2))
    df <- as.data.frame(do.call(rbind, mapply(c, d1, d2, SIMPLIFY = F))[,c(-1,-5)])
    
    colnames(df) <- c("Title","NoBooks","Country","Date","time","Author")
    
    df
    #>                 Title NoBooks        Country Date  time        Author
    #> 1 Pride and Prejudice       5 United Kingdom 1981 10:23   Jane Austen
    #> 2     Brave New World       2 United Kingdom 1977 09:14 Aldous Huxley
    #> 3  Wide Sargasso Sea        1        Jamaica 1989 16:51     Jean Rhys