I need to organize the information from a long (and old) text file containing thousands of items into a dataframe. The information in the text file follows the same structure in all the items. My goal is to arrange each item in a different row of the dataframe.
Structure of the text file:
Title (number of books) Country
Date time (author) Page number CODES letter,letter...
Notes
An example of the content, showing the first 3 items:
Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.
Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by the E.L.
Wide Sargasso Sea (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent to the N.U.C.
I need to extract the first 6 elements of each item (title, number, country, date, time, author) and ignore the rest. The desired dataframe would be:
Title | NoBooks | Country | Date | time | Author |
---|---|---|---|---|---|
Pride and Prejudice | 5 | United Kingdom | 1981 | 10:23 | Jane Austen |
Brave New World | 2 | United Kingdom | 1977 | 09:14 | Aldous Huxley |
Wide Sargasso Sea | 1 | Jamaica | 1989 | 16:51 | Jean Rhys |
I have just found two similar posts (converting multiple lines of text into a data frame and Converting text file into dataframe in R) but my database doesn't have key characters to be used as separators.
Is there a way to separate my elements? I've found a solution using Python libraries, but I would like to do it in R. Any suggestions?
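Hope this helps: read the file line by line, take the first two lines of each item, and pull out the fields with regular expressions.
P.S. All columns come out as text, so some of them could be cast to numeric or date types.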
data<-"Pride and Prejudice (5) United Kingdom
1981 10:23 h (Jane Austen) Page 241 CODES OB,IT,CA
Deposited by the G.M.W.
Brave New World (2) United Kingdom
1977 09:14 h (Aldous Huxley) Page 205 CODES OB,PU
Deposited by the E.L.
Wide Sargasso Sea (1) Jamaica
1989 16:51 h (Jean Rhys) Page 183 CODES OB,CA
Sent to the N.U.C."
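con <- textConnection(data, "r") # replace with: con <- file("yourfile.txt")
data <- readLines(con)
close(con)
# Each item spans 3 lines here; use a step of 4 if a blank line separates items in your file
l1 <- data[seq(1, length(data), 3)] # title (number of books) country
l2 <- data[seq(2, length(data), 3)] # date time (author) ...
# Capture groups: title, number of books, country
d1 <- regmatches(l1, regexec("^(.*) \\((\\d+)\\) (.*)", l1))
# Capture groups: year, time, author
d2 <- regmatches(l2, regexec("^(\\d{4}) (\\d{2}:\\d{2}) h \\((.*)\\)", l2))
# Bind the matches row-wise and drop the two full-match columns (1 and 5)
df <- as.data.frame(do.call(rbind, mapply(c, d1, d2, SIMPLIFY = FALSE))[, c(-1, -5)])
colnames(df) <- c("Title", "NoBooks", "Country", "Date", "time", "Author")
df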
#> Title NoBooks Country Date time Author
#> 1 Pride and Prejudice 5 United Kingdom 1981 10:23 Jane Austen
#> 2 Brave New World 2 United Kingdom 1977 09:14 Aldous Huxley
#> 3 Wide Sargasso Sea 1 Jamaica 1989 16:51 Jean Rhys
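Following up on the note above about casting column types, here is a minimal sketch of one way to do it. The `df` below is a hand-built stand-in for the result printed above, so the snippet runs on its own. Treating `Date` as an integer year is an assumption about how you want to use it. Going through `as.character()` first is a precaution: in R versions before 4.0, `as.data.frame()` turns text columns into factors by default, and `as.integer()` on a factor returns the internal level codes rather than the numbers.

```r
# Stand-in for the `df` built above: every column starts out as text.
df <- data.frame(
  Title   = c("Pride and Prejudice", "Brave New World", "Wide Sargasso Sea"),
  NoBooks = c("5", "2", "1"),
  Country = c("United Kingdom", "United Kingdom", "Jamaica"),
  Date    = c("1981", "1977", "1989"),
  time    = c("10:23", "09:14", "16:51"),
  Author  = c("Jane Austen", "Aldous Huxley", "Jean Rhys"),
  stringsAsFactors = FALSE
)

# Convert via as.character() so the result is right even if a column is a factor.
df$NoBooks <- as.integer(as.character(df$NoBooks))
df$Date    <- as.integer(as.character(df$Date))  # assumption: keep the year as an integer

# Inspect the resulting column types.
str(df)
```

The `time` column is left as text here; how to store it depends on what you plan to do with it.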