r data-manipulation data-wrangling reformatting

Parse and import metadata into R


I have a file containing metadata for Amazon products, structured like this:

Id:   0
ASIN: 0771044445
  discontinued product

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

Id:   2
ASIN: 0738700797
  title: Candlemas: Feast of Flames
  group: Book
  salesrank: 168596
  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
  reviews: total: 12  downloaded: 12  avg rating: 4.5
    2001-12-16  cutomer: A11NCO6YTE4BTJ  rating: 5  votes:   5  helpful:   4
    2002-1-7  cutomer:  A9CQ3PLRNIR83  rating: 4  votes:   5  helpful:   5
    2002-1-24  cutomer: A13SG9ACZ9O5IM  rating: 5  votes:   8  helpful:   8
    2002-1-28  cutomer: A1BDAI6VEYMAZA  rating: 5  votes:   4  helpful:   4
    2002-2-6  cutomer: A2P6KAWXJ16234  rating: 4  votes:  16  helpful:  16
    2002-2-14  cutomer:  AMACWC3M7PQFR  rating: 4  votes:   5  helpful:   5
    2002-3-23  cutomer: A3GO7UV9XX14D8  rating: 4  votes:   6  helpful:   6
    2002-5-23  cutomer: A1GIL64QK68WKL  rating: 5  votes:   8  helpful:   8
    2003-2-25  cutomer:  AEOBOF2ONQJWV  rating: 5  votes:   8  helpful:   5
    2003-11-25  cutomer: A3IGHTES8ME05L  rating: 5  votes:   5  helpful:   5
    2004-2-11  cutomer: A1CP26N8RHYVVO  rating: 1  votes:  13  helpful:   9
    2005-2-7  cutomer:  ANEIANH0WAT9D  rating: 5  votes:   1  helpful:   1

I found a CSV that contains the same data in exactly the layout I want:

"id","title","group","salesrank","review_cnt","downloads","rating"
"1","Patterns of Preaching: A Sermon Sampler","Book",396585,2,2,5
"2","Candlemas: Feast of Flames","Book",168596,12,12,4.5

Although I already have the file I need, I would like to know how to generate it myself, preferably in R, so that I can import the data as a data frame.

Thank you.


Solution

  • You could try:

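    fields <- c("Id", "title", "group", "salesrank", "total", "downloaded")
    text <- readLines("mydata.txt")  # the file containing the Amazon metadata
    # Keep only the lines that carry the wanted fields
    a <- grep('^ *(Id|title|reviews|salesrank|group):', text, value = TRUE)
    # Rewrite only the "reviews:" lines, so titles containing "avg" or "rating" stay intact
    is_rev <- grepl('^ *reviews:', a)
    b <- replace(a, is_rev, gsub('reviews:|avg', '', a[is_rev]))
    # Put each review statistic on its own line so it parses as a separate field
    d <- trimws(replace(b, is_rev, gsub("(downloaded|rating)", "\n\\1", b[is_rev])))
    # Start a new chunk at every "Id:" line and parse each chunk as a DCF record
    e <- do.call(rbind.data.frame, tapply(d, cumsum(grepl('^Id:', d)), function(x) 
      read.dcf(textConnection(x), fields = fields)))
    type.convert(e, as.is = TRUE)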
    
      Id                                   title group salesrank total downloaded
    1  0                                    <NA>  <NA>        NA    NA         NA
    2  1 Patterns of Preaching: A Sermon Sampler  Book    396585     2          2
    3  2              Candlemas: Feast of Flames  Book    168596    12         12
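
  • If you also want the average rating and the column names from the target CSV, one possible alternative is to split the file into one block per product and pull each field out with a regular expression. This is only a sketch under a few assumptions: the records are inlined here so that it runs as-is (for the real file you would use readLines("mydata.txt") instead), `grab` is a hypothetical helper written for this example, and products without a title (such as the discontinued one) are dropped to match the target CSV.

```r
# Sample records in the layout from the question, inlined so the sketch is
# self-contained. For the real file: text <- readLines("mydata.txt")
text <- c(
  "Id:   0",
  "ASIN: 0771044445",
  "  discontinued product",
  "",
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book",
  "  salesrank: 396585",
  "  reviews: total: 2  downloaded: 2  avg rating: 5",
  "    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9",
  "",
  "Id:   2",
  "ASIN: 0738700797",
  "  title: Candlemas: Feast of Flames",
  "  group: Book",
  "  salesrank: 168596",
  "  reviews: total: 12  downloaded: 12  avg rating: 4.5"
)

# Hypothetical helper: return the first capture group of `pattern` from the
# first line of `block` that matches it, or NA when no line matches.
grab <- function(block, pattern) {
  hit <- grep(pattern, block, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "\\1", hit[1])
}

# Number the records: every line starting with "Id:" opens a new block.
# Lines before the first "Id:" (e.g. a file header) get 0 and are skipped.
rec <- cumsum(grepl("^Id:", text))
blocks <- split(text[rec > 0], rec[rec > 0])

# Build a one-row data frame per block; all patterns are anchored to the
# field name so that individual review lines and titles are not mistaken
# for fields.
rows <- lapply(blocks, function(b) data.frame(
  id         = grab(b, "^Id: *([0-9]+) *$"),
  title      = grab(b, "^ *title: *(.*)$"),
  group      = grab(b, "^ *group: *(.*)$"),
  salesrank  = grab(b, "^ *salesrank: *(-?[0-9]+) *$"),
  review_cnt = grab(b, "^ *reviews: *total: *([0-9]+).*$"),
  downloads  = grab(b, "^ *reviews:.*downloaded: *([0-9]+).*$"),
  rating     = grab(b, "^ *reviews:.*avg rating: *([0-9.]+).*$"),
  stringsAsFactors = FALSE
))

# Combine the rows and convert the character columns to suitable types.
products <- type.convert(do.call(rbind, rows), as.is = TRUE)

# Drop records without a title (e.g. discontinued products).
products <- products[!is.na(products$title), ]
```

    To write the result to disk you could then call write.csv(products, "products.csv", row.names = FALSE), where the output file name is just a placeholder. Note that write.csv leaves numeric columns unquoted, so id is written as 1 rather than "1".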