Tags: r, parsing, dataset, spaces, read.table

R: read data from a space-delimited txt file with quoted text


I'm trying to load a dataset into RStudio. The dataset is space-delimited, but it also contains spaces inside quoted text, as CSV files do. Here is the head of the data:

DOC_ID  LABEL   RATING  VERIFIED_PURCHASE   PRODUCT_CATEGORY    PRODUCT_ID  PRODUCT_TITLE   REVIEW_TITLE    REVIEW_TEXT
1   __label1__  4   N   PC  B00008NG7N  "Targus PAUK10U Ultra Mini USB Keypad, Black"   useful  "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2   __label1__  4   Y   Wireless    B00LH0Y3NM  Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable    New era for batteries   Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3   __label1__  3   N   Baby    B000I5UZ1Q  "Fisher-Price Papasan Cradle Swing, Starlight"  doesn't swing very well.    "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4   __label1__  4   N   Office Products B003822IRA  Casio MS-80B Standard Function Desktop Calculator   Great computing!    I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5   __label1__  4   N   Beauty  B00PWSAXAM  Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week   "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6   __label1__  3   N   Health & Personal Care  B00686HNUK  Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe    not sure    I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7   __label1__  4   N   Toys    B00NUG865W  ESPN 2-Piece Table Tennis   PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8   __label1__  4   Y   Beauty  B00QUL8VX6  "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz."   Great vitamin C serum   "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9   __label1__  4   N   Health & Personal Care  B004YHKVCM  PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub  wonderful detergent.    "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."

The problem is that the file looks tab-delimited but is not. For example, in the row with DOC_ID = 1 there are only two spaces between useful and "When least...". As a result, passing sep = "\t" to read.table throws an error saying that line 1 did not have 10 elements. That count also looks wrong to me, because each row should have 9 elements. Here are the parameters I'm passing (without the original path):

read.table(file = "path", sep = "\t", header = TRUE, strip.white = TRUE)

Relying on quotes is not a good strategy either, because some lines do not have their text quoted. The delimiter would have to be something like a double space, which combined with strip.white should work properly, but read.table only accepts a single-byte delimiter.
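
To make the double-space idea concrete, here is a minimal Python sketch, since Python's re module allows multi-character delimiters. It assumes that fields are separated by a tab or by two or more spaces and that no field contains such a run itself. split_fields is a hypothetical helper written for this illustration, not part of any library.

```python
import re

# Assumption: fields are separated by a tab or by a run of two or more spaces,
# and no field contains such a run itself.
FIELD_SEP = re.compile(r"\t| {2,}")


def split_fields(line):
    """Split one line into fields and drop optional surrounding double quotes."""
    cleaned = []
    for field in FIELD_SEP.split(line.rstrip("\n")):
        field = field.strip()
        # Remove the quotes only when they wrap the whole field.
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            field = field[1:-1]
        cleaned.append(field)
    return cleaned


# Shortened version of the DOC_ID = 1 row from the sample above.
line = ('1   __label1__  4   N   PC  B00008NG7N  '
        '"Targus PAUK10U Ultra Mini USB Keypad, Black"   useful  '
        '"When least you think so, this product will save the day."')
row = split_fields(line)
print(len(row))
```

This is only a fallback: it mis-splits any row where a field contains two consecutive spaces, or where a separator is a single space.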

So the question is: how would you parse such a corpus in R, or with any third-party software that could convert it to a CSV or at least a tab-delimited file?


Solution

  • Parsing the data with Python's pandas.read_csv(filename, sep='\t', header = 0, ...) seems to have worked, and from that point anything can be done with it. Closing this out.
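
For readers without pandas, the same approach (tab as the separator, the double quote as the only quote character) can be sketched with Python's standard csv module, which also produces the CSV file asked for above. This assumes the file really is tab-delimited, which the successful sep='\t' read suggests but does not prove. tsv_to_csv and the in-memory sample are hypothetical names and data for illustration only.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows.

    Returns the number of rows written, header included.
    """
    # Only the double quote acts as a quote character, so apostrophes such as
    # the one in "doesn't" stay ordinary text.
    reader = csv.reader(src, delimiter="\t", quotechar='"')
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Tiny in-memory sample (assumed layout, reduced to four columns).
sample = (
    "DOC_ID\tRATING\tPRODUCT_TITLE\tREVIEW_TITLE\n"
    '3\t3\t"Fisher-Price Papasan Cradle Swing, Starlight"\tdoesn\'t swing very well.\n'
)
out = io.StringIO()
n_rows = tsv_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

With real files, open both the input and the output with newline="" as the csv module documentation recommends, and pass the file objects to tsv_to_csv in place of the StringIO buffers.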