
How to handle an input file with non-standard delimiters in a DSX ML pipeline?


I'm trying to work with a data set that has no header and uses :: as the field delimiter:

! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
! head ratings.dat

The output:

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

I have loaded the file into my DSX pipeline, but I can't work out how to get DSX to split this file on the :: delimiters.

Update:

The ML pipeline functionality I'm trying to use is shown in the screenshot below:

[Screenshot: the DSX ML pipeline creation screen]

I have added a data set, but can't figure out how to get DSX to recognise the field delimiters:

[Screenshot: the data set added to the pipeline]


Solution

  • As of Feb-2017...

    When you create a new pipeline and select a dataset, DSX appears to load the file you select using a Spark DataFrameReader. The DataFrameReader defaults to a single comma (,) as the delimiter, and DSX does not provide a way to change that default in the UI.
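
    For reference, the sketch below shows roughly what that kind of load looks like if you do it yourself in a notebook, and why the reader can't help here: the sep option accepts a single character only. The pipe-delimited file name is made up for illustration, and the code assumes a Spark 2.x SparkSession (a DSX notebook normally provides one as spark).

    from pyspark.sql import SparkSession

    # In a DSX notebook a SparkSession is usually provided already;
    # getOrCreate() reuses it if so.
    spark = SparkSession.builder.getOrCreate()

    # The CSV reader's `sep` option takes exactly one character, so it can
    # handle a delimiter like '|' but not the two-character '::'.
    df = (spark.read
          .option("sep", "|")          # hypothetical single-character delimiter
          .option("header", "false")
          .csv("some_pipe_delimited_file.dat"))
    df.show(3)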

    I think preprocessing the data is your best option, and you can do this in a notebook. Be aware that the Spark DataFrameReader only supports a single-character delimiter, so it won't work with this particular dataset; you can use pandas instead:

    import pandas as pd

    # ratings.dat was moved into the working directory by the `mv` step above.
    # pandas handles the multi-character '::' separator, but only with the
    # pure-Python parser, hence engine='python'.
    pdf = pd.read_csv('ratings.dat', sep='::',
                      header=None,
                      names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                      engine='python')

    # Write out a plain comma-delimited file with a header row.
    pdf.to_csv('ratings.csv', index=False)
    
    !head ratings.csv
    UserID,MovieID,Rating,Timestamp
    1,1193,5,978300760
    1,661,3,978302109
    1,914,3,978301968
    1,3408,4,978300275
    1,2355,5,978824291
    1,1197,3,978302268
    1,1287,5,978302039
    1,2804,5,978300719
    1,594,4,978302268
    

    The data is now in a format that DSX can parse properly.
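
    If you want to double-check before pointing the pipeline at it, you can load the converted file back with Spark's default comma delimiter. A quick sanity check, assuming the same spark session as in the sketch above:

    # ratings.csv is comma-delimited with a header row, so the default
    # reader settings (plus header/inferSchema) parse it cleanly.
    ratings = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("ratings.csv"))
    ratings.printSchema()
    ratings.show(3)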