
How to handle an input file with non-standard delimiters in a DSX ML pipeline?


I'm trying to work with a data set that has no header and uses :: as the field delimiter:

! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
! head ratings.dat

The output:

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

I have loaded the file into my DSX pipeline, but I can't work out how to get DSX to split this file on the :: delimiters.

Update:

The ML pipeline functionality I'm trying to use is shown in the screenshot below:

[Screenshot: the DSX ML pipeline creation screen]

I have added a data set, but can't figure out how to get DSX to recognise the field delimiters:

[Screenshot: the data set added to the pipeline]


Solution

  • As of Feb-2017...

    When you create a new pipeline and select a dataset, DSX appears to load the file you select using a Spark DataFrameReader. The DataFrameReader defaults to a single comma (,) as the delimiter, and DSX does not provide a way to change that default in the UI.
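
    For reference, the sketch below shows roughly what that kind of load looks like if you do it yourself in a notebook, and why the reader can't help here: the sep option accepts a single character only. The pipe-delimited file name is made up for illustration, and the code assumes a Spark 2.x SparkSession (a DSX notebook normally provides one as spark).

    from pyspark.sql import SparkSession

    # In a DSX notebook a SparkSession is usually provided already;
    # getOrCreate() reuses it if so.
    spark = SparkSession.builder.getOrCreate()

    # The CSV reader's `sep` option takes exactly one character, so it can
    # handle a delimiter like '|' but not the two-character '::'.
    df = (spark.read
          .option("sep", "|")          # hypothetical single-character delimiter
          .option("header", "false")
          .csv("some_pipe_delimited_file.dat"))
    df.show(3)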

    I think preprocessing the data is your best option, and you can do this in a notebook. Be aware that the Spark DataFrameReader only supports a single-character delimiter, so it won't work with this particular dataset; you can use pandas instead:

    import pandas as pd

    # ratings.dat was moved into the working directory by the `mv` step above.
    # pandas handles the multi-character '::' separator, but only with the
    # pure-Python parser, hence engine='python'.
    pdf = pd.read_csv('ratings.dat', sep='::',
                      header=None,
                      names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                      engine='python')

    # Write out a plain comma-delimited file with a header row.
    pdf.to_csv('ratings.csv', index=False)
    
    !head ratings.csv
    UserID,MovieID,Rating,Timestamp
    1,1193,5,978300760
    1,661,3,978302109
    1,914,3,978301968
    1,3408,4,978300275
    1,2355,5,978824291
    1,1197,3,978302268
    1,1287,5,978302039
    1,2804,5,978300719
    1,594,4,978302268
    

    The data is now in a format that DSX can parse properly.
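
    If you want to double-check before pointing the pipeline at it, you can load the converted file back with Spark's default comma delimiter. A quick sanity check, assuming the same spark session as in the sketch above:

    # ratings.csv is comma-delimited with a header row, so the default
    # reader settings (plus header/inferSchema) parse it cleanly.
    ratings = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("ratings.csv"))
    ratings.printSchema()
    ratings.show(3)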