I'm trying to work with a data set that has no header row and uses :: as the field delimiter:
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
! head ratings.dat
The output:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
I have loaded the file into my DSX pipeline, but I can't see how to get DSX to split this file on the :: delimiter.
How do I do this?
If it is not possible to get DSX to reshape this file using the DSX ML pipeline functionality, does DSX have any prerequisites in terms of input file format?
Update:
The ML pipeline functionality I'm trying to use can be seen in the screenshot below:
I have added a data set, but can't figure out how to get DSX to recognise the field delimiters:
As of Feb-2017...
When you create a new pipeline and select a dataset, I believe DSX loads the file you select using a Spark DataFrameReader. The DataFrameReader defaults to a single comma (,) as the delimiter, and DSX does not provide a way to change the default delimiter in the UI.
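To illustrate (a minimal sketch, assuming a SparkSession named spark is already available, as it is in a DSX notebook), loading the raw file with the reader's defaults leaves each record in a single column, because there are no commas to split on:
# Sketch only: read ratings.dat with the DataFrameReader defaults.
# `spark` is the SparkSession a DSX notebook already provides.
raw = spark.read.csv('ratings.dat')
raw.show(3, truncate=False)
# With no commas in the file, each record lands in one column, e.g.
# _c0 = '1::1193::5::978300760'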
I think preprocessing the data is your best option. You can do this in a notebook. Be aware that the Spark DataFrameReader only supports a single-character delimiter, so you can't use it with this particular dataset. You can use pandas, however.
import pandas as pd

# The two-character '::' separator requires the Python parser engine;
# pandas' default C engine only supports single-character delimiters.
pdf = pd.read_csv('ratings.dat',  # moved into the working directory above
                  sep='::',
                  header=None,
                  names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                  engine='python')

# Write a plain comma-separated file that DSX can read with its defaults.
pdf.to_csv('ratings.csv', index=False)
!head ratings.csv
UserID,MovieID,Rating,Timestamp
1,1193,5,978300760
1,661,3,978302109
1,914,3,978301968
1,3408,4,978300275
1,2355,5,978824291
1,1197,3,978302268
1,1287,5,978302039
1,2804,5,978300719
1,594,4,978302268
Now the data will be in a format that DSX will be able to parse properly.
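As a quick check (again just a sketch, assuming the same spark session), the converted file now loads into four typed columns with the reader's defaults:
# Sketch only: the converted CSV parses cleanly with default settings.
ratings = spark.read.csv('ratings.csv', header=True, inferSchema=True)
ratings.printSchema()   # UserID, MovieID, Rating, Timestamp inferred as integers
ratings.show(3)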