machine-learningpredictionibm-watsonpredictionio

Predict text fields based on text fields


I am working on machine learning and prediction for about a month. I have tried IBM watson with bluemix, Amazon machine learning, and predictionIO. What I want to do is to predict a text field based on other fields. My CSV file have four text fields named Question,Summary,Description,Answer and about 4500 lines/Recrods. No numerical fields are in the uploaded dataset. A typical record looks like below.

{'Question':'sys down','Summary':'does not boot after OS update','Description':'Desktop does not boot','Answer':'Switch to safemode and rollback last update'}

On IBM watson I found a question in their forums and a reply that custom corpus upload is not possible right now. Then I moved to Amazon machine learning. I followed their documentation and was able to implement prediction in a custom app using API. I tested on movielens data and everything was numerical. I successfully uploaded data and got movie recommendations with their python-boto library. When I tried uploading my CSV file The problem I had was that no text field can be selected as target. Then I added numerical values corresponds to each value in CSV.This approcah made prediction successful but the accuracy was not right. May be the CSV had to be formatted in a better way.

A record from the movielens data is pasted below. It says that userID 196 gave movieID 242 a two star rating at time (Unix timestamp) 881250949.

196 242 3   881250949

Currently I am trying predictionIO. A test on movielens database was run successfully without issues as told in the documentation using recommendation template. But still its unclear the possibilities of predicting a text field based on other text fields.

Does prediction run on numerical Fields only or a text field can be predicted based on other text fields?


Solution

  • No, prediction does not only run on numerical fields. It could be anything including text. My guess is that the MovieLens data uses ID instead of actual user and movie names because

    1. this saves storage space (this dataset is there for a long time and back then storage is definitely a concern), and

    2. there is no need to know the actual user name (privacy concern)

    For your case, you might want to look at the text classification template https://docs.prediction.io/demo/textclassification/ . You will need to model how you want each record to be classified.