We have a production web-based product that lets users make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, each with about five parameters.
Consider a class of data called a prediction:
from dataclasses import dataclass
from datetime import date

@dataclass
class Prediction:
    id: int
    predictor: int
    predictionDate: date
    predictedProductId: int
    predictedDirection: int  # 0 for decrease, 1 for increase
    valueAtPrediction: float
and a paired result class that records the outcome of the prediction:
@dataclass
class PredictionResult:
    id: int
    valueTenDaysAfterPrediction: float
    valueTwentyDaysAfterPrediction: float
    valueThirtyDaysAfterPrediction: float
We can define a test for success: a prediction succeeds if at least two of the three future-value checkpoints are favorable given the predicted direction and the value at the time of prediction.
def success(p: Prediction, r: PredictionResult) -> bool:
    count = 0
    if p.predictedDirection == 0:
        # value is predicted to fall
        if p.valueAtPrediction > r.valueTenDaysAfterPrediction: count += 1
        if p.valueAtPrediction > r.valueTwentyDaysAfterPrediction: count += 1
        if p.valueAtPrediction > r.valueThirtyDaysAfterPrediction: count += 1
    else:
        # value is predicted to increase
        if p.valueAtPrediction < r.valueTenDaysAfterPrediction: count += 1
        if p.valueAtPrediction < r.valueTwentyDaysAfterPrediction: count += 1
        if p.valueAtPrediction < r.valueThirtyDaysAfterPrediction: count += 1
    # success if count is 2 or 3
    return count >= 2
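Applied to the three-year history, this labels every example for supervised learning. A minimal sketch, assuming parallel lists named predictions and results (hypothetical names, not part of the schema above):

labels = [success(p, r) for p, r in zip(predictions, results)]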
Everything in the Prediction class is known the moment the user submits the form; the information in PredictionResult is not known until later. Ideally, a model or algorithm derived from our three-year history could be applied to a new prediction to give a probability that it will be a success (I would be happy with a boolean Y/N flag as to whether it is interesting or not).
Could I have some guidance so I can research and practice exactly what I need to solve a problem like this?
Features
The first thing you'll need to do is decide what information you'll use as evidence to classify a user's prediction as being accurate or not. For example, you could start with simple stuff like the identity of the user making the prediction, and their historical accuracy when making predictions on the same or similar goods. This information will be provided to downstream machine learning tools as features that will be used to classify the users' predictions.
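For instance, here is a minimal sketch of computing such features, assuming the resolved history lives in a pandas DataFrame named history with the fields above plus a boolean success column (both names are assumptions, not part of your schema):

import pandas as pd

def build_features(history: pd.DataFrame) -> pd.DataFrame:
    feats = history[["predictor", "predictedProductId", "predictedDirection"]].copy()
    # Each user's overall historical hit rate
    feats["predictorAccuracy"] = history.groupby("predictor")["success"].transform("mean")
    # The user's hit rate on this particular product
    feats["predictorProductAccuracy"] = history.groupby(
        ["predictor", "predictedProductId"]
    )["success"].transform("mean")
    return feats

Note that in a real pipeline these rates should be computed only from predictions resolved before the one being scored; otherwise the label leaks into the features.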
Training, Development, and Test Data
You'll want to split your 100k historical examples into three parts: training, development, and test. You should put most of the data, say 80% of it, in your training set. This will be the dataset you use to train your prediction-accuracy classifier. Generally speaking, the more data you use to train your classifier, the more accurate the resulting model will be.
The two other data sets, development and test, will be used to evaluate the performance of your classifier. You'll use the development set to evaluate the accuracy of different configurations of your classifier or variations in the feature representation. It's called the development set since you use it to continuously evaluate classification performance as you develop your model or system.
Later, after you've built a model that achieves good performance on the development data, you'll probably want an unbiased estimate of how well your classifier will perform on new data. For this you'll use the test set to evaluate how well the classifier does on data other than what you used to develop it.
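In code, the split itself is only a couple of lines. A sketch using scikit-learn, where the examples array is a stand-in for your 100k labeled predictions:

import numpy as np
from sklearn.model_selection import train_test_split

examples = np.arange(100_000)  # stand-in for your labeled historical predictions

# 80% for training; split the remaining 20% evenly into development and test
train, holdout = train_test_split(examples, test_size=0.2, random_state=42)
dev, test = train_test_split(holdout, test_size=0.5, random_state=42)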
Classifier/ML Packages
After you have your preliminary feature set and you've split the data into training, development, and test, you're ready to choose a machine learning package and classifier. A few good packages that support numerous types of classifiers include:

- scikit-learn (Python)
- Weka (Java)
- Apache Spark MLlib (Scala/Java/Python)
Which classifier you should use depends on many factors including what kind of predictions you'd like to make (e.g., binary, multiclass), what kinds of features you'd like to use, and the amount of training data you want to use.
For example, if you just want to make a binary classification of whether a user's prediction is probably accurate or not, you might want to try support vector machines (SVMs). Their basic formulation is limited to binary predictions. But, if that is all you need, they are often a good choice since they can result in very accurate models.
However, the time required to train an SVM scales poorly with the size of the training data. To train on substantial amounts of data, you might decide to use something like random forests. When random forests and SVMs are trained on data sets of the same size, random forests typically produce a model that is as accurate, or nearly as accurate, as an SVM model. However, random forests let you train on much more data, and more training data typically increases the accuracy of your model.
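Here is a sketch of trying both in scikit-learn, using synthetic stand-ins for your real feature matrix X and success labels y:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-ins for the real feature matrix and success labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# SVC training time grows quickly with the number of rows
svm = SVC(probability=True).fit(X, y)
# Random forests scale to far larger training sets
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

# Both expose predict_proba, which yields the probability of success
# you asked about; threshold it for a Y/N "interesting" flag
print(forest.predict_proba(X[:1]))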
Digging Deeper
Here are a few pointers to other good places to get started with machine learning:

- The scikit-learn tutorials (scikit-learn.org)
- Andrew Ng's Machine Learning course on Coursera