Tags: python, algorithm, machine-learning, artificial-intelligence, scikit-learn

Using ranking data in Logistic Regression


I am trying to use some ranking data in a logistic regression. I want to use machine learning to build a simple classifier that predicts whether a webpage is "good" or not. It's just a learning exercise, so I don't expect great results; I'm just hoping to learn the process and the coding techniques.

I have put my data in a .csv as follows :

URL WebsiteText AlexaRank GooglePageRank

In my Test CSV we have :

URL WebsiteText AlexaRank GooglePageRank Label

Label is a binary classification indicating "good" with 1 or "bad" with 0.

I currently have my LR running using only the website text, which I run TF-IDF on.

I have two questions which I need help with:


Solution

  • sklearn.preprocessing.StandardScaler is probably the first thing you want to try. StandardScaler rescales each feature to zero mean and unit standard deviation, which keeps a large-valued column such as AlexaRank on a scale comparable to the other features.

    Here is how you can scale the X matrix.

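    from sklearn import preprocessing
    # Learn each column's mean and standard deviation from X, then scale X
    sc = preprocessing.StandardScaler().fit(X)
    X = sc.transform(X)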
    

    Don't forget to use the same fitted scaler to transform X_test.

    X_test = sc.transform(X_test)
    

    Now you can fit the model and predict as usual.

    # rd is your logistic regression model
    rd.fit(X, y)
    rd.predict_proba(X_test)
    

    Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html

    Edit: The parsing and column merging can be done easily with pandas, i.e., there is no need to convert the matrices into lists and then append them. Moreover, pandas DataFrames can be indexed directly by their column names.

    import pandas as p

    AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
    AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
    # DataFrame.append was removed in pandas 2.0; concat stacks the rows instead
    AllAlexaAndGoogleInfo = p.concat([AlexaAndGoogleTestData, AlexaAndGoogleTrainData])
    

    Note that we pass header=0 to read_table so that the column names are taken from the first row of the TSV file. Note also that passing a list of column names selects all of those columns at once. Finally, you can stack this new matrix with X using numpy.hstack; make sure its rows are in the same order as the rows of X.

    X = np.hstack((X, AllAlexaAndGoogleInfo))
    

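    hstack horizontally combines two multi-dimensional array-like structures, provided they have the same number of rows. numpy.hstack expects dense arrays, so if X is a sparse matrix (as TF-IDF output from scikit-learn usually is), use scipy.sparse.hstack instead.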
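
To tie the steps above together, here is a minimal end-to-end sketch. It is not the asker's actual pipeline: the toy DataFrames, the `rank_cols` list and the variable names are made up for illustration, and it assumes the TF-IDF matrix is sparse, so it uses scipy.sparse.hstack rather than numpy.hstack. It also fits both the vectorizer and the scaler on the training rows only, which is one reasonable choice rather than the only one.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real train/test files.
train = pd.DataFrame({
    "WebsiteText": ["great recipes and cooking tips", "buy cheap pills now",
                    "in-depth cooking tutorials", "cheap pills cheap deals"],
    "AlexaRank": [1200, 950000, 3400, 780000],
    "GooglePageRank": [7, 1, 6, 2],
    "Label": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "WebsiteText": ["cooking tips for beginners", "cheap deals on pills"],
    "AlexaRank": [2500, 880000],
    "GooglePageRank": [6, 1],
})
rank_cols = ["AlexaRank", "GooglePageRank"]  # assumed column names

# Text features: learn the vocabulary on the training text, reuse it on test.
tfidf = TfidfVectorizer()
X_text_train = tfidf.fit_transform(train["WebsiteText"])
X_text_test = tfidf.transform(test["WebsiteText"])

# Rank features: fit the scaler on train, apply the same scaler to test.
scaler = StandardScaler()
X_rank_train = scaler.fit_transform(train[rank_cols].astype(float))
X_rank_test = scaler.transform(test[rank_cols].astype(float))

# Stack the sparse text features and the (now sparse-wrapped) rank features
# side by side; both parts must have the same number of rows.
X_train = hstack([X_text_train, csr_matrix(X_rank_train)]).tocsr()
X_test = hstack([X_text_test, csr_matrix(X_rank_test)]).tocsr()

# Fit the classifier and get the probability of the "good" class (label 1).
model = LogisticRegression()
model.fit(X_train, train["Label"])
proba = model.predict_proba(X_test)[:, 1]
print(proba)
```

Only the two rank columns go through StandardScaler here; the TF-IDF columns are left as they are, since centering a sparse matrix would make it dense.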