
How to print clusters of SVM in python


I want to classify the rows of a text column using an SVM. I can find plenty of material online that produces graphs or prints prediction accuracy, but I cannot find a way to print the cluster (the predicted class) that each row falls into. The example below explains what I am trying to do:

I have a dataframe to be used as the training dataset:

import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
        'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
        'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
        'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
        'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
        }

df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
print (df)

I want to predict whether each text row is about an 'Animal', a 'Thing', or 'Miscellenous'. The test data I want to pass is:

test_data = {'Serial': [1,2,3,4,5],
        'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
        'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
        }

df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])

The expected result is an additional column 'Classification' in the test dataframe with the values ['Animal','Miscellenous','Animal','Animal','Miscellenous'].


Solution

  • Here is the solution to your problem:

    # import tfidf-vectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    # import support vector classifier
    from sklearn.svm import SVC 
    import pandas as pd
    
    train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
            'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
            'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
            'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
            'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
            }
    
    train_df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
    print(train_df)  # display() is only defined in Jupyter/IPython; print works everywhere
    
    
    test_data = {'Serial': [1,2,3,4,5],
            'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
            'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
            }
    
    test_df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
    print(test_df)  # display() is only defined in Jupyter/IPython; print works everywhere
    
    
    # Extract the training text from the dataframe as a list of phrases
    training_data = train_df['Text'].tolist()
    
    # Extract the training labels from the dataframe as a list as well
    training_labels = train_df['classification'].tolist()
    
    # Extract the testing text from the dataframe as a list
    testing_data = test_df['Text'].tolist()
    
    # Get a tfidf vectorizer to process the text into vectors
    vectorizer = TfidfVectorizer()
    
    # Fit the tfidf-vectorizer to training data and transform the training text into vectors
    X_train = vectorizer.fit_transform(training_data)
    
    # Transform the testing text into vectors
    X_test = vectorizer.transform(testing_data)
    
    # Get the SVC classifier
    clf = SVC()
    
    # Train the SVC with the training data (data points and labels)
    clf.fit(X_train, training_labels)
    
    # Predict the test samples
    print(clf.predict(X_test))
    
    # Add classification results to test dataframe
    test_df['Classification'] = clf.predict(X_test)
    
    # Print the test dataframe (display() is only defined in Jupyter/IPython)
    print(test_df)
    

    As an explanation for the approach:

    You have training data and want to use it to train an SVM, then predict labels for the test data.

    That means you need to extract the text and the label for each training data point (so for each phrase, you need to know whether it is about an animal, a thing, etc.), and then set up and train an SVM. Here, I used the implementation from scikit-learn.

    Moreover, you can't train the SVM on raw text, because it requires numerical input. This means you need to transform the text into numbers. This is called "feature extraction from text", and one common approach is Term Frequency-Inverse Document Frequency (TF-IDF).
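
To make the feature-extraction step concrete, here is a minimal sketch of what the vectorizer produces. It uses three of the training phrases from the question; the variable names (`phrases`, `X`, `vocabulary`) are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three of the training phrases from the question
phrases = [
    'Dog is a faithful animal',
    'cat are not reliable',
    'pen is a powerful weapon',
]

# Learn the vocabulary and turn each phrase into a row of TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(phrases)  # sparse matrix, one row per phrase

# Each column of X corresponds to one word of the learned vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# Shape is (number of phrases, number of distinct words)
print(X.shape)

# Dense view of the weights, rounded for readability
print(X.toarray().round(2))
```

With the default settings the vectorizer lowercases the text and ignores single-character tokens such as "a", so this matrix has 3 rows and 11 columns. Words that never occur in a phrase get a weight of 0 in that row.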

    Now you can use the vector representation of each phrase, coupled with its label, to train the SVM and then use it to classify the test data :)

    In short the steps are:

    1. Extract data points and labels from training
    2. Extract data points from testing
    3. Set up SVM classifier
    4. Set up TF-IDF vectorizer and fit it to training data
    5. Transform training data and testing data with the TF-IDF vectorizer
    6. Train the SVM classifier
    7. Classify test data with trained classifier
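
As an optional variation, the vectorizer and the classifier can be chained with scikit-learn's `make_pipeline`, so that fitting and predicting each take a single call. This is only a sketch of the same steps, not a different method. The names `train_texts`, `train_labels` and `model` are mine, and the data is copied from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Training phrases and labels, copied from the question
train_texts = [
    'Dog is a faithful animal', 'cat are not reliable',
    'Tortoise can live a long life', 'camel stores water in its hump',
    'horse are used as means of transport', 'pen is a powerful weapon',
    'stop when the signal is red', 'oxygen is a life gas',
    'chocolates are bad for health', 'lets grab a cup of coffee',
]
train_labels = ['Animal', 'Animal', 'Animal', 'Animal', 'Animal',
                'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']

# Test dataframe, copied from the question
test_df = pd.DataFrame({
    'Serial': [1, 2, 3, 4, 5],
    'Text': ['Is this your dog?', 'Lets talk about the problem',
             'You have a cat eye',
             'Donot forget to take the camel ride when u goto dessert',
             'Plants give us O2'],
})

# Chain TF-IDF vectorization and the SVM into one estimator
model = make_pipeline(TfidfVectorizer(), SVC())

# fit() fits the vectorizer on the training text, then trains the SVC
model.fit(train_texts, train_labels)

# predict() transforms the test text with the fitted vectorizer, then classifies
test_df['Classification'] = model.predict(test_df['Text'])
print(test_df)
```

The pipeline guarantees that the test text is transformed with the vocabulary learned from the training text, which is the same thing the separate `fit_transform` / `transform` calls do above. With only ten training phrases, the predicted labels may not match the expected ones exactly.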

    I hope this helps!