machine-learning, supervised-learning, semisupervised-learning

Text classification for unlabeled data


I want to classify data into two classes based on the parameters given. My data consists of publications from two different sources, and when comparing dataset1 with dataset2 I want to classify each comparison as "match" or "non-match". The datasets are unlabeled text data with five attributes (id, title, authors, venue, year), so if I apply unsupervised algorithms, they will not produce my target classes. On the other hand, supervised algorithms need labeled data, which is unavailable and time-consuming to create.


Solution

  • AFAIK, the easiest method is as follows:

    1. Use a clustering algorithm such as K-Means to group your data points into 2 clusters.
    2. Manually examine a few samples from one of the clusters and label that cluster accordingly.

    For example, assume you randomly pick 10 data points from the first cluster and they all fall in the match class. You can then label every data point in this cluster as match and every data point in the other cluster as non-match.

    This would give you the required classification, provided the two clusters actually line up with the match and non-match classes.
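
A minimal sketch of the steps above, assuming each "data point" is a candidate pair (one record from each dataset) described by a few similarity scores. The toy records, the `pair_features` function, the choice of features and the hand-written `kmeans2` helper are all illustrative assumptions, not something given in the question; a library K-Means such as scikit-learn's `KMeans` could likely be used in place of `kmeans2`.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy records with the five attributes from the question:
# (id, title, authors, venue, year)
dataset1 = [
    ("a1", "Efficient query processing in databases", "J. Smith, A. Lee", "VLDB", 1999),
    ("a2", "Learning to match records", "M. Brown", "SIGMOD", 2003),
    ("a3", "Graph algorithms for networks", "K. Chen", "ICDE", 2001),
]
dataset2 = [
    ("b1", "Efficient query processing in databases.", "John Smith, Anna Lee",
     "Very Large Data Bases", 1999),
    ("b2", "Learning to match records", "Mary Brown", "SIGMOD Conference", 2003),
    ("b3", "A survey of image compression", "P. Gupta", "CVPR", 2010),
]


def sim(a, b):
    """String similarity in [0, 1] using the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(r1, r2):
    """Assumed features for one pair: title and author similarity, same year."""
    return (sim(r1[1], r2[1]), sim(r1[2], r2[2]), 1.0 if r1[4] == r2[4] else 0.0)


def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans2(points, max_iters=50):
    """Plain K-Means with k=2. The start is deterministic (least and most
    similar pair), an assumption made here only for reproducibility."""
    centroids = [min(points, key=sum), max(points, key=sum)]
    labels = None
    for _ in range(max_iters):
        new_labels = [min((0, 1), key=lambda k: dist2(p, centroids[k])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Step 1: build one data point per candidate pair and cluster into 2 groups.
pairs = list(product(dataset1, dataset2))
points = [pair_features(r1, r2) for r1, r2 in pairs]
labels = kmeans2(points)

# Step 2: print a few samples per cluster for manual inspection.
for k in (0, 1):
    sample = [(r1[1], r2[1]) for (r1, r2), lab in zip(pairs, labels) if lab == k][:3]
    print(f"cluster {k}:", sample)

# Hypothetical outcome of the manual review: cluster 1 holds the matches.
match_cluster = 1

# Step 3: propagate that decision to every pair in each cluster.
predictions = {
    (r1[0], r2[0]): "match" if lab == match_cluster else "non-match"
    for (r1, r2), lab in zip(pairs, labels)
}
```

The value of `match_cluster` is the only human input here: it stands for whatever you conclude after reading the printed samples.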