java, nlp, lingpipe

Cannot identify text in Spanish with LingPipe


A few days ago I started developing a Java server that stores a bunch of data and identifies its language, and I decided to use LingPipe for that task. But I am facing an issue: after training the classifier and evaluating it with two languages (English and Spanish), I found that it cannot identify Spanish text, although I got successful results with English and French.

The tutorial that I have followed in order to complete this task is: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

These are the steps I followed to train the language classifier:

~1. First, place and unpack the English and Spanish metadata inside a folder named leipzig, as follows (Note: the metadata and sentences are provided by http://wortschatz.uni-leipzig.de/en/download):

leipzig       //Main folder
   1M sentences             //Folder with data of the last trial 
     eng_news_2015_1M
     eng_news_2015_1M.tar.gz
     spa-hn_web_2015_1M
     spa-hn_web_2015_1M.tar.gz
   ClassifyLang.java                //Custom program to try the trained code
   dist                                        //Folder
     eng_news_2015_300K.tar.gz              //compressed English sentences (as downloaded)
     spa-hn_web_2015_300K.tar.gz            //compressed Spanish sentences (as downloaded)
   EvalLanguageId.java
   langid-leipzig.classifier            //trained classifier model
   lingpipe-4.1.2.jar
   munged                                      //Folder
     eng                    //folder containing the sentences.txt for english
        sentences.txt
     spa                    //folder containing the sentences.txt for spanish
        sentences.txt
   Munge.java
   TrainLanguageId.java
   unpacked                                    //Folder
     eng_news_2015_300K         //Folder with the english metadata 
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
     spa-hn_web_2015_300K                   //Folder with the spanish metadata 
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt

~2. Second, unpack the compressed language metadata into the unpacked folder:

unpacked                                    //Folder
    eng_news_2015_300K          //Folder with the english metadata 
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
    spa-hn_web_2015_300K                    //Folder with the spanish metadata 
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt

~3. Then munge the sentences of each corpus to remove the line numbers and tabs and to replace line breaks with single space characters. The output is written uniformly in the UTF-8 Unicode encoding (Note: Munge.java comes from the LingPipe tutorial). A minimal sketch of this transformation is shown after the folder listing below.

/-----------------Command line----------------------------------------------/

javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166

eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257

/---------------------------------------------------------------/

<---------------------------------Folder------------------------------------->
   munged                                      //Folder
    eng                     //folder containing the sentences.txt for english
        sentences.txt
    spa                 //folder containing the sentences.txt for spanish
        sentences.txt
<-------------------------------------------------------------------------->
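
For reference, the munging step amounts to something like the sketch below. This is not LingPipe's actual Munge.java (the tutorial provides that); the class name and argument handling are hypothetical, and it only illustrates the transformation described above: strip the leading line number and tab, collapse line breaks to single spaces, and re-encode from ISO-8859-1 to UTF-8.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical stand-in for the tutorial's Munge.java: strips the leading
// line number and tab from each Leipzig sentence, joins the sentences with
// single spaces, and re-encodes the text from ISO-8859-1 to UTF-8.
public class MungeSketch {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);   // e.g. unpacked/spa-hn_web_2015_300K/sentences.txt
        Path out = Paths.get(args[1]);  // e.g. munged/spa/spa.txt

        // The Munge log above reads the Leipzig files as ISO-8859-1.
        List<String> lines = Files.readAllLines(in, StandardCharsets.ISO_8859_1);

        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            // Drop the leading "<number><tab>" prefix and any remaining tabs.
            String sentence = line.replaceFirst("^\\d+\\t", "").replace('\t', ' ').trim();
            if (sentence.isEmpty())
                continue;
            if (sb.length() > 0)
                sb.append(' ');        // line breaks become single spaces
            sb.append(sentence);
        }

        if (out.getParent() != null)
            Files.createDirectories(out.getParent());
        Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println("total length=" + sb.length());
    }
}

It would be run once per language, e.g. java MungeSketch unpacked/spa-hn_web_2015_300K/sentences.txt munged/spa/spa.txt.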

~4. Next we train the language classifier (Note: TrainLanguageId.java comes from the LingPipe LanguageId tutorial). A sketch of what the training program does is shown after the results below.

/---------------Command line--------------------------------------------/

javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa

Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier

/----------------------------------------------------------------------------/
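
For context, the tutorial's training program boils down to roughly the following. This is a hedged sketch, not the tutorial's exact TrainLanguageId.java: the class name, the per-category file name (<category>.txt, as written by the Munge log above), and the nGram/numChars values are assumptions used only for illustration.

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

import java.io.File;
import java.io.IOException;

// Rough sketch of what the tutorial's trainer does: one category per
// sub-directory of the munged folder, one character n-gram language model
// per category, and a compiled classifier written to disk at the end.
public class TrainLanguageIdSketch {
    public static void main(String[] args) throws IOException {
        File dataDir = new File("/home/samuel/leipzig/munged");
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        int nGram = 5;         // character n-gram order (assumed value)
        int numChars = 100000; // training characters per category (assumed value)

        String[] categories = dataDir.list();
        DynamicLMClassifier<NGramProcessLM> classifier
            = DynamicLMClassifier.createNGramProcess(categories, nGram);

        for (String category : categories) {
            // File name as written by the Munge log above (eng.txt, spa.txt).
            File trainingFile = new File(new File(dataDir, category), category + ".txt");
            String text = Files.readFromFile(trainingFile, "UTF-8");
            text = text.substring(0, Math.min(numChars, text.length()));

            Classified<CharSequence> classified
                = new Classified<CharSequence>(text, new Classification(category));
            classifier.handle(classified);
        }

        AbstractExternalizable.compileTo(classifier, modelFile);
        System.out.println("Compiled model to " + modelFile);
    }
}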

~5. We evaluated the trained classifier with the following results, which show a problem in the confusion matrix (Note: EvalLanguageId.java comes from the LingPipe LanguageId tutorial). A simple sanity check independent of the evaluator is sketched after the results.

/------------------------Command line---------------------------------/

javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------

Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
  ,eng,spa
  eng,1000,0                                <---------- not diagonal sampling
  spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
         the following symmetries are expected:
           TP=TN, FN=FP
           PosRef=PosResp=NegRef=NegResp
           Acc=Prec=Rec=F
  Total=4000
  True Positive=1000
  False Negative=1000
  False Positive=1000
  True Negative=1000
  Positive Reference=2000
  Positive Response=2000
  Negative Reference=2000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.5
  Precision=0.5
  Rejection Recall=0.5
  Rejection Precision=0.5
  F(1)=0.5
  Fowlkes-Mallows=2000.0
  Jaccard Coefficient=0.3333333333333333
  Yule's Q=0.0
  Yule's Y=0.0
  Reference Likelihood=0.5
  Response Likelihood=0.5
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.5
  kappa=0.0
  kappa Unbiased=0.0
  kappa No Prevalence=0.0
  chi Squared=0.0
  phi Squared=0.0
  Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN

ONE VERSUS ALL EVALUATIONS BY CATEGORY


CATEGORY[0]=eng VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=1000
  False Negative=0
  False Positive=1000
  True Negative=0
  Positive Reference=1000
  Positive Response=2000
  Negative Reference=1000
  Negative Response=0
  Accuracy=0.5
  Recall=1.0
  Precision=0.5
  Rejection Recall=0.0
  Rejection Precision=NaN
  F(1)=0.6666666666666666
  Fowlkes-Mallows=1414.2135623730949
  Jaccard Coefficient=0.5
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=1.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949


CATEGORY[1]=spa VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=0
  False Negative=1000
  False Positive=0
  True Negative=1000
  Positive Reference=1000
  Positive Response=0
  Negative Reference=1000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.0
  Precision=NaN
  Rejection Recall=1.0
  Rejection Precision=0.5
  F(1)=NaN
  Fowlkes-Mallows=NaN
  Jaccard Coefficient=0.0
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=0.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949

/-----------------------------------------------------------------------/
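
As a quick sanity check independent of the tutorial's evaluator, the compiled model can be reloaded and asked to classify a few held-out slices of each munged file by hand. The sketch below only relies on classify() and bestCategory(), which the code in step 6 already uses; the paths, file names and slice sizes are assumptions.

import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

import java.io.File;

// Hand-rolled check: count how often each category's own held-out text is
// classified as itself. Paths, file names and the 1000-character slice size
// are assumptions; the munged files are large enough for ten such slices.
public class QuickLangIdCheck {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        File dataDir = new File("/home/samuel/leipzig/munged");

        LMClassifier classifier
            = (LMClassifier) AbstractExternalizable.readObject(modelFile);

        for (String category : dataDir.list()) {
            File f = new File(new File(dataDir, category), category + ".txt");
            String text = Files.readFromFile(f, "UTF-8");

            int correct = 0;
            int total = 10;
            // Slices taken from the end of the file, which (depending on
            // numChars) the training step may not have seen.
            for (int i = 0; i < total; ++i) {
                int end = text.length() - i * 1000;
                String slice = text.substring(end - 1000, end);
                if (category.equals(classifier.classify(slice).bestCategory()))
                    ++correct;
            }
            System.out.println(category + ": " + correct + "/" + total);
        }
    }
}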

~6. Then we tried a real evaluation with Spanish text:

/-------------------Command line----------------------------------/

javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang

/-------------------------------------------------------------------------/

<---------------------------------Result------------------------------------>
Text: Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión.
Best Language: eng     <------------- Wrong Result

<----------------------------------------------------------------------->

Code for ClassifyLang.java:

import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;

import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;

public class ClassifyLang {

    public static String text = "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
            + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
            + " y de ahora en adelante estaré enfocado y optimista"
            + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo.";

    private static final File MODEL_FILE
            = new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args)
            throws ClassNotFoundException, IOException {

        System.out.println("Text: " + text);

        LMClassifier classifier = null;
        try {
            classifier = (LMClassifier) AbstractExternalizable.readObject(MODEL_FILE);
        } catch (IOException | ClassNotFoundException ex) {
            System.out.println("Problem with the Model");
            return; // bail out instead of hitting a NullPointerException below
        }

        Classification classification = classifier.classify(text);
        String bestCategory = classification.bestCategory();
        System.out.println("Best Language: " + bestCategory);
    }
}
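
To understand why the Spanish text comes back as eng, it can help to print the whole ranked classification instead of only the best category. The helper below is hypothetical; it assumes, as in the LingPipe tutorial, that the compiled model returns a scored classification whose per-category scores can be listed.

import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.classify.ScoredClassification;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;

// Hypothetical helper: prints the full ranked classification instead of only
// the best category, so the eng and spa scores can be compared directly.
public class ShowLangScores {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        String text = "Yo soy una persona increíble y muy inteligente.";

        LMClassifier classifier = (LMClassifier) AbstractExternalizable.readObject(modelFile);
        Classification c = classifier.classify(text);

        System.out.println(c); // toString() lists the ranked categories
        if (c instanceof ScoredClassification) {
            ScoredClassification sc = (ScoredClassification) c;
            for (int rank = 0; rank < sc.size(); ++rank)
                System.out.println(rank + " " + sc.category(rank) + " score=" + sc.score(rank));
        }
    }
}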

~7. I tried with the 1 million sentence files, but I got the same result, and changing the n-gram number also made no difference. I would be very thankful for your help.


Solution

  • Well, after days of working on natural language processing I found a way to determine the language of a text using OpenNLP. Here is the sample code: https://github.com/samuelchapas/languagePredictionOpenNLP/tree/master/TrainingLanguageDecOpenNLP

    and here is the training corpus for the model created to make language predictions.

    I decided to use OpenNLP for the issue described in this question; this library really has a complete stack of functionalities. Here is the sample for model training (a minimal sketch of the prediction side follows below):

    https://mega.nz/#F!HHYHGJ4Q!PY2qfbZr-e0w8tg3cUgAXg
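
    For illustration, here is a minimal sketch of the prediction side with OpenNLP's language detector (1.8.x or later). The model path is an assumption: either a model trained with the linked sample and corpus, or OpenNLP's pre-trained langdetect model.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;

    // Minimal sketch: load a langdetect model and predict the language of a string.
    // The model file name is an assumption (a model trained as in the linked
    // sample, or OpenNLP's pre-trained langdetect model).
    public class DetectLanguage {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("langdetect.bin")) {
                LanguageDetectorModel model = new LanguageDetectorModel(in);
                LanguageDetectorME detector = new LanguageDetectorME(model);

                Language best = detector.predictLanguage(
                        "Yo soy una persona increíble y muy inteligente.");
                System.out.println(best.getLang() + " " + best.getConfidence());
            }
        }
    }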