[SOLVED] German Novel with DkPro

I tried out German Novel with DKPro. My sample input file is an XHTML file. How can I get the POS tagger output with offsets that refer to the original XHTML?

Script:

 PACKAGE com.github.uima.ruta.novel;
 ENGINE utils.HtmlAnnotator;
 ENGINE utils.HtmlConverter;
 ENGINE utils.ViewWriter;
 TYPESYSTEM utils.HtmlTypeSystem;
 TYPESYSTEM utils.TypeSystem;

 IMPORT PACKAGE de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos FROM desc.type.POS;
 IMPORT de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma FROM desc.type.LexicalUnits;

 UIMAFIT org.dkpro.core.opennlp.OpenNlpSegmenter;
 UIMAFIT org.dkpro.core.stanfordnlp.StanfordPosTagger;

 CONFIGURE(HtmlAnnotator, "onlyContent" = false);
 Document{-> EXEC(HtmlAnnotator)};
 Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView","outputView" = "plain"),
 EXEC(HtmlConverter,{TAG})};

 "<\\?xml version=\"1.0\" encoding=\"UTF-8\"\\?>"->MARKUP;
 uima.tcas.DocumentAnnotation{-CONTAINS(POS)} -> {
 uima.tcas.DocumentAnnotation{-> SETFEATURE("language", "de")};
 EXEC(OpenNlpSegmenter);
 EXEC(StanfordPosTagger, {POS});
 };

Sample Input

 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head xmlns="http://www.w3.org/1999/xhtml"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><style></style><title></title></head><link xmlns="http://www.w3.org/1999/xhtml" src="./ckeditor.css" /><body xmlns="http://www.w3.org/1999/xhtml"><div class="WordSection1"><p class="Normal" data-name="Normal"><span data-bkmark="para10000"></span><span style="font-size:9pt">Der Idiot</span><span data-bkmark="para10000"></span></p>
 <p class="Normal" data-name="Normal"><span data-bkmark="para10001"></span><span style="font-size:9pt">Ein Roman in vier Teilen.</span><span data-bkmark="para10001"></span></p>
 </div>
 <hr align="left" size="1" width="33%" /></body>
 </html>

In the sample script, the whole uima.tcas.DocumentAnnotation is passed to the POS tagger. The MARKUP inside this annotation reduces the tagging accuracy. What do I need to do to get accurate results?

Solution

The HtmlAnnotator can be used to hide additional MARKUP so that rules are not affected by it. The HtmlConverter can create a new document text without HTML/XML markup, but only in a new CAS view, because the initial text of a CAS is static and cannot be changed. The EXEC action applies an external analysis engine to the current CAS object, and it can be configured to run on a different CAS view. However, the external analysis engine is always applied to the complete CAS, including the markup. No new CAS is created on the fly.

There are several options.

You could apply the POS tagger to the ‘plain’ view, but you cannot access these annotations with rules, because they will be present in a different view.
You could set up a multi-view setting, e.g., as a two-stage process: first convert the text to plain text without markup, and then apply the POS tagger to the new text.
Depending on the external analysis engine, you may also be able to solve this by redefining what a token is.
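
Whichever option is used, annotations created on markup-free text carry offsets into that text, not into the XHTML, so getting back to the "XHTML index" needs some mapping between the two. The following is only a language-neutral Python sketch of that idea. It is not UIMA, Ruta, or DKPro API: strip_markup and to_source_span are hypothetical helpers made up for this illustration, and the regex-based tag stripping is a simplifying assumption that ignores entities, comments, and CDATA sections.

```python
import re

# Assumption: a tag is anything between "<" and the next ">".
# This also covers the "<?xml ... ?>" declaration, but it is not a real parser.
TAG = re.compile(r"<[^>]*>")


def strip_markup(xhtml):
    """Return (plain, offsets), where offsets[i] is the index in xhtml
    of the character plain[i]."""
    plain_chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(xhtml):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            plain_chars.append(xhtml[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(xhtml)):
        plain_chars.append(xhtml[i])
        offsets.append(i)
    return "".join(plain_chars), offsets


def to_source_span(offsets, begin, end):
    """Map a non-empty half-open span [begin, end) in the plain text
    to a half-open span in the original XHTML."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the plain text")
    # If the span crosses a tag, the returned source span includes that tag.
    return offsets[begin], offsets[end - 1] + 1


if __name__ == "__main__":
    xhtml = '<p><span style="font-size:9pt">Der Idiot</span></p>'
    plain, offsets = strip_markup(xhtml)
    # Hand-written token spans standing in for tagger output on the plain text.
    for begin, end in [(0, 3), (4, 9)]:
        src_begin, src_end = to_source_span(offsets, begin, end)
        print(plain[begin:end], (begin, end), "->", (src_begin, src_end))
```

In a real pipeline the tokens and POS tags would come from the segmenter and tagger running on the plain view, and the offset table would have to be built by, or alongside, whatever component produces that view.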