I am writing a plagiarism detection project in Java. For the first step, I need to do the following tasks:
Input a file (.txt, .pdf, .doc)
Convert the file content to text
Remove stop words and tokenize into n-grams
Run text-similarity algorithms on the texts
Report signs of plagiarism
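For reference, the stop-word removal, n-gram tokenization, and similarity steps above can be sketched with the plain JDK. This is only a minimal sketch under assumptions: the class name `NGramSimilarity`, the tiny stop-word list, and the choice of Jaccard similarity over word n-grams are all illustrative, not taken from any particular library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NGramSimilarity {

    // Tiny illustrative stop-word list (an assumption); a real project would load a fuller one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "are", "of", "to", "in", "and", "on"));

    /** Lower-cases the text, splits on anything that is not a letter or digit, and drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds the set of word n-grams, each joined with a single space. */
    static Set<String> ngrams(List<String> tokens, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    /** Jaccard similarity: intersection size over union size; defined here as 0 when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> first = ngrams(tokenize("The cat sat on the mat"), 2);
        Set<String> second = ngrams(tokenize("A cat sat on a sofa"), 2);
        System.out.println(jaccard(first, second));
    }
}
```

Using sets of n-grams keeps the comparison cheap, at the cost of ignoring how often each n-gram occurs.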
I implemented these steps myself, but the performance is poor, so I have started using existing APIs for this work. Has anyone worked with the ws4j library? Are there any docs or help available for it? I couldn't figure out how to reuse it. It is exactly what I want; look at the demo.
Apart from what you can see on the website, there is no documentation that I could find. I suggest you start by looking at the code (use SVN or git to check it out). Please note that you'll need the binary distribution, because the source is not complete.
The simple tutorial works for most cases. You've probably already found it in the source code:
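RelatednessCalculator rc = new WuPalmer(new NictWordNet()); // any other WS4J measure works the same way
double s = rc.calcRelatednessOfWords("jump", "stand");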
If you want to compare specific synsets, you'll have to create a Concept first. Example for the most common sense of "jump":
String word = "jump";
List<Synset> synsets = WordNetUtil.wordToSynsets(word, POS.v);
Synset mysynset = synsets.get(0);
Concept co = new Concept(mysynset.getSynset(), POS.v, mysynset.getName(), mysynset.getSrc());
The library doesn't actually work like the online demo. To use the typical notation for synsets, I use my own utility method. So comparing the specific synsets looks like this:
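// getSynset already returns a Concept, so no extra constructor call is needed
Concept stand = getSynset("stand#v#1");
Concept jump = getSynset("jump#v#1");
// comparer is any RelatednessCalculator, e.g. the rc used above
double score = compare(comparer, jump, stand);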
// done!
// utility
private static double compare(RelatednessCalculator comparer, Concept one,
        Concept other) throws Exception {
    Relatedness res = comparer.calcRelatednessOfSynset(one, other);
    // WS4J reports failures through the result object rather than by throwing
    if (StringUtils.isNotBlank(res.getError())) {
        throw new Exception("WordNet similarity for " + one + " and " + other
                + " failed with this error: " + res.getError() + "\n" + res.getTrace());
    }
    return res.getScore();
}
/**
 * Looks up a specific word sense and wraps it as a WS4J Concept.
 *
 * @param wordnetword a string of the format lemma#pos#num, e.g. jump#v#1 or house#n#2
 * @return the Concept for that sense, usable with WS4J
 */
private static Concept getSynset(String wordnetword) {
    String[] parts = StringUtils.split(wordnetword, "#");
    String lemma = parts[0];
    POS mypos = POS.valueOf(parts[1]);
    // sense numbers in the notation are 1-based, list indices are 0-based
    int index = Integer.parseInt(parts[2]) - 1;
    List<Synset> synsets = WordNetUtil.wordToSynsets(lemma, mypos);
    Synset synset = synsets.get(index);
    String synstring = synset.getSynset();
    return new Concept(synstring, mypos, lemma, synset.getSrc());
}
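
Note that the utility above trusts its input: a malformed string such as "jump#v" or "jump#v#0" fails with an unhelpful array or index exception. If that matters, the notation can be validated before WS4J is called. The following is only a rough sketch using the plain JDK; the class `SenseKey` is a hypothetical helper of my own, not part of WS4J, and it assumes the four WordNet part-of-speech tags n, v, a and r.

```java
class SenseKey {
    final String lemma;
    final char pos;   // one of n, v, a, r (assumed tag set)
    final int index;  // zero-based sense index, ready for List.get

    SenseKey(String lemma, char pos, int index) {
        this.lemma = lemma;
        this.pos = pos;
        this.index = index;
    }

    /** Parses lemma#pos#num (e.g. jump#v#1) and rejects malformed input with a clear message. */
    static SenseKey parse(String s) {
        if (s == null) {
            throw new IllegalArgumentException("sense key must not be null");
        }
        // limit -1 keeps trailing empty parts, so "jump#v#1#" is rejected instead of silently accepted
        String[] parts = s.split("#", -1);
        if (parts.length != 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("expected lemma#pos#num but got: " + s);
        }
        if (parts[1].length() != 1 || "nvar".indexOf(parts[1].charAt(0)) < 0) {
            throw new IllegalArgumentException("unknown part of speech in: " + s);
        }
        int num;
        try {
            num = Integer.parseInt(parts[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("sense number is not an integer in: " + s, e);
        }
        if (num < 1) {
            throw new IllegalArgumentException("sense numbers start at 1: " + s);
        }
        return new SenseKey(parts[0], parts[1].charAt(0), num - 1);
    }

    public static void main(String[] args) {
        SenseKey key = SenseKey.parse("house#n#2");
        System.out.println(key.lemma + " " + key.pos + " " + key.index);
    }
}
```

You would still need to check the parsed index against the size of the synset list that WS4J returns, since a word may have fewer senses than the number asked for.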