I want to count the number of words in a .fdt/.fdx/.fdxt file.
I converted the .fdxt file to .html and then parsed it. It was successful in some cases but not all.
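import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // Read the converted HTML, keeping line breaks so that words on
        // adjacent lines are not glued together.
        StringBuilder sb = new StringBuilder();
        try (Scanner sc = new Scanner(new File("/home/de-10/Desktop/1.html"))) {
            while (sc.hasNextLine()) {
                sb.append(sc.nextLine()).append('\n');
            }
        }
        String html = sb.toString();
        System.out.println(html);

        // Strip the markup and keep only the visible text.
        Document doc = Jsoup.parse(html);
        String data = doc.text();
        System.out.println(data);

        // Count whitespace-separated tokens.
        int wordCount = 0;
        try (Scanner sc1 = new Scanner(data)) {
            while (sc1.hasNext()) {
                sc1.next();
                wordCount++;
            }
        }
        System.out.println();
        System.out.println("**********");
        System.out.println("WordCount: " + wordCount);
        System.out.println("**********");
        System.out.println();
    }
}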
I'm looking for an optimal solution.
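You said, "It was successful in some cases but not all". A whitespace-based scanner counts a standalone punctuation mark (for example the "-" in "HOUSE - DAY") as a word, so I suggest removing the punctuation from the text before counting.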
int wordCount = Jsoup.parse(html).text().replaceAll("\\p{Punct}", "").trim().split("\\s+").length;
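The split-based one-liner still reports 1 for empty text, and deleting punctuation glues together words separated only by a comma. As a rough sketch of an alternative, the helper below counts regex matches instead of splitting. The class name `WordCounter`, the method `countWords`, and the definition of a "word" (a run of letters or digits, optionally joined by apostrophes or hyphens) are assumptions for illustration, not something from the question. It takes the plain text you already get from `Jsoup.parse(html).text()`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCounter {
    // Assumed definition of a word: letters/digits, optionally joined by
    // an apostrophe or hyphen, e.g. "don't" or "well-known".
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*");

    // Counts the words in already-extracted plain text.
    public static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The lone "-" and the trailing "." are not counted.
        System.out.println(countWords("INT. HOUSE - DAY")); // prints 3
    }
}
```

With Jsoup on the classpath, the call would look like `int wordCount = WordCounter.countWords(Jsoup.parse(html).text());`. If your scripts need a different notion of a word, adjust the pattern accordingly.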