I need to remove punctuation while reading a file, keeping accented characters. I tried the code below, but it doesn't work the way I want.
Expectation: input=> ’'qwe..,rty ‘èeéò’“ ”o" "à output=> qwertyèeéòoà
Actual result: input=> ’'qwe..,rty ‘èeéò’“ ”o" "à output=>’qwerty ‘èeéò’“ ”o" "à
I can't remove the ’“” symbols and others like them.
Note: Eclipse and filetext.txt are both set to UTF-8.
Thank you
import java.io.*;
import java.util.Scanner;

public class DataCounterMain {
    public static void main(String[] args) throws FileNotFoundException {
        File file = new File("filetext.txt");
        try {
            Scanner filescanner = new Scanner(file);
            while (filescanner.hasNextLine()) {
                String line = filescanner.nextLine();
                line = line.replaceAll("\\p{Punct}", "");
                System.out.println(line);
            }
        }
        catch (FileNotFoundException e) {
            System.err.println(file + " FileNotFound");
        }
    }
}
By default, the regex \p{Punct} matches only US-ASCII punctuation; it matches Unicode punctuation only if you enable Unicode character classes (the UNICODE_CHARACTER_CLASS flag). As written, your code therefore removes only these characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
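To see the difference, here is a minimal sketch comparing the plain class with the same class under the embedded (?U) flag, which turns on UNICODE_CHARACTER_CLASS. The class name PunctDemo, the two helper methods, and the sample string are made up for illustration; the non-ASCII characters are written as \u escapes so the source compiles the same way regardless of file encoding.

```java
public class PunctDemo {
    // Plain POSIX class: matches only the ASCII punctuation listed above.
    static String stripAsciiPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // Same class with the embedded (?U) flag (UNICODE_CHARACTER_CLASS),
    // which makes \p{Punct} follow the Unicode Punctuation property.
    static String stripUnicodePunct(String s) {
        return s.replaceAll("(?U)\\p{Punct}", "");
    }

    public static void main(String[] args) {
        // Sample: curly single quotes around a.b, a comma, curly double quotes around c, then !
        String sample = "\u2018a.b\u2019, \u201Cc\u201D!";
        System.out.println(stripAsciiPunct(sample));   // curly quotes survive
        System.out.println(stripUnicodePunct(sample)); // curly quotes are removed too
    }
}
```

With the flag set, the curly quotes go away along with the ASCII dot, comma, and exclamation mark; without it, only the ASCII characters are stripped.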
If you want to match everything the Unicode Consortium classifies as punctuation, try \p{IsPunctuation} instead, which always checks Unicode character properties and matches all the punctuation in your example (and more!).
To remove whitespace as well as punctuation, as in your example, you would use:
line = line.replaceAll("\\p{IsPunctuation}|\\p{IsWhite_Space}", "");
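Putting it together, the whole program might look roughly like the sketch below. The class name PunctuationStripper and the clean helper are hypothetical names, not part of your code. It also assumes you want to pass "UTF-8" to the Scanner explicitly rather than rely on the platform default charset, since you say the file is UTF-8.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class PunctuationStripper {
    // Compiled once and reused for every line: any run of Unicode
    // punctuation or Unicode whitespace characters.
    private static final Pattern PUNCT_OR_SPACE =
            Pattern.compile("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");

    // Hypothetical helper: returns the line with punctuation and whitespace removed.
    static String clean(String line) {
        return PUNCT_OR_SPACE.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        File file = new File("filetext.txt");
        // Assumption: the file is UTF-8, so the charset is named explicitly.
        // try-with-resources closes the Scanner when the block exits.
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            while (scanner.hasNextLine()) {
                System.out.println(clean(scanner.nextLine()));
            }
        } catch (FileNotFoundException e) {
            System.err.println(file + " not found");
        }
    }
}
```

One thing to keep in mind: characters such as $, +, and ~ are classified by Unicode as symbols, not punctuation, so this pattern leaves them in place even though the ASCII-only \p{Punct} removed them. If you need those gone too, adding \p{S} to the character class should cover them.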