java regex string

Removing Multiple Occurrences of a Word


How can I remove repeated occurrences of words in a String? The hard part is that I don't know in advance which words are repeated. See the example below.

This is how how I tried to split a paragraph into a sentence sentence But, there is a problem My paragraph includes dates dates dates dates like Jan 13, 2014 , words includes like U S and numbers

Here, some words occur more than once: sentence, dates, includes and how. Note that the repeats are not always next to each other, as with includes. I want to remove the extra occurrences so the text looks like this:

This is how I tried to split a paragraph into a sentence But, there is a problem My paragraph includes dates like Jan 13, 2014 , words like U S and numbers

Note that removing repeated occurrences does not mean removing every occurrence of a repeated word. One copy is kept and the rest are removed.

As in the example above, I will have very large Strings and no idea which words occur more than once. How can I do this?


Solution

  • Copy the text one word at a time, skipping any word you have already copied. Use a HashSet to keep track of the words seen so far.

    Something like this...

    // Requires: import java.util.HashSet; import java.util.Set;
    String text = "This is how how I tried to split a paragraph into a sentence sentence But, there is a problem My paragraph includes dates dates dates dates like Jan 13, 2014 , words includes like U S and numbers";
    StringBuilder result = new StringBuilder();
    Set<String> seen = new HashSet<>(); // words already copied to the result
    for (String s : text.split(" ")) {
        // add() returns false when the word is already in the set
        if (seen.add(s)) {
            result.append(s).append(" ");
        }
    }
    System.out.println(result.toString().trim()); // trim() drops the trailing space
    

    You'll have to touch it up a little to handle the punctuation properly, but that should get you started.
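
One possible way to do that touch-up is sketched below. It rests on an assumption the question does not state: that two tokens count as the same word when they match after stripping leading and trailing punctuation and ignoring case, so "But," and "but" would be treated as duplicates. The class name DedupeWords and the method removeRepeatedWords are made up for this illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class DedupeWords {
    // Hypothetical helper: keeps the first occurrence of each word.
    // Assumption: words are compared case-insensitively and without
    // leading/trailing punctuation.
    static String removeRepeatedWords(String text) {
        Set<String> seen = new HashSet<>();
        StringJoiner result = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            // Comparison key: strip non-letter/non-digit characters from both ends,
            // then lowercase.
            String key = token
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            // Tokens made only of punctuation have an empty key and are always kept.
            // Set.add() returns true only the first time a key is seen.
            if (key.isEmpty() || seen.add(key)) {
                result.add(token);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatedWords("Stop, stop! The the cat sat , sat."));
        // prints: Stop, The cat sat ,
    }
}
```

Two caveats with this sketch. Dropping a repeated token also drops any punctuation attached to it, as with the final "sat." above. It also removes every later repeat, including short common words: in the question's sample text the second "a" and "is" would go as well, even though the desired output keeps them, so some kind of exclusion list would be needed if those should survive.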