java twitter text filtering dirty-data

How to clean dirty text using Java


I am collecting data from Twitter and processing it, but I have a problem: the text is dirty.

Example 1:

String dirtyText="this*is#a*&very_dirty&String";

Example 2:

String dirtyText="All f dis happnd bcause u gave ur time, talent n passion.";

Please, I want the solution to be as simple as possible.


Solution

  • This is not an easy problem to solve. "All f dis happnd" could be "cleaned" to produce either "All *of* this happened" or "All *if* this happened". For the first example, you can simply replace all non-alphabetic characters with spaces. See this question for how to do that.

    Otherwise, I think you would need a natural language processor, or at the very least a spell checker. Guessing what a tweet should say in correct English is an extremely complex problem. Take a look at Jazzy, an open source spell checker.
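
    For the first example, the character replacement could be sketched roughly as follows. This is only a minimal illustration, not a complete cleaner: the class name TextCleaner and the method clean are made up for this sketch, and it assumes that only ASCII letters count as "alphabetic". It also collapses each run of symbols into a single space rather than one space per character, which is a choice you may not want.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from any library.
final class TextCleaner {

    // Matches any run of one or more characters that are not ASCII letters.
    // Assumption: accented or non-Latin letters are treated as "dirty" too.
    private static final Pattern NON_LETTERS = Pattern.compile("[^A-Za-z]+");

    // Replaces every run of non-letter characters with a single space
    // and strips leading and trailing whitespace from the result.
    static String clean(String dirty) {
        return NON_LETTERS.matcher(dirty).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String dirtyText = "this*is#a*&very_dirty&String";
        System.out.println(clean(dirtyText)); // prints "this is a very dirty String"
    }
}
```

    Note that this does nothing useful for the second example: "f", "dis" and "happnd" are already made of letters, so they pass through unchanged. That part still needs a spell checker or a language-aware tool, as described above.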